British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library. All work contributed to this book set is original material. The views expressed in this book are those of the authors, but not necessarily of the publisher. If a library purchased a print copy of this publication, please go to http://www.igi-global.com/agreement for information on activating the library's complimentary electronic access to this publication.
Editorial Advisory Board
Troels Andreasen Roskilde University, Denmark
János Abonyi Pannon University, Hungary
Isabelle Bloch Ecole Nationale Supérieure des Télécommunications, France
Janusz Kacprzyk Polish Academy of Sciences, Poland
Patrick Bosc IRISA/ENSSAT Technopole Anticipa, France Rita de Caluwe Ghent University, Belgium Guy de Tré Ghent University, Belgium Didier Dubois Université Paul Sabatier, France
Juan Miguel Medina Universidad de Granada, Spain Witold Pedrycz University of Alberta, Canada Olga Pons Universidad de Granada, Spain Henri Prade Université Paul Sabatier, France Ronald R. Yager Iona College, USA
Program Committee
M. Carmen Aranda University of Málaga, Spain
Gloria Bordogna Consiglio Nazionale delle Ricerche, Italy
Sofiane Achiche Danmarks Tekniske Universitet, Denmark
Patrice Buche Institut National de la Recherche Agronomique, France
Sergio Alonso Universidad de Granada, Spain
Francisco Araque Universidad de Granada, Spain
Wai-Ho Au University of Medfordshire, UK
Mehmet Emin Aydin University of Bedfordshire, UK
Carlos D. Barranco Universidad Pablo de Olavide, Spain
Rafael Bello Universidad Central de Las Villas, Cuba
Radim Belohlavek Binghamton University, USA
Malcolm J. Beynon Cardiff University, UK
Arijit Bhattacharya The Patent Office in West Bengal, India
Ignacio José Blanco Universidad de Granada, Spain
Henrik Bulskov Roskilde Universitetscenter, Denmark
Jesús Campaña Universidad de Granada, Spain
Ramón Alberto Carrasco Universidad de Granada, Spain
Jesús Chamorro Universidad de Granada, Spain
Yan Chen Louisiana State University, USA
Jianhua Chen Louisiana State University, USA
Juan Carlos Cubero Universidad de Granada, Spain
Miguel Delgado Universidad de Granada, Spain
Jesús María Doña Universidad de Málaga, Spain
Luminita Dumitriu Universitatea “Dunărea de Jos” din Galaţi, Romania
Manuel Enciso Universidad de Málaga, Spain
Céline Fiot Université Montpellier 2, France
Joaquín Fernández-Valdivia Universidad de Granada, Spain
Adem Göleç Erciyes Üniversitesi, Turkey
Antonio González Universidad de Granada, Spain
Claudia González Universidad Simón Bolívar, Venezuela
Allel Hadjali Université de Rennes 1, France
Leoncio Jiménez Universidad Católica del Maule, Chile
Stanislav Krajči Univerzita Pavla Jozefa Šafárika, Slovakia
Cemallettin Kubat Sakarya Üniversitesi, Turkey
Sid Kulkarni University of Ballarat, Australia
Hongbo Liu Dalian Maritime University, China
Manuel Lozano Universidad de Granada, Spain
Gabriel Jesús Luque Universidad de Málaga, Spain
Carlos Mantas Universidad de Granada, Spain
María José Martín Universidad de Granada, Spain
Carlos Morell Universidad Central de Las Villas, Cuba
Juan Moreno Universidad de Castilla-La Mancha, Spain
José Ángel Olivas Universidad de Castilla-La Mancha, Spain
Carlos Ortiz Universidad de Navarra, Spain
Jan Outrata Univerzita Palackého v Olomouci, Czech Republic
Olivier Pivert Université de Rennes 1, France
Giuseppe Psaila Università degli studi di Bergamo, Italy
Guillaume Raschia Université de Nantes, France
José María Rodríguez Universidad de Cádiz, Spain
Graham H. Rong Massachusetts Institute of Technology, USA
Daniel Sánchez Universidad de Granada, Spain
Cristián R. Sepúlveda ACL Aplicaciones Computacionales Ltda., Chile
José María Serrano Universidad de Jaén, Spain
Srđan Škrbić Univerzitet u Novom Sadu, Serbia
Aleksandar Takači Univerzitet u Novom Sadu, Serbia
Harun Taşkin Sakarya Üniversitesi, Turkey
Marcela Varas Universidad de Concepción, Chile
Oliver Thomas Universität des Saarlandes, Germany
Pandian Vasant Universiti Teknologi Petronas, Malaysia
Rallou Thomopoulos Institut National de la Recherche Agronomique, France
María Amparo Vila Universidad de Granada, Spain
Leonid Tineo Universidad Simón Bolívar, Venezuela
Cornelia Tudorie Universitatea “Dunărea de Jos” din Galaţi, Romania
Safiye Turgay Abant İzzet Baysal Üniversitesi, Turkey
Hamid Haidarian Shahri University of Maryland, USA
Awadhesh Kumar Sharma Madan Mohan Malviya Engineering College, India
Angélica Urrutia Universidad Católica del Maule, Chile
Dao Van Tuyet Vietnamese Academy of Science and Technology, Vietnam
W. Amenel Voglozin Université de Nantes, France
Peter Vojtáš Univerzita Karlova v Praze, Czech Republic
Vilem Vychodil Univerzita Palackého v Olomouci, Czech Republic
Shyue Liang Wang New York Institute of Technology, USA
Yi Wang Cardiff University, UK
Nicolas Werro Université de Fribourg, Switzerland
Geraldo Xexéo Universidade Federal do Rio de Janeiro, Brazil
Qi Yang University of Wisconsin at Platteville, USA
List of Contributors
Abonyi, Janos / University of Pannonia, Hungary ............ 55
Andreasen, Troels / Roskilde University, Denmark ............ 325
Araque, F. / Universidad de Granada, Spain ............ 563
Au, Wai-Ho / Microsoft Corporation, USA ............ 685
Barranco, Carlos D. / Pablo de Olavide University, Spain ............ 435
Belohlavek, Radim / Binghamton University–SUNY, USA and Palacky University, Czech Republic ............ 462, 634
Ben Hassine, Mohamed Ali / Tunis El Manar University, Tunisia ............ 351
Beynon, Malcolm J. / Cardiff University, UK ............ 760, 784
Blot, Jean-Yves / Portugal Institute of Archaeology, Portugal ............ 516
Bordogna, Gloria / CNR IDPA, Italy ............ 191
Bosc, P. / IRISA-ENSSAT, Université de Rennes 1, France ............ 143
Braga, André / IBM Brazil, Brazil ............ 381
Buche, Patrice / INRA, France ............ 299
Bulskov, Henrik / Roskilde University, Denmark ............ 325
Callens, Bert / Ghent University, Belgium ............ 167
Campaña, Jesús R. / University of Granada, Spain ............ 435
Carrasco, R. A. / Universidad de Granada, Spain ............ 563
Chen, Jianhua / Louisiana State University, USA ............ 538
Chen, Yan / Louisiana State University, USA ............ 538
Coelho, João / Portugal Institute of Archaeology, Portugal ............ 516
de Tré, Guy / Ghent University, Belgium ............ 34, 167
de Caluwe, Rita / Ghent University, Belgium ............ 34
Demoor, Marysa / Ghent University, Belgium ............ 167
Doña, J. M. / University of Málaga, Spain ............ 805
Dubois, Didier / IRIT, Université de Toulouse, France ............ 97
Feil, Balazs / University of Pannonia, Hungary ............ 55
Fiot, Céline / University of Montpellier II – CNRS, France ............ 727
Galindo, José / University of Málaga, Spain ............ 1, 351
Gonzalez, Claudia / Universidad Simón Bolívar, Venezuela ............ 270
Gosseye, Lise / Ghent University, Belgium ............ 167
Goswami, A. / I.I.T., Kharagpur, India ............ 658
Gupta, D. K. / I.I.T., Kharagpur, India ............ 658
Hadjali, A. / IRISA-ENSSAT, Université de Rennes 1, France ............ 143
Haemmerlé, Ollivier / IRIT, France ............ 299
Hong, Tzung-Pei / National University of Kaohsiung, Taiwan ............ 615
Kacprzyk, Janusz / Polish Academy of Sciences, Poland ............ 34
La Red, D. / National University of the Northeast, Argentina ............ 805
Liétard, Ludovic / IRISA/IUT & IRISA/ENSSAT, France ............ 246
Medina, Juan M. / University of Granada, Spain ............ 435
Meier, Andreas / University of Fribourg, Switzerland ............ 586
Mouaddib, Noureddine / Université de Nantes, France ............ 115
Ounelli, Habib / Tunis El Manar University, Tunisia ............ 351
Peláez, J. I. / University of Málaga, Spain ............ 805
Pivert, O. / IRISA-ENSSAT, Université de Rennes 1, France ............ 143
Prade, Henri / IRIT, Université de Toulouse, France ............ 97
Psaila, Giuseppe / University of Bergamo, Italy ............ 191
Raschia, Guillaume / Université de Nantes, France ............ 115
Rocacher, Daniel / IRISA/IUT & IRISA/ENSSAT, France ............ 246
Rong, Graham H. / Massachusetts Institute of Technology, USA ............ 538
Salguero, A. / Universidad de Granada, Spain ............ 563
Schindler, Günter / Galexis AG, Switzerland ............ 586
Schneider, Markus / University of Florida, USA ............ 490
Shahri, Hamid Haidarian / University of Maryland, USA ............ 745
Sharma, Awadhesh Kumar / MMM Engineering College, Gorakhpur, UP, India ............ 658
Shen, Ju-Wen / Chunghwa Telecom Lab, Taiwan ............ 615
Škrbić, Srđan / University of Novi Sad, Serbia ............ 407
Takači, Aleksandar / University of Novi Sad, Serbia ............ 407
Thomopoulos, Rallou / INRA, France ............ 299
Tineo, Leonid / Universidad Simón Bolívar, Venezuela ............ 270
Touzi, Amel Grissa / Tunis El Manar University, Tunisia ............ 351
Tudorie, Cornelia / University “Dunărea de Jos”, Galaţi, Romania ............ 218
Turgay, Safiye / Abant Izzet Baysal University, Turkey ............ 822
Ughetto, Laurent / Université de Nantes, France ............ 115
Urrutia, Angélica / Universidad Católica del Maule, Chile ............ 270
Veryha, Yauheni / ABB Corporate Research Center, Germany ............ 516
Vila, M. A. / Universidad de Granada, Spain ............ 563
Voglozin, W. Amenel / Université de Nantes, France ............ 115
Vychodil, Vilem / Binghamton University–SUNY, USA and Palacky University, Czech Republic ............ 634
Wang, Shyue-Liang / New York Institute of Technology, USA ............ 615
Wang, Yi / Nottingham Trent University, UK ............ 706
Werro, Nicolas / University of Fribourg, Switzerland ............ 586
Xexéo, Geraldo / Universidade Federal do Rio de Janeiro, Brazil ............ 381
Zadrożny, Sławomir / Polish Academy of Sciences, Poland ............ 34
Section I
Introduction

Volume I

Chapter I
Introduction and Trends to Fuzzy Logic and Fuzzy Databases ............................................ 1
    José Galindo, University of Málaga, Spain

Chapter II
An Overview of Fuzzy Approaches to Flexible Database Querying ................................... 34
    Sławomir Zadrożny, Polish Academy of Sciences, Poland
    Guy de Tré, Ghent University, Belgium
    Rita de Caluwe, Ghent University, Belgium
    Janusz Kacprzyk, Polish Academy of Sciences, Poland

Chapter III
Introduction to Fuzzy Data Mining Methods ....................................................................... 55
    Balazs Feil, University of Pannonia, Hungary
    Janos Abonyi, University of Pannonia, Hungary
Section II
Fuzzy Queries

Chapter IV
Handling Bipolar Queries in Fuzzy Information Processing ............................................... 97
    Didier Dubois, IRIT, Université de Toulouse, France
    Henri Prade, IRIT, Université de Toulouse, France
Chapter V
From User Requirements to Evaluation Strategies of Flexible Queries in Databases ....... 115
    Noureddine Mouaddib, Université de Nantes, France
    Guillaume Raschia, Université de Nantes, France
    W. Amenel Voglozin, Université de Nantes, France
    Laurent Ughetto, Université de Rennes 2, France

Chapter VI
On the Versatility of Fuzzy Sets for Modeling Flexible Queries ........................................ 143
    P. Bosc, IRISA-ENSSAT, Université de Rennes 1, France
    A. Hadjali, IRISA-ENSSAT, Université de Rennes 1, France
    O. Pivert, IRISA-ENSSAT, Université de Rennes 1, France

Chapter VII
Flexible Querying Techniques Based on CBR ..................................................................... 167
    Guy de Tré, Ghent University, Belgium
    Marysa Demoor, Ghent University, Belgium
    Bert Callens, Ghent University, Belgium
    Lise Gosseye, Ghent University, Belgium

Chapter VIII
Customizable Flexible Querying for Classical Relational Databases ................................. 191
    Gloria Bordogna, CNR IDPA, Italy
    Giuseppe Psaila, University of Bergamo, Italy

Chapter IX
Qualifying Objects in Classical Relational Database Querying .......................................... 218
    Cornelia Tudorie, University “Dunărea de Jos”, Galaţi, Romania

Chapter X
Evaluation of Quantified Statements Using Gradual Numbers ........................................... 246
    Ludovic Liétard, IRISA/IUT & IRISA/ENSSAT, France
    Daniel Rocacher, IRISA/IUT & IRISA/ENSSAT, France

Chapter XI
FSQL and SQLf: Towards a Standard in Fuzzy Databases .................................................. 270
    Angélica Urrutia, Universidad Católica del Maule, Chile
    Leonid Tineo, Universidad Simón Bolívar, Venezuela
    Claudia Gonzalez, Universidad Simón Bolívar, Venezuela

Chapter XII
Hierarchical Fuzzy Sets to Query Possibilistic Databases ................................................... 299
    Rallou Thomopoulos, INRA, France
    Patrice Buche, INRA, France
    Ollivier Haemmerlé, IRIT, France
Chapter XIII
Query Expansion by Taxonomy ............................................................................................ 325
    Troels Andreasen, Roskilde University, Denmark
    Henrik Bulskov, Roskilde University, Denmark
Section III
Implementation, Data Models, Fuzzy Attributes, and Applications

Chapter XIV
How to Achieve Fuzzy Relational Databases Managing Fuzzy Data and Metadata ........... 351
    Mohamed Ali Ben Hassine, Tunis El Manar University, Tunisia
    Amel Grissa Touzi, Tunis El Manar University, Tunisia
    José Galindo, University of Málaga, Spain
    Habib Ounelli, Tunis El Manar University, Tunisia

Chapter XV
A Tool for Fuzzy Reasoning and Querying ........................................................................... 381
    Geraldo Xexéo, Universidade Federal do Rio de Janeiro, Brazil
    André Braga, IBM Brazil, Brazil

Chapter XVI
Data Model of FRDB with Different Data Types and PFSQL .............................................. 407
    Aleksandar Takači, University of Novi Sad, Serbia
    Srđan Škrbić, University of Novi Sad, Serbia
Volume II

Chapter XVII
Towards a Fuzzy Object-Relational Database Model ........................................................... 435
    Carlos D. Barranco, Pablo de Olavide University, Spain
    Jesús R. Campaña, University of Granada, Spain
    Juan M. Medina, University of Granada, Spain

Chapter XVIII
Relational Data, Formal Concept Analysis, and Graded Attributes ..................................... 462
    Radim Belohlavek, Binghamton University–SUNY, USA and Palacky University, Czech Republic

Chapter XIX
Fuzzy Spatial Data Types for Spatial Uncertainty Management in Databases .................... 490
    Markus Schneider, University of Florida, USA
Chapter XX
Fuzzy Classification in Shipwreck Scatter Analysis ............................................................. 516
    Yauheni Veryha, ABB Corporate Research Center, Germany
    Jean-Yves Blot, Portugal Institute of Archaeology, Portugal
    João Coelho, Portugal Institute of Archaeology, Portugal

Chapter XXI
Fabric Database and Fuzzy Logic Models for Evaluating Fabric Performance .................. 538
    Yan Chen, Louisiana State University, USA
    Graham H. Rong, Massachusetts Institute of Technology, USA
    Jianhua Chen, Louisiana State University, USA

Chapter XXII
Applying Fuzzy Data Mining to Tourism Area ..................................................................... 563
    R. A. Carrasco, Universidad de Granada, Spain
    F. Araque, Universidad de Granada, Spain
    A. Salguero, Universidad de Granada, Spain
    M. A. Vila, Universidad de Granada, Spain
Section IV
Fuzzy Data Mining

Chapter XXIII
Fuzzy Classification on Relational Databases ...................................................................... 586
    Andreas Meier, University of Fribourg, Switzerland
    Günter Schindler, Galexis AG, Switzerland
    Nicolas Werro, University of Fribourg, Switzerland

Chapter XXIV
Incremental Discovery of Fuzzy Functional Dependencies ................................................. 615
    Shyue-Liang Wang, New York Institute of Technology, USA
    Ju-Wen Shen, Chunghwa Telecom Lab, Taiwan
    Tzung-Pei Hong, National University of Kaohsiung, Taiwan

Chapter XXV
Data Dependencies in Codd’s Relational Model with Similarities ...................................... 634
    Radim Belohlavek, Binghamton University–SUNY, USA and Palacky University, Czech Republic
    Vilem Vychodil, Binghamton University–SUNY, USA and Palacky University, Czech Republic
Chapter XXVI
Fuzzy Inclusion Dependencies in Fuzzy Databases .............................................................. 658
    Awadhesh Kumar Sharma, MMM Engineering College, Gorakhpur, UP, India
    A. Goswami, I.I.T., Kharagpur, India
    D. K. Gupta, I.I.T., Kharagpur, India

Chapter XXVII
A Distributed Algorithm for Mining Fuzzy Association Rules in Traditional Databases ... 685
    Wai-Ho Au, Microsoft Corporation, USA

Chapter XXVIII
Applying Fuzzy Logic in Dynamic Causal Mining ............................................................... 706
    Yi Wang, Nottingham Trent University, UK

Chapter XXIX
Fuzzy Sequential Patterns for Quantitative Data Mining ..................................................... 727
    Céline Fiot, University of Montpellier II – CNRS, France

Chapter XXX
A Machine Learning Approach to Data Cleaning in Databases and Data Warehouses ...... 745
    Hamid Haidarian Shahri, University of Maryland, USA

Chapter XXXI
Fuzzy Decision-Tree-Based Analysis of Databases .............................................................. 760
    Malcolm J. Beynon, Cardiff University, UK

Chapter XXXII
Fuzzy Outranking Methods Including Fuzzy PROMETHEE ............................................... 784
    Malcolm J. Beynon, Cardiff University, UK

Chapter XXXIII
Fuzzy Imputation Method for Database Systems .................................................................. 805
    J. I. Peláez, University of Málaga, Spain
    J. M. Doña, University of Málaga, Spain
    D. La Red, National University of the Northeast, Argentina

Chapter XXXIV
Intelligent Fuzzy Database Management in Multiagent Systems ......................................... 822
    Safiye Turgay, Abant Izzet Baysal University, Turkey
Section I
Introduction

Volume I

Chapter I
Introduction and Trends to Fuzzy Logic and Fuzzy Databases ............................................ 1
    José Galindo, University of Málaga, Spain

This chapter is a basic chapter for new researchers in the area of fuzzy logic. It introduces the main concepts, like fuzzy sets and fuzzy numbers, linguistic labels, membership functions, the representation theorem and the extension principle, fuzzy set operations like union and intersection (t-norms and t-conorms), negations, fuzzy implications, different comparison operations, fuzzy quantifiers, and possibility theory. With respect to fuzzy databases, the chapter gives a brief introduction to the topic and enumerates a list of six research topics in this field.

Chapter II
An Overview of Fuzzy Approaches to Flexible Database Querying ................................... 34
    Sławomir Zadrożny, Polish Academy of Sciences, Poland
    Guy de Tré, Ghent University, Belgium
    Rita de Caluwe, Ghent University, Belgium
    Janusz Kacprzyk, Polish Academy of Sciences, Poland

An overview of the main trends in research on fuzzy querying techniques is presented, covering querying techniques for traditional databases as well as for fuzzy databases.
Chapter III
Introduction to Fuzzy Data Mining Methods ....................................................................... 55
    Balazs Feil, University of Pannonia, Hungary
    Janos Abonyi, University of Pannonia, Hungary

This chapter gives a comprehensive view of the links between fuzzy logic and data mining, following the nine steps of knowledge discovery. It defines and studies methods such as fuzzy clustering, fuzzy classification, fuzzy association rule mining, and visualization of the results.
Section II
Fuzzy Queries

Chapter IV
Handling Bipolar Queries in Fuzzy Information Processing ............................................... 97
    Didier Dubois, IRIT, Université de Toulouse, France
    Henri Prade, IRIT, Université de Toulouse, France

Bipolar queries distinguish between negative and positive preferences in the processing of flexible queries. Negative preferences express what is more or less impossible, and they specify flexible constraints restricting the feasible or tolerated values. Positive preferences are less compulsory and rather express wishes, grading values between indifferent and preferred.

Chapter V
From User Requirements to Evaluation Strategies of Flexible Queries in Databases ....... 115
    Noureddine Mouaddib, Université de Nantes, France
    Guillaume Raschia, Université de Nantes, France
    W. Amenel Voglozin, Université de Nantes, France
    Laurent Ughetto, Université de Rennes 2, France

This chapter studies the whole process of fuzzy querying, from query formulation to evaluation, proposing index structures for the evaluation of fuzzy queries. After introducing different ways of expressing flexibility in queries, the chapter reviews current methods for evaluating fuzzy queries. Finally, SAINTETIQ is presented, a data summarization model that produces a hierarchy of summaries from a relational table and additional metadata.

Chapter VI
On the Versatility of Fuzzy Sets for Modeling Flexible Queries ........................................ 143
    P. Bosc, IRISA-ENSSAT, Université de Rennes 1, France
    A. Hadjali, IRISA-ENSSAT, Université de Rennes 1, France
    O. Pivert, IRISA-ENSSAT, Université de Rennes 1, France

This work advocates extending usual Boolean queries with preferences based on fuzzy sets, highlighting the expressiveness of fuzzy sets and of the division operator in the context of regular databases. Some useful examples are given using the fuzzy query language SQLf.
Chapter VII
Flexible Querying Techniques Based on CBR ..................................................................... 167
    Guy de Tré, Ghent University, Belgium
    Marysa Demoor, Ghent University, Belgium
    Bert Callens, Ghent University, Belgium
    Lise Gosseye, Ghent University, Belgium

The goal of this work is to enhance case-based reasoning (CBR) by modeling a gradation in the similarity of cases. A new case is compared to previous cases in order to predict its unknown data values using possibility theory. Under some conditions, this flexible CBR can be used to enhance flexible querying of regular databases. Briefly, a real-world application is shown for information retrieval in a juridical database.

Chapter VIII
Customizable Flexible Querying for Classical Relational Databases ................................. 191
    Gloria Bordogna, CNR IDPA, Italy
    Giuseppe Psaila, University of Bergamo, Italy

The Soft-SQL project is presented, an extension of SQL for fuzzy queries to classic relational databases. Perhaps its most interesting characteristic is providing tools that allow users to directly specify the context-dependent semantics of soft conditions. For example, a cheap flat in Milan does not have a price similar to a cheap flat in Tokyo.

Chapter IX
Qualifying Objects in Classical Relational Database Querying .......................................... 218
    Cornelia Tudorie, University “Dunărea de Jos”, Galaţi, Romania

The author studies fuzzy queries in order to rank the resulting objects (object qualification). After a discussion of different kinds of fuzzy conditions in a fuzzy query, a new particular condition is proposed: relative object qualification as a query selection criterion, that is, queries with two conditions in which the first one depends on the results of the second one, for example, “Retrieve the inexpensive cars among the high-speed ones.”

Chapter X
Evaluation of Quantified Statements Using Gradual Numbers ........................................... 246
    Ludovic Liétard, IRISA/IUT & IRISA/ENSSAT, France
    Daniel Rocacher, IRISA/IUT & IRISA/ENSSAT, France

The chapter is devoted to the evaluation of quantified statements, which can be found in many applications, for example, in fuzzy querying of databases. It introduces the main techniques to evaluate such statements and proposes a new theoretical background for the evaluation of quantified statements with one or two fuzzy conditions: “most of the employees are well paid” and “most of the young employees are well paid.” The work shows that the context of fuzzy numbers provides some nice characteristics.
Chapter XI
FSQL and SQLf: Towards a Standard in Fuzzy Databases .................................................. 270
    Angélica Urrutia, Universidad Católica del Maule, Chile
    Leonid Tineo, Universidad Simón Bolívar, Venezuela
    Claudia Gonzalez, Universidad Simón Bolívar, Venezuela

The goal of this chapter is to propose a unified SQL-based language for fuzzy relational databases. The authors study the two most general approaches in this field, SQLf and FSQL. They examine the characteristics and definitions of these languages, as well as the current implementations based on both of them.

Chapter XII
Hierarchical Fuzzy Sets to Query Possibilistic Databases ................................................... 299
    Rallou Thomopoulos, INRA, France
    Patrice Buche, INRA, France
    Ollivier Haemmerlé, IRIT, France

Within the framework of flexible querying of possibilistic databases, based on fuzzy set theory, this chapter focuses on the case where the vocabulary used both in the querying language and in the data is hierarchically organized, which occurs in systems that use ontologies. A hierarchical fuzzy set is defined as a fuzzy set whose definition domains are hierarchies. In addition, two applications are presented.

Chapter XIII
Query Expansion by Taxonomy ............................................................................................ 325
    Troels Andreasen, Roskilde University, Denmark
    Henrik Bulskov, Roskilde University, Denmark

An overview of the use of taxonomies and ontologies in querying is presented, with special emphasis on similarity derived from the ontology, where key concepts are organized and related. Queries can be expanded with these similarity measures, thereby causing query evaluation to be based on concepts from the ontology domain rather than on words or numbers in the query.
Section III Implementation, Data Models, Fuzzy Attributes, and Applications Chapter XIV How.to.Achieve.Fuzzy.Relational.Databases.Managing.Fuzzy.Data.and.Metadata. ......................... 351 Mohamed Ali Ben Hassine, Tunis El Manar University, Tunisia Amel Grissa Touzi, Tunis El Manar University, Tunisia José Galindo, University of Málaga, Spain Habib Ounelli, Tunis El Manar University, Tunisia This.chapter.is.addressed.mainly.to.database.administrators.and.enterprises.interested.in.the.fuzzy.capabilities.in.their.current.databases..It.presents.three.migration.approaches.from.real.relational.databases.
toward fuzzy relational databases. These strategies offer different possibilities, from enabling fuzzy queries through the FSQL language to storing fuzzy data. Of course, each possibility poses different problems that must be solved.
Chapter XV
A Tool for Fuzzy Reasoning and Querying ....................................................................................... 381
Geraldo Xexéo, Universidade Federal do Rio de Janeiro, Brazil
André Braga, IBM Brazil, Brazil
CLOUDS is a library and user interface for handling uncertainty in database systems, a tool that allows the creation of fuzzy reasoning systems over classic, nonfuzzy relational databases. It defines a fuzzy extension to SQL queries and was incorporated into a geographic information system.
Chapter XVI
Data Model of FRDB with Different Data Types and PFSQL .......................................................... 407
Aleksandar Takači, University of Novi Sad, Serbia
Srđan Škrbić, University of Novi Sad, Serbia
This chapter introduces a way to extend the relational model with mechanisms that can handle imprecise, uncertain, and inconsistent attribute values using fuzzy logic. Furthermore, a query language called PFSQL is described for this fuzzy database model, with fuzzy capabilities and the possibility to specify priorities in every simple fuzzy condition. The priorities of PFSQL are compared with the thresholds of FSQL.
Volume II
Chapter XVII
Towards a Fuzzy Object-Relational Database Model ........................................................................ 435
Carlos D. Barranco, Pablo de Olavide University, Spain
Jesús R. Campaña, University of Granada, Spain
Juan M. Medina, University of Granada, Spain
The authors introduce a fuzzy object-relational database model including fuzzy extensions of the user-defined data types and the collection types.
Then they study a way to flexibly compare complex data types and an extension of collection types allowing partial membership of their elements. An application in the image-retrieval field is briefly presented.
Chapter XVIII
Relational Data, Formal Concept Analysis, and Graded Attributes ................................................... 462
Radim Belohlavek, Binghamton University–SUNY, USA and Palacky University, Czech Republic
Formal concept analysis with graded (fuzzy) attributes is studied. It is a particular method for the analysis of fuzzy relational data, and an overview of the foundations of this formal concept analysis is presented here, together with concept lattices and attribute implications.
Chapter XIX
Fuzzy Spatial Data Types for Spatial Uncertainty Management in Databases .................................. 490
Markus Schneider, University of Florida, USA
The author proposes some fuzzy spatial data types, introducing fuzzy points, fuzzy lines, and fuzzy regions in two-dimensional space. This chapter also studies fuzzy topological predicates for fuzzy querying with an SQL-like spatial query language.
Chapter XX
Fuzzy Classification in Shipwreck Scatter Analysis .......................................................................... 516
Yauheni Veryha, ABB Corporate Research Center, Germany
Jean-Yves Blot, Portugal Institute of Archaeology, Portugal
Joao Coelho, Portugal Institute of Archaeology, Portugal
This is an application of fuzzy set theory in the area of maritime archaeology. Specifically, the authors show how fuzzy classification using SQL is applied in shipwreck scatter analysis to obtain a user-friendly representation of the wear-type parameters of fragments of ceramics from an ancient shipwreck. This data mining method helps to classify fragments of ceramics by detecting intrinsic classes and neighborhood relations, keeping high precision of data classification in comparison to classical methods. The authors state that this framework can be relatively easily integrated with conventional relational databases, which are widely used in existing archaeological information systems.
Chapter XXI
Fabric Database and Fuzzy Logic Models for Evaluating Fabric Performance ................................. 538
Yan Chen, Louisiana State University, USA
Graham H. Rong, Massachusetts Institute of Technology, USA
Jianhua Chen, Louisiana State University, USA
A Web-based fabric database is introduced in terms of its physical structure, software system architecture, basic and intelligent search engines, and various display methods for search results. This application uses
effective fuzzy linear clustering methods to predict the fabric drape coefficient from fabric mechanical and structural properties, as well as fabric tailorability, with good prediction accuracy. Finally, a neuro-fuzzy computing technique for evaluating nonwoven fabric softness is presented.
Chapter XXII
Applying Fuzzy Data Mining to Tourism Area .................................................................................. 563
R. A. Carrasco, Universidad de Granada, Spain
F. Araque, Universidad de Granada, Spain
A. Salguero, Universidad de Granada, Spain
M. A. Vila, Universidad de Granada, Spain
This chapter proposes the use of an extension of the FSQL language for fuzzy queries as one of the techniques of data mining, which can be used to solve the problem of offering the best place for soaring given the environmental conditions and customer characteristics. After a process of clustering and characterization, the method is able to classify new items into a cluster.
Section IV
Fuzzy Data Mining
Chapter XXIII
Fuzzy Classification on Relational Databases .................................................................................... 586
Andreas Meier, University of Fribourg, Switzerland
Günter Schindler, Galexis AG, Switzerland
Nicolas Werro, University of Fribourg, Switzerland
A context model with fuzzy classes is proposed to extend relational database systems. More precisely, fuzzy classes and linguistic variables and terms, together with appropriate membership functions, are added to the database schema. In order to formulate unsharp queries, the authors present fCQL, a fuzzy classification query language, whose statements are transformed into SQL statements.
Chapter XXIV
Incremental Discovery of Fuzzy Functional Dependencies ............................................................... 615
Shyue-Liang Wang, New York Institute of Technology, USA
Ju-Wen Shen, Chunghwa Telecom Lab, Taiwan
Tzung-Pei Hong, National University of Kaohsiung, Taiwan
Mining fuzzy functional dependencies from fuzzy databases based on similarity relations is studied, and methods are proposed to validate and incrementally search for these dependencies. A detailed example is given to illustrate the process of the mining algorithm. In addition, numerical results are given to show the monotonic characteristics of the fuzzy functional dependencies.
Chapter XXV
Data Dependencies in Codd's Relational Model with Similarities ..................................................... 634
Radim Belohlavek, Binghamton University–SUNY, USA and Palacky University, Czech Republic
Vilem Vychodil, Binghamton University–SUNY, USA and Palacky University, Czech Republic
This chapter deals with fuzzy logic extensions of the relational model that consist of adding similarity
relations to domains and truth degrees attached to the table rows (ranked tables), and considers functional dependencies in these extensions. It presents a particular extension, and functional dependencies in this extension, that follow the principles of fuzzy logic in the narrow sense. This extension is compared to several other extensions proposed in the literature.
Chapter XXVI
Fuzzy Inclusion Dependencies in Fuzzy Databases ........................................................................... 658
Awadhesh Kumar Sharma, MMM Engineering College, Gorakhpur, UP, India
A. Goswami, I.I.T., Kharagpur, India
D. K. Gupta, I.I.T., Kharagpur, India
This chapter introduces a definition of fuzzy inclusion dependencies in fuzzy databases, a fuzzy constraint that can be seen as a fuzzy foreign key between two given fuzzy relations. Inference rules on such dependencies are derived, and an algorithm is proposed for the discovery of these fuzzy inclusion dependencies.
Chapter XXVII
A Distributed Algorithm for Mining Fuzzy Association Rules in Traditional Databases .................. 685
Wai-Ho Au, Microsoft Corporation, USA
A new distributed algorithm for mining fuzzy association rules from very large databases is proposed. This algorithm has a very effective measure to distinguish interesting associations from uninteresting ones. Each site scans its own database partition to obtain the number of tuples characterized by different linguistic variables and linguistic terms. Afterward, the sites exchange their local counts with all the other sites to find the global values.
Chapter XXVIII
Applying Fuzzy Logic in Dynamic Causal Mining ............................................................................ 706
Yi Wang, Nottingham Trent University, UK
This chapter applies fuzzy logic to a dynamic causal mining algorithm, which is a combination of mining rules and system dynamics for discovering causality patterns in a target system. The final goal is for fuzzy logic to assist the user in making better decisions and in better understanding the future behavior of this target system.
Chapter XXIX
Fuzzy Sequential Patterns for Quantitative Data Mining ................................................................... 727
Céline Fiot, University of Montpellier II – CNRS, France
Sequential-pattern methods handle sequence databases, extracting frequently occurring patterns related to time and transforming large amounts of data into useful, comprehensible knowledge. After introducing various fuzzy sequential-pattern approaches and the general principles they are based on, a complete framework is defined for mining fuzzy sequential patterns handling different levels of consideration of quantitative information. This framework is applied to two real databases: Web access logs and a textual database.
Chapter XXX
A Machine Learning Approach to Data Cleaning in Databases and Data Warehouses ......................
745
Hamid Haidarian Shahri, University of Maryland, USA
Data cleaning is largely a duplicate-elimination problem, arising, for example, in data integration and warehousing. Here, neuro-fuzzy techniques are combined to produce a unique adaptive framework for data cleaning, which automatically learns from and adapts to the specific notion of similarity at a meta-level. It can be utilized in the production of an intelligent tool to increase the quality and accuracy of data.
Chapter XXXI
Fuzzy Decision-Tree-Based Analysis of Databases ........................................................................... 760
Malcolm J. Beynon, Cardiff University, UK
This chapter offers a description of fuzzy decision-tree-based research, including the exposition of small and large fuzzy decision trees to demonstrate their construction and practicality. Basically, a fuzzy decision tree is a set of fuzzy if-then decision rules allowing a linguistic interpretation of the considered problem and managing the possibility of imprecision in the data values used.
Chapter XXXII
Fuzzy Outranking Methods Including Fuzzy PROMETHEE ............................................................ 784
Malcolm J. Beynon, Cardiff University, UK
This chapter describes the rudiments of fuzzy outranking methods, with particular attention to fuzzy PROMETHEE, a multicriteria decision-making technique using fuzzy information. Alternative fuzzy PROMETHEE approaches are described, with one used in two real-life applications. Starting with known data about a series of possible alternatives, a preference ranking of them can be achieved.
Chapter XXXIII
Fuzzy Imputation Method for Database Systems ............................................................................... 805
J. I. Peláez, University of Málaga, Spain
J. M. Doña, University of Málaga, Spain
D. La Red, National University of the Northeast, Argentina
Missing data are a frequent problem in real data sets. Imputation is a method to fill in missing data with plausible values to produce a complete data set. This work analyzes the performance of the different traditional data imputation methods. A new fuzzy imputation approach is proposed using the ordered weighted average (OWA) operators by Yager and the majority concept.
Chapter XXXIV
Intelligent Fuzzy Database Management in Multiagent Systems ....................................................... 822
Safiye Turgay, Abant Izzet Baysal University, Turkey
In this chapter, an agent-based fuzzy data mining structure is defined to process and evaluate data and to build a rule structure for the system. Within the developed system, the focus is on the operational features of the fuzzy data mining structure, which are the same for each agent composing the system. The suggested association rules are derived from a relational database.
Foreword
Few scientific communities within computer science have reflected on both their present and their future like the one devoted to databases. For more than 15 years, a very comprehensive and well-known group of researchers of recognized prestige in this field has met regularly to identify the main expected challenges and problems within its scope and to propose which research lines are the most promising and necessary. From the first meetings in Laguna Beach and Palo Alto (1988, 1990), one can follow the proposals and results in the databases field through a series of reports, which in some way constitute a guideline for the development of research in this area (Abiteboul et al., 2005; Bernstein et al., 1998; Silberschatz, Stonebraker, & Ullman, 1991, 1996; Silberschatz, Zdonik, et al., 1996). Because of the experience and quality of the people involved in these meetings, we can assess the opportunity and novelty of any work in databases in light of the recommendations and research hints proposed in these reports. One of the lines that appears with the most continuity and insistence is the treatment of imprecise and uncertain information in databases.
In 1991, the first report contained a section on new concepts in data models, which remarked on the management of uncertainty as a need for inclusion in new data models. At that time, the need was justified by data that are never entirely precise, such as those in satellite photographs. Later, in 1996, the second report also posed new problems associated with vague queries concerning images, but in this case the imprecision was supposed to arise from two sources: first, imprecise features such as color, texture, and so forth, and second, imprecise valuations by the user of time and/or space, such as statements that images are close to something or, for example, were made in the morning. Also in this second report, data mining arose as a new trend in the treatment of information, and it was conceived as a new way of imprecise querying.
In the next report, the same ideas appeared again, but they included the interpretation and management of imprecise results as one of the key research subjects to be studied. Nevertheless, it was in the fourth and fifth reports where the need for including imprecision and uncertainty as natural elements in databases was more clearly reflected. Specifically, in the fifth report, the most recent one, it is said with reference to approximate data, "When one leaves business data processing, essentially all data is uncertain or imprecise. Scientific measurements have standard errors. Location data for moving objects involves uncertainty in current position. Sequence, image, and text similarity are approximate metrics." With reference to imprecise queries, it is said, "users should also be able to ask imprecise queries and have the processing engine include this further source of uncertainty. Of course, with imprecise answers comes a duty for the system to characterize the accuracy offered, so users can understand whether the approximation is good enough for their needs."
Data mining, conceived as a new form of accessing databases, is considered again as a top-priority research line in the fifth report, where it is also emphasized that data mining contains in its own essence the task of answering some kind of imprecise query. Quoting this fifth report, "users invariably point
out they have a single data mining query: Tell me something interesting," which is clearly an imprecise query.
These are the reasons why this book, containing a collection of chapters devoted to research on fuzzy information processing in databases, appears to be quite adequate and timely. Its key subject is directly focused on the resolution of problems that have been considered very important in all the reports referenced. In other words, this book deals with some questions and tasks that the community of database researchers has been pointing out for several years.
The use of fuzzy logic to manage imprecise and/or uncertain information in databases is even older than the aforementioned reports. It is a research line that is widely consolidated and has been developing for over 20 years. Traditionally, two major categories of work have been considered:
a. Those dealing with the problems of flexible querying of databases, which in general consider that the user expresses the query by using imprecise terms; the result is often a set of elements of the database, each with an associated accomplishment degree.
b. Those addressing the description of data models that include imprecise and/or uncertain attributes, relationships, and structures represented by fuzzy sets and fuzzy logic.
The book we are presenting includes some interesting and innovative chapters belonging to both lines of work, starting with a chapter by the editor introducing basic concepts of fuzzy logic and fuzzy databases. The chapter by S. Zadrożny, G. de Tré, R. de Caluwe, and J. Kacprzyk is of special interest since it offers a very wide review of fuzzy flexible querying. D. Dubois and H. Prade discuss the possibility of expressing negative preferences. R. Thomopoulos, P. Buche, and O. Haemmerlé study the use of ontologies in hierarchical queries. On the other hand, different approaches to solving flexible querying are also considered: case-based reasoning by G. de Tré et al., relative object qualification by C. Tudorie, and flexible queries using taxonomies by T. Andreasen and H. Bulskov. The evaluation strategies for fuzzy queries are also studied by W. A. Voglozin, G. Raschia, L. Ughetto, and N. Mouaddib, whereas P. Bosc, O. Pivert, and A. Hadjali present a new study about the expressiveness of fuzzy sets illustrated by the fuzzy division operator. G. Xexéo and A. Braga give a new tool for fuzzy reasoning and querying applied to geographic information systems. In other chapters, M. Schneider defines fuzzy spatial data types, L. Liétard and D. Rocacher give an exhaustive list for the evaluation of quantified statements, and G. Bordogna and G. Psaila define a language for fuzzy querying in classical relational databases.
Formal extensions of the fuzzy relational data model are dealt with in the chapter by A. Takači and S. Škrbić, introducing a query language with the possibility to specify priorities in fuzzy statements; in the chapter by R. Belohlavek, about formal concepts; and in the chapter by R. Belohlavek and V. Vychodil, about similarities.
The interesting problem of implementing fuzzy database languages and systems is also dealt with in the book in works by A. Urrutia, L. Tineo, and C. Gonzalez, and by M. A. Ben Hassine et al. The chapter by C. Barranco, J.
Campaña, and J. M. Medina studies the object-relational approach and the fuzzy object-oriented data model. The above-mentioned chapter by Zadrożny et al. presents both the relational and object-oriented cases where fuzzy queries are made to fuzzy data models.
As we already mentioned, data mining appeared as a major research area inside the database field in the first challenge report 15 years ago, and it has been successively included in this category in all subsequent reports. The use of fuzzy sets and fuzzy logic in data mining has been widely extended,
and, in fact, before data mining was properly considered a research area, some of its problems were addressed by means of fuzzy approaches. In this sense, let us remember the well-known fuzzy extensions of the K-means method for clustering problems, or the use of fuzzy rules in classification models. However, it has been in the last decade that this research line was consolidated, with the appearance of a wide variety of suggestive results, such as new fuzzy clustering approaches, different fuzzy association-rule definitions, new fuzzy classification techniques, and so forth.
This book also offers interesting results on fuzzy data mining topics. First, the chapter by B. Feil and J. Abonyi is an excellent theoretical review. The use of fuzzy decision trees is studied in one chapter by M. J. Beynon. The extraction of fuzzy association rules is discussed by W.-H. Au and Y. Wang in their respective chapters. C. Fiot studies sequential pattern discovery, S. L. Wang et al. study fuzzy functional dependencies, and A. K. Sharma, A. Goswami, and D. K. Gupta define fuzzy inclusion dependencies. Fuzzy classification is studied by A. Meier, G. Schindler, and N. Werro. Subjects associated with data mining, such as data cleaning and decision making, are presented in the works of H. H. Shahri and M. J. Beynon, respectively. S. Turgay proposes an agent-based fuzzy data mining structure, and missing data are studied by J. I. Peláez, J. M. Doña, and D. La Red using the so-called fuzzy imputation approach.
Additionally, the diversity of the theoretical results cited above, covering many topics in fuzzy information processing, should generate many different applications, which can mainly be found in the chapters by Y. Veryha et al., Y. Chen et al., and R. Carrasco et al., as well as in the examples and demonstrations of other chapters.
Summarizing, we can state that the present book offers an excellent perspective on what is currently being investigated in uncertainty and imprecision management by means of fuzzy sets and fuzzy logic in the field of databases and data mining. Furthermore, all chapters include good introductions to their respective topics, good lists of references, and some key terms with concise and useful definitions. Therefore, we are sure that this handbook will be very informative and useful for a broad class of researchers, students, and companies related to the database world.

Professor Dr. M. Amparo Vila and Professor Dr. Miguel Delgado
University of Granada
Granada, Spain
References
Abiteboul, S., Agrawal, R., Bernstein, P., Carey, M., Ceri, S., et al. (2005). The Lowell database research self-assessment. Communications of the ACM, 48(5), 111-118.
Bernstein, P., Brodie, M., Ceri, S., DeWitt, D., Franklin, M., et al. (1998). The Asilomar report on database research. ACM SIGMOD Record, 27(4), 74-80.
Silberschatz, A., Stonebraker, M., & Ullman, J. D. (Eds.). (1991). Database systems: Achievements and opportunities. Communications of the ACM, 34(10), 110-120.
Silberschatz, A., Stonebraker, M., & Ullman, J. D. (Eds.). (1996). Database research: Achievements and opportunities into the 21st century. ACM SIGMOD Record, 25(1), 52-63.
Silberschatz, A., Zdonik, S., et al. (1996). Strategic directions in database systems: Breaking out of the box. ACM Computing Surveys, 28(4), 764-778.
Preface
In order to write this preface, I began to reread one of the most inspiring books I have ever read, going through many of the underlined sentences (I always underline good books). Suddenly, the muses visited me and told me that it would be easier if I quoted some interesting text. One of the most prestigious Italian philosophers, Ludovico Geymonat (1908-1991), said,1 "In any investigation, the first step of human reason is to show the existing difficulties, not to hide them, even if they are very serious. Only those who know them, not those who ignore them, can feel the impulse to search for the indispensable means to overcome them; and this search is the decisive spring of scientific progress."
I think that, today, most research papers focus on only a few possible solutions to a very small and very concrete subject, and even from a very local point of view. Is this useful? I think so, of course. However, it is possible that many researchers are more interested in increasing the number of their publications than in the quality of these works, or in whether these works can be extended with a wider point of view, studying previous works and showing the most important "existing difficulties." In this book, the referees and I have spared no effort to reduce these problems, but I am not sure whether we have achieved it. Indeed, in science and research (at least), it is important not to be really sure of anything. Skepticism is important for scientific progress, and thus it was taught by scholars from Pyrrho of Elis (365-275 B.C.) to René Descartes (1596-1650), including the physician Sextus Empiricus (second to third centuries, A.D.) and Michel de Montaigne (1533-1592). Thus, with this humility that must characterize every research work, we present this book and hope that it contributes a bit to scientific progress and therefore to a better world.
In the context of this handbook, Vila and Delgado defend in the foreword that the treatment of imprecise and uncertain information in databases is a very interesting research line. Imprecision has been studied in order to develop systems, databases, and consequently applications that support this kind of information. Most works studying imprecision in information have used possibility, similarity, and fuzzy techniques. In the foreword, the reader can find an interesting overview of each chapter of this volume.
Basically, a fuzzy database is a database with fuzzy characteristics, particularly fuzzy attributes. These may be defined as attributes of an item, row, or object in a database that allow the storage of fuzzy
information (imprecise or uncertain data). There are many ways of adding flexibility to fuzzy databases. The simplest technique is to add a fuzzy membership degree to each record, that is, an attribute in the range [0,1]. However, there are other kinds of databases allowing fuzzy values to be stored in fuzzy attributes, using fuzzy sets (including fuzzy spatial data types), possibility distributions, or fuzzy degrees associated with some attributes and with different meanings (membership degree, importance degree, fulfillment degree, etc.). Sometimes, the expression fuzzy databases is used for classical databases with fuzzy queries or with other fuzzy aspects, such as constraints. The first chapter gives a wide historical point of view summarizing the main fuzzy database models, but this scientific field has a very promising future.
The research on fuzzy databases has been developing for about 20 years and is concentrated mainly on the following six research lines:
1. Fuzzy querying in classical databases
2. Fuzzy queries on fuzzy databases
3. Extension of classical data models in order to achieve fuzzy databases (fuzzy relational databases, fuzzy object-oriented databases, etc.)
4. Fuzzy conceptual modeling tools
5. Fuzzy data mining techniques
6. Applications of these advances in real databases
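As a minimal sketch of the simplest technique mentioned above (a membership-degree attribute in the range [0,1] attached to each record), the table and column names below are purely illustrative, not taken from any chapter of this handbook:

```python
import sqlite3

# In-memory database; "employee" and its columns are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        name   TEXT,
        age    INTEGER,
        degree REAL CHECK (degree >= 0.0 AND degree <= 1.0)  -- membership degree in [0,1]
    )
""")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ana", 28, 0.9), ("Luis", 47, 0.3), ("Marta", 35, 0.6)],
)

# Instead of a crisp yes/no answer set, tuples can be ranked
# by their membership degree.
rows = conn.execute(
    "SELECT name, degree FROM employee ORDER BY degree DESC"
).fetchall()
print(rows)  # [('Ana', 0.9), ('Marta', 0.6), ('Luis', 0.3)]
```

The CHECK constraint enforces the [0,1] range at the schema level; richer representations (possibility distributions, fuzzy sets over domains) require the extended models discussed in the chapters above.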
All of these different issues have been studied in different chapters of this volume, except the fourth item because, in general, there is little interest in fuzzy conceptual issues, and this subject has been studied in some other works in a very exhaustive manner (see related references in Chapter I).
Querying with imprecision, contrary to classical querying, allows users to use fuzzy linguistic labels (also named linguistic terms) and express their preferences to better qualify the data they wish to get. An example of a flexible query, also named in this context a fuzzy query, would be "a list of the young employees working in a department with a big budget." This query contains the fuzzy linguistic labels young and big budget. These labels are words, in natural language, that express or identify a fuzzy set (fixed or context dependent). Summarizing, fuzzy queries are useful to reflect the preferences of the end user and to rank the solutions. The ability to make fuzzy queries in classical databases is very useful because currently there are many classical databases.
The second research line includes the first one, but we prefer to separate them because this second line finds new problems that must be studied, and because it must be framed in a concrete fuzzy database model (third research line). These first two lines are summarized by Zadrożny et al. in their chapter. On the other hand, Chapters IV to XIII study concrete problems of the fuzzy querying world (bipolar queries, fuzzy languages, quantified queries, etc.).
This handbook also includes interesting chapters about the third item, extending classical data models in order to achieve fuzzy databases. They study useful topics, such as how a database administrator may achieve a fuzzy relational database, a new fuzzy relational model with a fuzzy query language including the possibility to specify priorities for fuzzy statements, a good approach to creating a fuzzy
object-relational database model, fuzzy spatial data types, and even more.
Regarding fuzzy data mining issues, this handbook includes a complete review chapter by Feil and Abonyi studying the main fuzzy data mining methods. This is probably the most promising area because today there are many databases that may give us information if we use the proper tools. Perhaps the most interesting and useful tools are fuzzy clustering and fuzzy dependencies, and both are also studied in different chapters of this handbook.
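The flexible query discussed earlier ("young employees working in a department with a big budget") can be evaluated in a few lines once each linguistic label is mapped to a membership function. The trapezoidal shapes and their parameters below are assumptions chosen for illustration only, not definitions from any chapter:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises on [a,b], equals 1 on [b,c], falls on [c,d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Hypothetical, context-dependent label definitions.
def young(age):        return trapezoid(age, 0, 0, 30, 40)
def big_budget(euros): return trapezoid(euros, 200_000, 500_000, 10**9, 10**9 + 1)

employees = [
    ("Ana",   28, 600_000),   # (name, age, department budget)
    ("Luis",  35, 900_000),
    ("Marta", 52, 700_000),
]

# Combine the two fuzzy conditions with the minimum t-norm and
# rank the answers by their fulfillment degree.
results = sorted(
    ((name, min(young(age), big_budget(budget))) for name, age, budget in employees),
    key=lambda r: r[1],
    reverse=True,
)
print(results)  # [('Ana', 1.0), ('Luis', 0.5), ('Marta', 0.0)]
```

This ranking by fulfillment degree, rather than a crisp answer set, is exactly what distinguishes a fuzzy query from a classical one; languages such as SQLf and FSQL, studied in this handbook, build this behavior into the query language itself.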
The last research line, applications, is also studied in some chapters. These chapters mix different theoretical issues, like data mining, with real contexts to achieve different goals. Many chapters end with some examples or applications of their topics; however, we want to highlight that the third part of this handbook includes some chapters with very interesting applications.
In summary, this handbook includes a very good selection of works by leaders in this field. Each chapter includes a good introduction, shows some of the new advances and the future lines in its corresponding topic, and gives some key-term definitions. We can be assured that fuzzy databases will be studied and developed in the upcoming years with the main target of improving current databases. It is easy to see that scientific and technological development, including information science, can assist humankind in making the world a better place to live. Therefore, why is this not achieved on the whole planet? Why is this world so unjust? Why can we not enhance our lives without destroying other forms of life, plants, animals, and our own fellow people? Perhaps it is useful to reflect on one dissertation by Geymonat in which he studies antiquity after Aristotle of Stagira (384-322 B.C.). He studied Archimedes of Syracuse (287-212 B.C.) and Heron of Alexandria (about the first century A.D.), and their fusion of science and technology. Then he wondered why the ancient world did not develop a mechanical civilization. He said that, probably, the reason lay in the social structure of the Greco-Latin world, which did not feel the necessity of inventing new machines because it already had cheap and efficient machines: slaves. The Latin writer Marcus Terentius Varro (116-27 B.C.) confirms that slavery was then seen as a true machine.
The French economist Bertrand De Jouvenel (1903-1987) would say that our modern machines work because we have made "the big mutation" from soil forces (animals, water, wind) to subsoil forces (coal, petroleum). In this way, the invention patented in 1769 by the Scottish engineer James Watt (1736-1819) "provokes a huge difference between leader countries and other civilizations that, in the 18th century, it had never crossed anybody's mind to consider as inferior."2 These dissertations invite us to think that perhaps it is impossible to reach a very comfortable society without using slavery or subsoil forces, two options with a lot of big problems. It is sad, but our world is now using both options, especially the rich countries. Modern slavery is located in far-off and poor countries (many adults and children work in very bad conditions making sport shoes, footballs, and ephemeral toys, and even in the dangerous fields of gold and diamond mining, tobacco plantations, etc.). On July 4, 2007, newspapers reported that in Brazil more than 1,000 slaves were freed from a sugar-cane plantation where they were being forced to work 14-hour days in horrendous conditions, cutting cane for ethanol production. Human-rights groups and labour organizations believe that between 25,000 and 80,000 people could be working in conditions akin to slavery in Brazil (in deforestation and on sugar-cane, coffee, and cotton plantations). Most of these products (e.g., wood) go to rich countries. The same newspapers published that Brazil, one of the world's largest producers of alternative fuel and the number-one exporter of ethanol made from sugar cane, plans to double production of the biofuel over the next 5 years, and more than 50% of it will go to ecological Europe. Biofuels3 may be an interesting alternative to subsoil forces, especially for a hungry planet.
They may be a renewable energy source (when produced with the appropriate manure and not transported over long distances, for example). I am not sure whether Brazilian biofuel is a good environmental option for Europe, but it is not an ethical option if we do not know whether it uses forced labour or abusive work conditions, which are illegal in Europe, or even whether it uses ecological agriculture techniques or not. Where is the solution? The solution is in our hands, in you and me, in all the citizens of the world. All of us must demand ethical politics and refuse such a comfortable and consumerist society. It is a pleasure to drive a car or to eat meat every day, or to have many shoes, jackets, rings, and necklaces, but, unfortunately, it is not sustainable. I do not know whether we will be able to achieve sustainable development, or whether it is even possible, but in any case we must use every endeavor to reach it. We must decide
whether development is more or less important than sustainability, because many times we will have to choose between these two concepts. Unfortunately, I do not have global solutions, but I need to believe that solutions exist. For now, I can think of some local proposals, like planting trees (in order to preserve soil, water, and biodiversity), not eating meat every day or buying unnecessary or "fussy" objects (because they need large quantities of energy), and living with open eyes and mind, looking for situations where we can help to achieve a better world. Our life is not neutral. We contribute to changing this world, for worse or for better. Our activity and knowledge have an influence on our little planet. Geymonat wrote that knowledge is not only the result of personal ingenuity, but that "it sinks its own roots in the whole collection of the diverse human activities," and that there are two kinds of scientific and philosophic research. One of them consists of well-connected systems (such as those by Aristotle or Euclid), while the other consists of connected fragments. Neither is better than the other, because the best one is that "which provokes the highest interest to continue researching and the highest trust in the power of investigation." Like other philosophers, such as the Spaniard Ortega y Gasset (1883-1955), Geymonat said that preserving the past and looking for the new are complementary aspects, and both of them are indispensable at the same level. This book brings together some connected fragments, and they form a well-connected system in the particular area of fuzzy databases. I think that it will provoke at least some interest to continue researching and some trust in the investigation. Maybe the next generation of database management systems will
include many fuzzy characteristics, and users will enjoy fuzzy interfaces, fuzzy queries, fuzzy dependencies, and fuzzy data mining even without knowing anything about t-norms, fuzzy measures, FSQL, or a man called Zadeh. In this sense, I think and hope that this book will be at least a bit useful. This book was a big effort for me, but it was also a big effort for the authors, referees, and publisher. Each chapter has been reviewed by three to five referees who looked for errors and areas in need of improvement, proposing interesting approaches, references, and so forth. I will be very satisfied if someone finds more errors or improvements, because it will mean that this handbook provokes at least some interest to continue researching. All of us must undertake a continuous process of apprenticeship, research, meditation, and thinking over everything. If we refuse to do that, then television, the mass media, and politicians will be very happy, because they will do it for us.
Endnotes

1. Ludovico Geymonat was a professor at the University of Milan. His book Historia de la Filosofía y de la Ciencia (2nd edition in Spanish, 2006, translated as History of Philosophy and Science) is a synthesis of his two masterpieces Storia della Filosofia and Storia del Pensiero Filosofico. An abstract in Spanish is available at http://www.resumelibros.tk and http://www.lcc.uma.es/~ppgg/libros.
2. Bertrand De Jouvenel's (1976) The Civilization of Power: From Political Economy to Political Ecology (abstract in Spanish available at http://www.resumelibros.tk and http://www.lcc.uma.es/~ppgg/libros).
3. BirdLife International's (2005) Bioenergy: Fuel for the Future? A BirdLife International Position Paper on Bioenergy Use in the EU (retrieved from http://www.birdlife.org).
Professor Dr. José Galindo (editor) University of Málaga Málaga, Spain
Acknowledgment
The editor would like to acknowledge the help of all involved in the writing and review process of this handbook, without whose support the project could not have been satisfactorily completed. Deep appreciation is due to the Spanish Ministry of Education and Science projects TIN2006-14285 and TIN2006-07262, to the Spanish Consejería de Innovación Ciencia y Empresa de Andalucía research project TIC-1570, and especially to their respective directors, whose partial support allowed me to edit this handbook. Most of the authors of chapters included in this handbook also served as referees for chapters written by other authors. Thanks go to all those who provided constructive and comprehensive reviews, including the Program Committee. Of course, special thanks go also to my editorial advisory board, a group of wonderful and qualified researchers who aided in the review process and strengthened the overall quality of this publication. All these researchers were really necessary, because every chapter has been reviewed by at least three different referees, and all their comments contributed to enhancing every chapter. Special thanks also go to the publishing team at IGI Global, whose contributions throughout the whole process, from inception of the initial idea to final publication, have been invaluable. In particular, thanks go to Ms. Kristin Roth, who continuously supervised the project's progress via e-mail, even in Spanish, and whose enthusiasm motivated me to continue working on this project. Jessica Thompson also did a wonderful job. I want to acknowledge the special English review by M. Carmen Chaves and Dr. Salvador Arijo, highlighting here their excellent contribution as Greenpeace1 volunteers in the unsustainable Spanish province of Málaga. I cannot forget to mention the essential contribution of Professor Dr. M. C. Aranda, in both scientific and personal areas, bearing without complaint part of the huge task of editing this work.
Finally, I wish to thank all of the authors for their excellent contributions to this handbook, including the authors of works that could not be published (unfortunately, this book had a limited extension).

1. Greenpeace is an independent global organization that acts to change attitudes and behavior, to protect and conserve the environment, and to promote peace (http://www.greenpeace.org).
About the Editor
José Galindo has a PhD in computer science from the University of Granada (Spain) and is a professor of computer science in the School of Engineering at the University of Málaga (Spain). He is the author of several didactical and research books and papers on computer science, databases, information systems, and fuzzy logic. He is coauthor of the book Fuzzy Databases: Modeling, Design and Implementation, published by Idea Group Publishing (Hershey, USA) in 2006, and the editor of the current handbook. His research interests are fuzzy logic, fuzzy databases, and ethical issues in the technological age. He is a member of the IdBIS research group and the Ibero-American research project RITOS-2.
Section I
Introduction
Chapter I
Introduction and Trends to Fuzzy Logic and Fuzzy Databases José Galindo University of Málaga, Spain
Abstract

This chapter presents an introduction to fuzzy logic and to fuzzy databases. With regard to the first topic, we introduce the main concepts in this field in order to facilitate the understanding of the rest of the chapters for readers who are new to fuzzy subjects. With respect to fuzzy databases, this chapter gives a list of six research topics in this fuzzy area. All these topics are briefly commented on, and we include references to books, papers, and even other chapters of this handbook, where some interesting reviews of different subjects and new approaches with different goals can be found. Finally, we give a historic summary of some fuzzy models, and we conclude with some future trends in this scientific area.
Introduction

Fuzzy logic is only a mathematical tool, but it is possibly the best tool for treating uncertain, vague, or subjective information. Just to give an idea of the importance of this soft computing tool, we can mention the large number of publications in this field, including two research journals of great quality: Fuzzy Sets and Systems1 and IEEE Transactions on Fuzzy Systems.2 In particular, fuzzy logic has been applied to databases in many scientific papers and real applications. Undoubtedly, it is a modern research field, and it has a long road ahead. This handbook is only one step. Perhaps it is a big step.
For that reason, we will begin by introducing some basic concepts of the fuzzy sets theory. We include definitions, examples, and useful tables with reference data (for example, lists of t-norms, t-conorms, and fuzzy implications). We can find these and other concepts in other chapters of this book, possibly with different notation. The second part of this chapter studies basic concepts about fuzzy databases, including a list of six research topics on fuzzy databases. All these topics are briefly commented on, and we include references to books, papers, and even to other chapters of this handbook. Then, an overview about the basic fuzzy database models is included to give an introduction to these topics and also to the whole handbook.
Fuzzy Sets

In the literature, we can find a large number of papers dealing with this theory, which was first introduced by Lotfi A. Zadeh3 in 1965 (Zadeh, 1965). A compilation of some of the most interesting articles published by Zadeh on the theme can be found in Yager, Ovchinnikov, Tong, and Nguyen (1987). Dubois and Prade (1980, 1988) and Zimmerman (1991) bring together the most important aspects of the theory of fuzzy sets and the theory of possibility. A more modern synthesis of fuzzy sets and their applications can be found in Buckley and Eslami (2002); Kruse, Gebhardt, and Klawonn (1994); Mohammad, Vadiee, and Ross (1993); Nguyen and Walker (2005); and Piegat (2001), and particularly in Pedrycz and Gomide (1998). Ross (2004) includes some engineering applications, and Sivanandam, Sumathi, and Deepa (2006) present an introduction using MATLAB. A complete introduction in Spanish is given in Escobar (2003) and Galindo (2001). The original interpretation of fuzzy sets arises from a generalization of the classic concept of a subset, extended to embrace the description of "vague" and "imprecise" notions. This generalization is made by considering that the membership of an element to a set becomes a "fuzzy" or "vague" concept. For some elements, it may not be clear whether they belong to a set or not. Their membership may then be measured by a degree, commonly known as the "membership degree" of that element to the set, which takes a value in the interval [0,1] by agreement. Using classic logic, it is only possible to deal with information that is totally true or totally false; it is not possible to handle information inherent to a problem that is imprecise or incomplete, even though this type of information contains data that would allow a better solution to the problem. In classic logic, the membership of an element to a set is represented by 0 if it does not belong and by 1 if it does, giving the set {0,1}.
On the other hand, in fuzzy logic, this set is extended to the interval [0,1]. Therefore, it could be said that fuzzy logic is an extension of the classic systems (Zadeh, 1992). Fuzzy logic is
the logic behind approximate reasoning instead of exact reasoning. Its importance lies in the fact that many types of human reasoning, particularly reasoning based on common sense, are by nature approximate. Note the great potential that the use of membership degrees represents by allowing something qualitative (fuzzy) to be expressed quantitatively by means of the membership degree. A fuzzy set can be defined more formally as:

Definition 1: A fuzzy set A over a universe of discourse X (a finite or infinite interval within which the fuzzy set can take a value) is a set of pairs:

A = {μA(x)/x : x ∈ X, μA(x) ∈ [0,1] ⊆ ℝ}   (1)

where μA(x) is called the membership degree of the element x to the fuzzy set A. This degree ranges between the extremes 0 and 1 of the domain of the real numbers: μA(x) = 0 indicates that x in no way belongs to the fuzzy set A, and μA(x) = 1 indicates that x completely belongs to the fuzzy set A. Note that μA(x) = 0.5 is the point of greatest uncertainty. Sometimes, instead of giving an exhaustive list of all the pairs that make up the set (discrete values), a definition is given for the function μA(x), referred to as the characteristic function or membership function. The universe X may be called the underlying universe or underlying domain, and in a more generic way, a fuzzy set A can be considered a function μA that matches each element of the universe of discourse X with its membership degree to the set A:

μA(x): X → [0,1]   (2)
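Definition 1 says that a fuzzy set is fully determined by its membership function. As an illustrative sketch (the set "approximately 20" and its function name are examples of our own, not taken from the chapter):

```python
# A fuzzy set over a universe X is just a function mu: X -> [0, 1].
# Illustrative triangular membership function for "approximately 20".

def mu_around_20(x: float) -> float:
    """Membership degree of x in the fuzzy set 'approximately 20'."""
    return max(0.0, 1.0 - abs(x - 20.0) / 5.0)

print(mu_around_20(20.0))  # 1.0 (complete membership)
print(mu_around_20(22.5))  # 0.5 (greatest uncertainty point)
print(mu_around_20(30.0))  # 0.0 (no membership at all)
```

Any function into [0,1] works here; the triangular shape is just one of the standard forms discussed later in the chapter.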
The universe of discourse X, or the set of considered values, can be of two types:

• Finite or discrete universe of discourse X = {x1, x2, ..., xn}, where a fuzzy set A can be represented by:

A = μ1/x1 + μ2/x2 + ... + μn/xn   (3)

where μi, with i = 1, 2, ..., n, represents the membership degree of the element xi. Normally, the elements with a zero degree are not listed. Here, the + does not have the same significance as in an arithmetical sum, but rather the meaning of aggregation, and the / does not signify division, but rather the association of both values.

• Infinite universe of discourse, where a fuzzy set A over X can be represented by:

A = ∫ μA(x)/x   (4)

Actually, the membership function μA(x) of a fuzzy set A expresses the degree to which x verifies the category specified by A.
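For a finite universe, the μi/xi notation of Equation (3) maps naturally onto a dictionary; an illustrative sketch (element names are hypothetical):

```python
# Discrete fuzzy set (Equation 3): each listed element is paired with its
# membership degree; elements with degree 0 are simply not listed.
A = {"x1": 0.3, "x2": 1.0, "x3": 0.7}   # A = 0.3/x1 + 1.0/x2 + 0.7/x3

def degree(fuzzy_set: dict, x) -> float:
    """Membership degree of x; unlisted elements have degree 0."""
    return fuzzy_set.get(x, 0.0)

print(degree(A, "x2"))  # 1.0
print(degree(A, "x9"))  # 0.0
```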
A linguistic label is a word, in natural language, that expresses or identifies a fuzzy set, which may or may not be formally defined. With this definition, we can state that in our everyday life we use several linguistic labels to express abstract concepts such as "young," "old," "cold," "hot," "cheap," "expensive," and so forth. Another interesting concept, the linguistic variable (Zadeh, 1975), is defined in the chapter by Xexéo and Braga in this handbook. Basically, a linguistic variable is a variable that may have fuzzy values. A linguistic variable is characterized by the name of the variable, the underlying universe, and a set of linguistic labels (or a way to generate these names) together with their definitions. The intuitive definition of the labels varies not only from one person to another, depending on the moment, but also with the context in which it is applied. For example, a "tall" person and a "tall" building do not measure the same. Example 1: "Temperature" is a linguistic variable. We can define four linguistic labels, like "Very_Cold," "Cold," "Hot," and "Very_Hot," using the membership functions depicted in Figure 1.
The frame of cognition, or frame of knowledge, is the set of labels, usually associated with normalized fuzzy sets (Definition 11), used as reference points for fuzzy information processing.
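Example 1 and the frame of cognition above can be sketched in code. The trapezoidal shapes and breakpoints below are illustrative assumptions, since the text only fixes the four label names:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: ramps up on [a,b], flat on [b,c], down on [c,d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Frame of cognition for the linguistic variable Temperature (Example 1).
# The breakpoints in degrees Celsius are illustrative, not taken from Figure 1.
temperature_labels = {
    "Very_Cold": lambda t: trapezoid(t, -100.0, -99.0, 0.0, 10.0),
    "Cold":      lambda t: trapezoid(t, 0.0, 10.0, 15.0, 20.0),
    "Hot":       lambda t: trapezoid(t, 15.0, 20.0, 30.0, 35.0),
    "Very_Hot":  lambda t: trapezoid(t, 30.0, 35.0, 100.0, 101.0),
}

# A single temperature can belong to several labels with different degrees.
for label, mu in temperature_labels.items():
    print(label, mu(18.0))
```

Note how 18 ºC is partly "Cold" and partly "Hot": overlapping labels are exactly what makes the frame of cognition useful.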
Characteristics and Applications

Fuzzy logic is a multivalued logic, the main characteristics of which are (Zadeh, 1992):

• In fuzzy logic, exact reasoning is considered a specific case of approximate reasoning.
• Any logical system can be converted into terms of fuzzy logic.
• In fuzzy logic, knowledge is interpreted as a set of flexible or fuzzy restrictions over a set of variables (e.g., the variable Temperature is Cold).
• Inference is considered a process of propagation of those restrictions. Inference is understood to be the process by which a result is reached, consequences are obtained, or one fact is deduced from another.
• In fuzzy logic, everything is a matter of degree.
From this simple concept, a complete mathematical and computing theory has been developed which facilitates the solution of certain problems (see the references at the beginning of this chapter). Fuzzy logic has been applied to a multitude of disciplines such as control systems, modeling, simulation, prediction, optimization, pattern recognition (e.g., word recognition), information or knowledge systems (databases, knowledge management systems, case-based reasoning systems, expert systems, etc.), computer vision, biomedicine, picture processing, artificial intelligence, artificial life, and so forth.

Figure 1. A frame of cognition with four linguistic labels ("Very_Cold," "Cold," "Hot," "Very_Hot") for temperature (Example 1)

Summarizing, fuzzy logic may be an interesting tool where hitherto known methods fail, notably in complex processes, where we need to introduce the expert knowledge of experienced people, or where there are unknown magnitudes or ones that are difficult to measure in a reliable way. In general, fuzzy logic is used when we need to represent and operate with uncertain, vague, or subjective information. Many applications use fuzzy logic together with other general or soft computing tools like genetic algorithms (GAs), neural networks (NNs), or rule-based systems.
Membership Functions

Zadeh proposed a series of membership functions that can be classified into two groups: those made up of straight lines ("linear") and Gaussian forms ("curved"). We will now look at some types of membership functions. These types of fuzzy sets are known as convex fuzzy sets in fuzzy set theory, with the exception of the extended trapezium, which does not necessarily have to be convex, although for semantic reasons this property is always desirable.

• Triangular (Figure 2): Defined by its lower limit a, its upper limit b, and the modal value m, so that a < m < b:

A(x) = 0 if x ≤ a;  (x − a)/(m − a) if a < x ≤ m;  (b − x)/(b − m) if m < x < b;  0 if x ≥ b   (5)

• Singleton (Figure 3): It takes the value zero in all the universe of discourse except at the point m, where it takes the value 1. It is the representation of a nonfuzzy (crisp) value:

A(x) = 1 if x = m;  0 otherwise   (6)

• L Function (Figure 4): This function is defined by two parameters, a and b, in the following way, using a linear shape:

L(x) = 1 if x ≤ a;  (b − x)/(b − a) if a < x < b;  0 if x ≥ b   (7)

• Gamma Function (Figure 5): It is defined by its lower limit a and the value k > 0. Two definitions:

Γ(x) = 0 if x ≤ a;  1 − e^(−k(x − a)²) if x > a   (8)

Γ(x) = 0 if x ≤ a;  k(x − a)²/(1 + k(x − a)²) if x > a   (9)

This function is characterized by rapid growth starting from a. The greater the value of k, the greater the rate of growth, and the growth rate is greater in the first definition than in the second. Both have a horizontal asymptote at 1. The gamma function is also expressed in a linear way (Figure 5b):

Γ(x) = 0 if x ≤ a;  (x − a)/(b − a) if a < x < b;  1 if x ≥ b   (10)

• Pseudo-Exponential Function (Figure 9): Defined by the modal value m and a value k > 1. As the value of k increases, the growth rate increases and the bell becomes narrower:

P(x) = 1/(1 + k(x − m)²)   (14)

Figure 9. Pseudo-exponential fuzzy set

• Extended Trapezoid Function (Figure 10): Defined by the four values of a trapezoid (a, b, c, d), and a list of points between a and b, and/or between c and d, with their membership values (height) associated to each of these points (ei, hei).
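The piecewise formulas above translate directly into code. A sketch with illustrative function names (not from the chapter):

```python
import math

def triangular(x, a, m, b):
    """Equation (5): triangular set with limits a, b and modal value m."""
    if x <= a or x >= b:
        return 0.0
    if x <= m:
        return (x - a) / (m - a)
    return (b - x) / (b - m)

def l_function(x, a, b):
    """Equation (7): 1 up to a, linear descent on (a, b), 0 afterwards."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)

def gamma_exp(x, a, k):
    """Equation (8): rapid growth from a, horizontal asymptote at 1."""
    return 0.0 if x <= a else 1.0 - math.exp(-k * (x - a) ** 2)

def gamma_linear(x, a, b):
    """Equation (10): linear version of the gamma function."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

print(triangular(20.0, 10.0, 20.0, 30.0))  # 1.0 (modal value)
print(l_function(15.0, 10.0, 20.0))        # 0.5
print(gamma_linear(15.0, 10.0, 20.0))      # 0.5
```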
Figure 10. Extended trapezoidal fuzzy set

Comments:

• In general, the trapezoid function adapts quite well to the definition of any concept in human contexts, with the advantage that it is easy to define, easy to represent, and simple to calculate.
• In specific cases, the extended trapezoid is very useful. It allows greater expressiveness through increased complexity.
• In general, a more complex function is usually difficult to define with precision and probably does not give increased precision, as we must keep in mind that we are defining a fuzzy concept.
• Concepts that require a nonconvex function can be defined. In general, a nonconvex function expresses the union of two or more concepts, the representation of which is convex.

In fuzzy control, for example, the aim is to express the notions of "increase," "decrease," and "approximation," and in order to do this, the types of membership functions previously mentioned are used. The membership functions Gamma and S would be used to represent linguistic labels such as "tall" or "hot" in the domains of height and temperature. Linguistic labels such as "small" and "cold" would be expressed by means of the L function. On the other hand, approximate notions are sometimes difficult to express with one word. In the domain of temperature, it would be "comfortable" or "approximately 20 ºC," which would be expressed by means of the triangular, trapezoid, or Gaussian function.

Concepts about Fuzzy Sets

In this section, the most important concepts about fuzzy sets are defined. These concepts allow us to deal with fuzzy sets, measure and compare them, and so on.

Definition 2: Let A and B be two fuzzy sets over X. Then, A is equal to B if:

A = B ⇔ ∀x ∈ X, μA(x) = μB(x)   (15)

Definition 3: Taking two fuzzy sets A and B over X, A is said to be included in B if:

A ⊆ B ⇔ ∀x ∈ X, μA(x) ≤ μB(x)   (16)

A fuzzy inclusion may be defined using a degree of subsethood. For example, when both fuzzy sets are defined in a finite universe, this degree may be computed as (Kosko, 1992):

S(A, B) = (1/Card(A)) (Card(A) − Σx∈X max{0, μA(x) − μB(x)})   (17)

Definition 4: The support of a fuzzy set A defined over X is a subset of that universe that complies with:

Supp(A) = {x : x ∈ X, μA(x) > 0}   (18)

Definition 5: The α-cut of a fuzzy set A, denoted by Aα, is a classic subset of elements of X whose membership degree is greater than or equal to a specific value α of [0,1]:

Aα = {x : x ∈ X, μA(x) ≥ α, α ∈ [0,1]}   (19)

Definition 6: The Representation Theorem states that any fuzzy set A can be obtained from the union of its α-cuts:

A = ∪α∈[0,1] α·Aα   (20)

where α·Aα denotes the fuzzy set whose membership degree is α for the elements of Aα and 0 elsewhere.

Definition 7: By using the Representation Theorem, the concept of convex fuzzy set can be established as that in which all the α-cuts are convex:

∀x, y ∈ X, ∀λ ∈ [0,1]: μA(λx + (1 − λ)y) ≥ min(μA(x), μA(y))   (21)

This definition means that any point situated between another two will have a membership degree no lower than the minimum of these two points. Figures 7, 8, or 9 are typical examples of convex fuzzy sets, whereas Figure 10 represents a nonconvex fuzzy set.

Definition 8: A concave fuzzy set complies with:

∀x, y ∈ X, ∀λ ∈ [0,1]: μA(λx + (1 − λ)y) ≤ min(μA(x), μA(y))   (22)

Definition 9: The kernel of a fuzzy set A, defined over X, is a subset of that universe that complies with:

Kern(A) = {x : x ∈ X, μA(x) = 1}   (23)

Definition 10: The height of a fuzzy set A defined over X is:

Hgt(A) = supx∈X μA(x)   (24)

Definition 11: A fuzzy set A is normalized if and only if:

∃x ∈ X, μA(x) = Hgt(A) = 1   (25)

Definition 12: The cardinality of a fuzzy set A with finite universe X is defined as:

Card(A) = Σx∈X μA(x)   (26)

If the universe is infinite, the sum must be replaced by an integral defined over the universe.

Membership Function Determination

If the system uses badly defined membership functions, it will not work well, so these functions must be carefully defined. The membership functions can be calculated in several ways. The chosen method will depend on the concrete application, the manner in which the uncertainty is to be represented, and how this one is to be measured during the experiments. The following points give a brief summary of some of these methods (Pedrycz & Gomide, 1998).

1. Horizontal method: It is based on the answers of a group of N "experts."
• The question takes the following form: "Can x be considered compatible with the concept A?"
• Only "Yes" and "No" answers are acceptable, so:

A(x) = (Affirmative Answers) / N   (27)

2. Vertical method: The aim is to build several α-cuts (Definition 5), for which several values are selected for α.
• Now, the question that is formulated for these predetermined α values is as follows: "Can the elements of X that belong to A with a degree that is not inferior to α be identified?"
• From these α-cuts, the fuzzy set A can be identified, using the so-called identity principle or Representation Theorem (Definition 6).

3. Pair comparison method (Saaty, 1980): Supposing that we already had the fuzzy set A over the universe of discourse X of n values (x1, x2, ..., xn), we could calculate the reciprocal matrix M = [aij], a square n×n matrix whose entries compare the membership degrees pairwise, aij = A(xi)/A(xj).   (28)
• This matrix has the following properties: the principal diagonal is always 1, aij·aji = 1 (property of reciprocity), and aij·ajk = aik (transitive property), ∀i, j, k = 1, 2, …, n.
• If we want to calculate the fuzzy set A, the process is reversed: The matrix M is calculated, and then A is calculated from M.
• In order to calculate M, the level of priority or the highest membership degree of a pair of values is numerically quantified: xi with respect to xj. The number of comparisons is n(n − 1)/2. Transitivity is difficult to achieve (the eigenvalue of the matrix is used to measure the consistency of the data, so that if it is very low, the experiments should be repeated).

4. Method based on problem specification: This method requires a numerical function that should be approximated. The error is defined as a fuzzy set that measures the quality of the approximation.

5. Method based on the optimization of parameters: The shape of a fuzzy set A depends on some parameters, indicated by the vector p, which is represented by A(x; p).
• Some experimental results in the form of pairs (element, membership degree) are needed: (Ek, Gk) with k = 1, 2, ..., N.
• The problem consists of optimizing the vector p, for example, minimizing the squared error:

minp Σk=1..N [Gk − A(Ek; p)]²   (29)

6. Method based on fuzzy clustering: This is based on clustering together the objects of the universe in overlapping groups, where the levels of membership to each group are considered as fuzzy degrees. There are several fuzzy clustering algorithms, but the most widely used is the algorithm of "fuzzy ISODATA" (Bezdek, 1981). In this handbook, there is a chapter by Feil and Abonyi explaining some data mining techniques, including fuzzy clustering.
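The set-theoretic notions above (Definitions 3-12 and Kosko's subsethood, Equation 17) reduce to a few lines for discrete fuzzy sets. An illustrative sketch, with set contents and helper names of our own choosing:

```python
# Sketch of Definitions 3-12 for discrete fuzzy sets given as {element: degree}.

def support(A):                 # Definition 4
    return {x for x, mu in A.items() if mu > 0}

def alpha_cut(A, alpha):        # Definition 5
    return {x for x, mu in A.items() if mu >= alpha}

def kernel(A):                  # Definition 9
    return {x for x, mu in A.items() if mu == 1}

def height(A):                  # Definition 10
    return max(A.values())

def cardinality(A):             # Definition 12
    return sum(A.values())

def subsethood(A, B):           # Equation (17), Kosko's degree S(A, B)
    card = cardinality(A)
    penalty = sum(max(0.0, mu - B.get(x, 0.0)) for x, mu in A.items())
    return (card - penalty) / card

A = {"x1": 0.2, "x2": 0.8, "x3": 1.0}
B = {"x1": 0.5, "x2": 0.6, "x3": 1.0}

print(alpha_cut(A, 0.5))            # elements with degree >= 0.5: x2 and x3
print(height(A))                    # 1.0, so A is normalized (Definition 11)
print(round(subsethood(A, B), 2))   # 0.9
```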
Fuzzy Set Operations

Fuzzy set theory generalizes classic set theory. This means that fuzzy sets allow operations of union, intersection, and complement. These and other operations can be found in Pedrycz and Gomide (1998) and Petry (1996), such as concentration (the square of the membership function), dilation (the square root of the membership function), contrast intensification (concentration for values below 0.5 and dilation for the rest), and fuzzification (the inverse operation). These operations can be used when linguistic hedges, such as "very" or "not very," are used.
Union and Intersection: T-conorms and T-norms Definition 13: If A and B are two fuzzy sets over a universe of discourse X, the membership function of the union of the two sets A∪B is expressed by: A∪B
( x) = f ( A( x),
B
( x)), x ∈ X
(30)
where f is a t-conorm (Schweizer & Sklar, 1983).
Introduction and Trends to Fuzzy Logic and Fuzzy Databases
Definition 14: If A and B are two fuzzy sets over a universe of discourse X, the membership function of the intersection of the two sets A∩B, is expressed by: A∩B
( x) = g ( A( x),
B
( x)), x ∈ X
Figure 11. Intersection (minimum) and union (maximum) Intersection
U nion
1
(31)
where g is a t-norm (Schweizer & Sklar, 1983). Both t-conorms (s-norms) and t-norms establish generic models respectively for the operations of union and intersection, which must comply with certain basic properties (commutative, associative, monotonicity, and border conditions). They are concepts derived from Menger (1942) and Schweizer and Sklar (1983), and that have been studied in-depth more recently (Butnario & Klement, 1993). Definition 15: Triangular Norm, t-norm: binary operation, t: [0,1]2 → [0,1] that complies with the following properties: 1. Commutativity: x t y = y t x. 2. Associativity: x t (y t z) = (x t y) t z. 3. ����������������� Monotonicity: If x ≤ y, and w ≤ z then x t w ≤ y t z. 4. Boundary conditions: x t 0 = 0, and x t 1 = x. Definition 16: Triangular Conorm, t-conorm, or s-norm: Binary operation, s: [0,1]2 → [0,1] that complies with the following properties: 1. 2. 3. 4.
Commutativity: x s y = y s x. Associativity: x s (y s z) = (x s y) s z. Monotonicity: If x ≤ y, and w ≤ z then x s w ≤ y s z. Boundary conditions: x s 0 = x, and x s 1 = 1.
The most widely used of this type of function is the t-norm of the minimum and the t-conorm or s-norm of the maximum as they have retained a large number of the properties of the boolean operators, such as the property of idempotency (x t x = x; 10
0
X
x s x = x). In Figure 11, we can see the intersection and union, using respectively the minimum and maximum, of two trapezoid fuzzy sets. There is an extensive set of operators, called t-norms (triangular norms) and t-conorms (triangular conorms), that can be used as connectors for modeling the intersection and union respectively (Dubois & Prade, 1980; Piegat, 2001; Predycz & Gomide, 1998; Yager, 1980). The most important are shown in Tables 1 and 2. A relationship exists between t-norms (t) and t-conorms (s). It is an extension of De Morgan’s Law: x s y = 1 − (1 − x) t (1 − y ) x t y = 1 − (1 − x) s (1 − y )
(32)
When a t-norm and a t-conorm comply with this property, they are said to be conjugated or dual. T-norms and t-conorms cannot be totally ordered from larger to smaller. However, it is easy to identify the largest and the smallest t-norm and t-conorm: the largest and smallest t-norms are respectively the minimum and the drastic product, and the largest and smallest t-conorms are respectively the drastic sum and the maximum. Note that if two fuzzy sets are convex, their intersection will also be convex (but not necessarily their union).
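These operators and the duality of Equation (32) can be illustrated with a short program. The following Python sketch (the numeric grid is ours, not from the text) builds the dual t-conorm of a given t-norm and checks it against the minimum/maximum and product/probabilistic-sum pairs:

```python
# Illustrative sketch: t-norm / t-conorm pairs and their De Morgan duality (Eq. 32).
def t_min(x, y):          # minimum t-norm
    return min(x, y)

def s_max(x, y):          # maximum t-conorm (dual of the minimum)
    return max(x, y)

def t_prod(x, y):         # algebraic product t-norm
    return x * y

def s_sum(x, y):          # probabilistic sum t-conorm (dual of the product)
    return x + y - x * y

def dual_of_tnorm(t, x, y):
    """x s y = 1 - (1 - x) t (1 - y): build the dual t-conorm from a t-norm."""
    return 1 - t(1 - x, 1 - y)

# Check the duality on a small grid of membership degrees.
grid = [i / 10 for i in range(11)]
for x in grid:
    for y in grid:
        assert abs(s_max(x, y) - dual_of_tnorm(t_min, x, y)) < 1e-9
        assert abs(s_sum(x, y) - dual_of_tnorm(t_prod, x, y)) < 1e-9
```

The minimum/maximum pair is the only idempotent one; the product pair illustrates that duality holds for every conjugated pair, not just the standard operators.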
Negations or Complements The notion of the complement can be constructed using the concept of strong negation (Trillas, 1979).
Introduction and Trends to Fuzzy Logic and Fuzzy Databases
Definition 17: A function N: [0,1] → [0,1] is a strong negation if it fulfills the following conditions:

1. Boundary conditions: N(0) = 1 and N(1) = 0.
2. Involution: N(N(x)) = x.
3. Monotonicity: N is nonincreasing.
4. Continuity: N is continuous.
Although there are several types of operators which satisfy such properties or relaxed versions of them, Zadeh's version of the complement (Zadeh, 1965) is mainly used: N(x) = 1 − x. Thus, for a fuzzy set A in the universe of discourse X, the membership function of the complement, denoted by ¬A or by Ā, is:

¬A(x) = 1 − A(x), x ∈ X  (33)
Implication Operators

A fuzzy implication (Dubois & Prade, 1984; Zadeh, 1975) is a function used to compute the fulfillment degree of a rule expressed by IF X THEN Y, where both the antecedent (premise) and the consequent (conclusion) are fuzzy.

Definition 18: A function f: [0,1] × [0,1] → [0,1] is a fuzzy implication, f(x,y) ∈ [0,1], also denoted by x ⇒f y, if it fulfills the following conditions:

1. 0 ⇒f a = 1, ∀a ∈ [0,1]
2. a ⇒f 1 = 1, ∀a ∈ [0,1]
3. 1 ⇒f a = a, ∀a ∈ [0,1]
4. Decreasing (respectively increasing) monotonicity with respect to the first (respectively second) argument.
Sometimes, another condition is added: (x ⇒f (y ⇒f z)) = (y ⇒f (x ⇒f z)). The most important implication functions are shown in Table 3. Note that the Kleene-Dienes implication is based on the classical implication definition (x ⇒ y = ¬x ∨ y); that is, it is a strong implication, using Zadeh's negation and the maximum s-norm. In standard fuzzy set theory, there are basically four models for implication operations (Trillas & Alsina, 2002; Trillas, Alsina, & Pradera, 2004; Trillas, Cubillo, & del Campo, 2000; Ying, 2002): (1) Strong or S-implications (x ⇒f y = N(x) s y), (2) Residuated or R-implications (x ⇒f y = sup{c ∈ [0,1] : x t c ≤ y}), (3) Quantum logic, Q-implications, or QM-implications (x ⇒f y = N(x) s (x t y)), and (4) Mamdani-Larsen or ML-implications (x ⇒f y = φ1(x) t φ2(y), where φ1 is an order automorphism on [0,1] and φ2: [0,1] → [0,1] is a non-null contractive mapping, that is, φ2(w) ≤ w, ∀w ∈ [0,1]). Some of these types of fuzzy implications overlap (for example, the Łukasiewicz implication is both an S-implication and an R-implication). Some applications use implication functions which do not fulfill all the conditions in the previous definition, like the modified Łukasiewicz implication. Besides, it is very usual to use t-norms as implication functions (Gupta & Qi, 1991), obtaining very good results, especially with the minimum (Mamdani implication) and the product t-norms.
Comparison Operations on Fuzzy Sets

Fuzzy sets, defined using membership functions, can be compared in different ways. We now list several methods used to compare fuzzy sets (Pedrycz & Gomide, 1998).

Distance Measures: A distance measure considers a distance function between the membership functions of two fuzzy sets in the same universe, trying to indicate the proximity between the two fuzzy sets. In general, the distance between A and B, defined in the same universe of discourse X, can be defined using the Minkowski distance:

d(A,B) = [∫_X |A(x) − B(x)|^p dx]^{1/p}  (34)

where p ≥ 1 and we assume that the integral exists. Several specific cases are typically used:

1. Hamming Distance (p = 1):

d(A,B) = ∫_X |A(x) − B(x)| dx  (35)
2. Euclidean Distance (p = 2):

d(A,B) = [∫_X (A(x) − B(x))² dx]^{1/2}  (36)

Table 1. t-norm functions: f(x,y) = x t y

• Minimum: f(x,y) = min(x, y)
• Product (algebraic): f(x,y) = x·y
• Drastic product: f(x,y) = x if y = 1; y if x = 1; 0 otherwise
• Bounded product (bounded difference): f(x,y) = max[0, (1+p)(x+y−1) − p·x·y], where p ≥ −1
• Hamacher product: f(x,y) = x·y / [p + (1−p)(x+y−x·y)], where p ≥ 0
• Yager family: f(x,y) = 1 − min(1, [(1−x)^p + (1−y)^p]^{1/p}), where p > 0
• Dubois-Prade family: f(x,y) = x·y / max(x, y, p), where 0 ≤ p ≤ 1
• Frank family: f(x,y) = log_p[1 + (p^x − 1)(p^y − 1)/(p − 1)], where p > 0, p ≠ 1
• Einstein product: f(x,y) = x·y / [2 − (x + y − x·y)]
• Others: f(x,y) = 1 / (1 + [((1−x)/x)^p + ((1−y)/y)^p]^{1/p}), where p > 0; f(x,y) = 1 / [1/x^p + 1/y^p − 1]^{1/p}, where p > 0; f(x,y) = [max(0, x^p + y^p − 1)]^{1/p}, where p > 0
For discrete universes of discourse, the integral is replaced by a sum. The more similar the fuzzy sets are, the smaller the distance between them. It is therefore convenient to normalize the distance function, denoted by dn(A,B), and use it to express the similarity directly by complementation: 1 − dn(A,B).

Equality Indexes: These are based on the logical expression of equality; that is, two sets A and B are equal if A ⊆ B and B ⊆ A. In fuzzy sets, a certain degree of equality can be found. With that, the following expression is defined:

(A ≡ B)(x) = ( [A(x) φ B(x)] ∧ [B(x) φ A(x)] + [¬A(x) φ ¬B(x)] ∧ [¬B(x) φ ¬A(x)] ) / 2  (37)
Table 2. s-norm functions: f(x,y) = x s y

• Maximum: f(x,y) = max(x, y)
• Algebraic sum (sum-product): f(x,y) = x + y − x·y
• Drastic sum: f(x,y) = x if y = 0; y if x = 0; 1 otherwise
• Bounded sum: f(x,y) = min(1, x + y)
• Einstein sum: f(x,y) = (x + y) / (1 + x·y)
• Sugeno family: f(x,y) = min(1, x + y + p·x·y), where p ≥ 0
• Yager family: f(x,y) = min(1, [x^p + y^p]^{1/p}), where p > 0
• Dubois-Prade family: f(x,y) = 1 − (1−x)(1−y) / max(1−x, 1−y, p), where p ∈ [0,1]
• Frank family: f(x,y) = 1 − log_p[1 + (p^{1−x} − 1)(p^{1−y} − 1)/(p − 1)], where p > 0, p ≠ 1
• Others: f(x,y) = [x + y − x·y − (1−p)·x·y] / [1 − (1−p)·x·y], where p ≥ 0; f(x,y) = 1 − max(0, [(1−x)^p + (1−y)^p − 1]^{1/p}), where p > 0; f(x,y) = 1 − 1 / (1 + [(x/(1−x))^p + (y/(1−y))^p]^{1/p}), where p > 0; f(x,y) = 1 − 1 / [1/(1−x)^p + 1/(1−y)^p − 1]^{1/p}, where p > 0
where the conjunction (∧) is modeled by the minimum operation, and the inclusion is represented by the operator φ (phi), induced by a continuous t-norm t:

A(x) φ B(x) = sup{c ∈ [0,1] : A(x) t c ≤ B(x)}  (38)

Taking the bounded product t-norm with p = 0 (Table 1) as an example:

A(x) φ B(x) = 1 if A(x) < B(x); B(x) − A(x) + 1 if A(x) ≥ B(x)

⇒ (A ≡ B)(x) = A(x) − B(x) + 1 if A(x) < B(x); B(x) − A(x) + 1 if A(x) ≥ B(x); that is, (A ≡ B)(x) = 1 − |A(x) − B(x)|  (39)
Three basic methods can be used to obtain a single value ( ∀x ∈ X ):
Table 3. Implication functions: f(x,y) = x ⇒f y

• Kleene-Dienes: f(x,y) = max(1 − x, y)
• Reichenbach (Kleene-Dienes-Łukasiewicz, or Mizumoto): f(x,y) = 1 − x + x·y
• Klir-Yuan: f(x,y) = 1 − x + x²·y
• Gödel: f(x,y) = 1 if x ≤ y; y otherwise
• Rescher-Gaines: f(x,y) = 1 if x ≤ y; 0 otherwise
• Goguen: f(x,y) = 1 if x = 0; min(1, y/x) otherwise (x ≠ 0)
• Łukasiewicz: f(x,y) = 1 if x ≤ y; 1 − x + y otherwise
• Modified Łukasiewicz: f(x,y) = 1 − |x − y|
• Yager: f(x,y) = y^x
• Zadeh: f(x,y) = max(1 − x, min(x, y))
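A few of the implication functions in Table 3 can be sketched directly; this is only an illustration of the table, with example degrees chosen by us:

```python
# Illustrative implementations of four implication functions from Table 3.
def kleene_dienes(x, y):
    return max(1 - x, y)

def godel(x, y):
    return 1.0 if x <= y else y

def lukasiewicz(x, y):
    return 1.0 if x <= y else 1 - x + y

def zadeh(x, y):
    return max(1 - x, min(x, y))

# Fulfillment degree of IF X THEN Y for antecedent 0.8 and consequent 0.4:
x, y = 0.8, 0.4
assert kleene_dienes(x, y) == 0.4
assert godel(x, y) == 0.4
assert abs(lukasiewicz(x, y) - 0.6) < 1e-9
assert zadeh(x, y) == 0.4

# Boundary condition 1 of Definition 18 holds: 0 => a = 1.
assert godel(0.0, 0.3) == 1.0 and lukasiewicz(0.0, 0.3) == 1.0
```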
• Optimistic Equality Index: (A ≡ B)_opt = sup_{x∈X} (A ≡ B)(x)  (40)

• Pessimistic Equality Index: (A ≡ B)_pes = inf_{x∈X} (A ≡ B)(x)  (41)

• Medium Equality Index: (A ≡ B)_avg = (1 / Card(X)) ∫_X (A ≡ B)(x) dx  (42)
Thus, the following relationship is satisfied: ( A ≡ B ) pes ≤ ( A ≡ B ) avg ≤ ( A ≡ B) opt
(43)
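For discrete universes, these indexes reduce to a minimum, maximum, or normalized sum over the pointwise equality degrees. A small sketch using the Łukasiewicz equality degree of Equation (39), with illustrative set values:

```python
# Pointwise equality degree from Eq. (39): 1 - |A(x) - B(x)| (Lukasiewicz residuum).
def eq_degree(a, b):
    return 1 - abs(a - b)

def equality_indexes(A, B):
    """A, B: dicts mapping elements of a discrete universe to membership degrees."""
    degrees = [eq_degree(A[x], B[x]) for x in A]
    pes = min(degrees)                  # pessimistic index (infimum), Eq. (41)
    opt = max(degrees)                  # optimistic index (supremum), Eq. (40)
    avg = sum(degrees) / len(degrees)   # medium index (normalized sum), Eq. (42)
    return pes, avg, opt

A = {1: 0.2, 2: 0.9, 3: 1.0}
B = {1: 0.3, 2: 0.7, 3: 1.0}
pes, avg, opt = equality_indexes(A, B)
assert pes <= avg <= opt                # relationship (43)
```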
Possibility and Necessity Measures: These concepts use fuzzy sets as possibility distributions, where A(x) measures the possibility of being A for each value in X (Zadeh, 1978). Thus, the comparison, that is, the possibility of value A being equal to value B, measures the extent to which A and B superpose each other. It is denoted by Poss(A,B) and defined as:

Poss(A,B) = sup_{x∈X} [min(A(x), B(x))]  (44)

The necessity measure describes the degree to which B is included in A, and it is denoted by Nec(A,B):

Nec(A,B) = inf_{x∈X} [max(A(x), 1 − B(x))]  (45)

In Figures 12 and 13, we can see graphically how these measures are calculated for two concrete fuzzy sets. It can be stated that Poss(A,B) = Poss(B,A). On the other hand, the necessity measure is asymmetrical, Nec(A,B) ≠ Nec(B,A). However, the following relation is fulfilled:
Figure 12. General illustration of the Poss(A,B) concept using the minimum t-norm

Figure 13. General illustration of the Nec(A,B) concept using the maximum t-conorm

Nec(A,B) + Poss(¬A,B) = 1  (46)

Other equivalences are:

Poss(A ∪ B, C) = max{Poss(A,C), Poss(B,C)}  (47)

Nec(A ∩ B, C) = min{Nec(A,C), Nec(B,C)}  (48)

The generalization of the possibility and necessity measures uses triangular t-norms and t-conorms instead of the min and max functions, respectively. If the concept is extended, the possibility of a fuzzy set A (or a possibility distribution) in the universe X can be defined as:

Π(A) = Poss(A,X) = sup_{x∈X} [min(A(x), 1)] = sup_{x∈X} A(x)  (49)

This possibility measures whether or not a determined event (the fuzzy set A) is possible in universe X. It does not measure uncertainty, because if Π(A) = 1, we know that event A is possible, but:

• if Π(¬A) = 1, then the certainty is indeterminate;
• if Π(¬A) = 0, then the occurrence of A is certain.

Therefore, the following two equalities are always satisfied:

• Π(X) = 1 (possibility of an element of the universe).
• Π(∅) = 0 (possibility of an element not in the universe).
Similarly, the necessity of a fuzzy set, N(A), in X can be defined, and then we can set some equivalences between possibility and necessity (see Equations 50 and 51). These equivalences explain why the necessity complements the information about the certainty of event A:

• The greater N(A), the smaller the possibility of the opposite event (¬A).
• The greater Π(A), the smaller the necessity of the opposite event (¬A).
• N(A) = 1 ⇔ ¬A is totally impossible (if an event is totally necessary, then the opposite event is totally impossible).
• Π(A) = 1 ⇔ ¬A is not necessary at all, N(¬A) = 0 (if an event is totally possible, then the opposite event cannot be necessary in any way).
• N(A) = 1 ⇒ Π(A) = 1 (if A is a totally necessary event, then it must be totally possible). Note that the opposite is not satisfied.
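The Poss and Nec measures of Equations (44) through (46) can be checked numerically for discrete fuzzy sets; the memberships below are illustrative:

```python
# Possibility and necessity measures for discrete fuzzy sets
# (dicts over the same universe of discourse).
def poss(A, B):
    return max(min(A[x], B[x]) for x in A)          # sup-min superposition, Eq. (44)

def nec(A, B):
    return min(max(A[x], 1 - B[x]) for x in A)      # inf-max inclusion of B in A, Eq. (45)

A = {1: 0.0, 2: 0.6, 3: 1.0, 4: 0.4}
B = {1: 0.3, 2: 1.0, 3: 0.5, 4: 0.0}
not_A = {x: 1 - A[x] for x in A}

assert poss(A, B) == poss(B, A)                     # possibility is symmetric
assert abs(nec(A, B) + poss(not_A, B) - 1) < 1e-9   # relation (46)
```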
Compatibility Measures: This comparison operation measures the extent to which a certain fuzzy set is compatible with another (defined in the same space). The result is not a single number but a fuzzy set defined in the unit interval [0,1], known as the fuzzy set of compatibility. The compatibility of B with A can be defined as:

Comp(B,A)(u) = sup{B(x) : x ∈ X, A(x) = u}, u ∈ [0,1]  (52)
Set B can be seen as a "fuzzy value" and set A as a "fuzzy concept." Therefore, Comp(B,A) measures the compatibility with which B is A.

Example 2: Let B be the value "approx. 70 years" and A be the concept "very old." Then, the fuzzy set Comp(B,A) is represented in Figure 14 and the fuzzy set Comp(A,B) in Figure 15.

The compatibility measure has the following properties:

• It measures the degree to which B can fulfill concept A. That degree is greater the more similar the fuzzy set Comp(B,A) is to the singleton "1" value (maximum compatibility).
• Supposing A is a normalized fuzzy set: Comp(A,A)(u) = u (linear membership function).
• If A is not normalized, the function will be the same between 0 and the height of set A: if u > Height(A), then Comp(A,A)(u) = 0 (indeterminate).
• If B is a number x (a "singleton" fuzzy set), the result will also be another singleton at the A(x) value:

Comp(B,A)(u) = 1 if u = A(x); 0 otherwise  (53)

• If B is not normalized, the result will not be either; its height is the same as that of set B.
• If Support(A) ∩ Support(B) = ∅, then:
Figure 14. Example 2: Illustration of Comp(B,A)

Figure 15. Example 2: Illustration of Comp(A,B)

Comp(B,A)(u) = Comp(A,B)(u) = 1 if u = 0 (minimum compatibility); 0 otherwise  (54)

• The possibility and necessity measures between A and B are included in the support of Comp(B,A).

In order to have a clearer vision of what this measure means, we can look at the examples shown in Figure 16. We can conclude that a fuzzy set B is more compatible with another set A the closer Comp(B,A) is to 1 and the further it is from 0 (the less area it has).

Figure 16. Three sets (B1, B2, and B3) with the same shape placed in different positions and compared to A

Fuzzy Relations

A classic relation between two universes X and Y is a subset of the Cartesian product X×Y. Like classic sets, a classic relation can be described using a characteristic function. In the same way, a fuzzy relation R is a fuzzy set of tuples. In the case of a binary relation, each tuple has two values.

Definition 19: Let U and V be two infinite (continuous) universes and μR: U × V → [0,1]. Then, a binary fuzzy relation R is defined as:

R = ∫_{U×V} μR(u,v) / (u,v)  (55)

The function μR may be used as a similarity or proximity function. It is important to stress that not all functions are relations and not all relations are functions. Fuzzy relations generalize the concept of relation by allowing the notion of partial belonging (association) between points in the universe of discourse.

Example 3: Take as an example the fuzzy relation in ℜ2 (binary relation), "approximately equal," with the following membership function in X ⊂ ℜ, with X2 = {1,2,3}2: 1/(1,1) + 1/(2,2) + 1/(3,3) + 0.8/(1,2) + 0.8/(2,3) + 0.8/(2,1) + 0.8/(3,2) + 0.3/(1,3) + 0.3/(3,1). This fuzzy relation may be defined as:

x approximately equal to y: μR(x,y) = 1 if |x − y| = 0; 0.8 if |x − y| = 1; 0.3 if |x − y| = 2

where x, y ∈ ℜ. When the universe of discourse is finite, a matrix notation can be quite useful to represent the relation. This example would be shown as:

X2   1     2     3
1    1     0.8   0.3
2    0.8   1     0.8
3    0.3   0.8   1

Definitions of basic operations with fuzzy relations are closely linked to operations on fuzzy sets. Let R and W be two fuzzy relations defined in X × Y:

• Union: (R ∪ W)(x,y) = R(x,y) s W(x,y), using an s-norm s.
• Intersection: (R ∩ W)(x,y) = R(x,y) t W(x,y), using a t-norm t.
• Complement: (¬R)(x,y) = 1 − R(x,y).
• Inclusion: R ⊆ W ⇔ R(x,y) ≤ W(x,y).
• Equality: R = W ⇔ R(x,y) = W(x,y).
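Example 3 and the operations above can be sketched with a relation stored as a dictionary over pairs; the max/min choices for the s-norm and t-norm follow the text, everything else is illustrative:

```python
# "Approximately equal" relation from Example 3 on X = {1, 2, 3}.
X = [1, 2, 3]
levels = {0: 1.0, 1: 0.8, 2: 0.3}
R = {(x, y): levels[abs(x - y)] for x in X for y in X}

# Basic operations on fuzzy relations over the same product space,
# using max as the s-norm and min as the t-norm.
def union(R, W):
    return {p: max(R[p], W[p]) for p in R}

def intersection(R, W):
    return {p: min(R[p], W[p]) for p in R}

def complement(R):
    return {p: 1 - R[p] for p in R}

def included(R, W):
    return all(R[p] <= W[p] for p in R)

assert R[(1, 2)] == 0.8 and R[(1, 3)] == 0.3
assert included(intersection(R, complement(R)), R)   # R ∩ ¬R ⊆ R
assert abs(union(R, complement(R))[(1, 3)] - 0.7) < 1e-9
```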
Fuzzy Numbers

The concept of fuzzy number was first introduced in Zadeh (1975) with the purpose of analyzing and manipulating approximate numeric values, for example, "near 0," "almost 5," and so forth. The concept has been refined (Dubois & Prade, 1980, 1985), and several definitions exist.

Definition 20: Let A be a fuzzy set in X and A(x) its membership function, with x ∈ X. A is a fuzzy number if its membership function satisfies that:

1. A(x) is convex.
2. A(x) is upper semicontinuous.
3. The support of A is bounded.

These requirements can be relaxed. The general form of the membership function of a fuzzy number A with support (a,d) and kernel or modal interval (b,c) can be defined as:

A(x) = rA(x) if x ∈ (a,b); h if x ∈ [b,c]; sA(x) if x ∈ (c,d); 0 otherwise  (56)

where rA, sA: X → [0,1], rA is nondecreasing, sA is nonincreasing, and

rA(a) = sA(d) = 0 and rA(b) = sA(c) = h  (57)

with h ∈ (0,1] and a, b, c, d ∈ X. The number h is called the height of the fuzzy number, and some authors require normalized fuzzy numbers, that is, h = 1. The numbers b − a and d − c are the left and right spreads, respectively. Throughout this study, we will often use the particular case of fuzzy numbers obtained when the functions rA and sA are linear. We will call this type of fuzzy number triangular or trapezoidal, and it takes the form shown in Figure 7. Many applications usually work with normalized trapezoidal fuzzy numbers (h = 1) because these fuzzy numbers are easily characterized using the four really necessary numbers: A ≡ (a, b, c, d).

Figure 17. Graphic representation of the extension principle, where f carries out its transformation from X to Y
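A normalized trapezoidal fuzzy number A ≡ (a, b, c, d) can be sketched as follows; the label "young" and its breakpoints are an assumption for illustration:

```python
# Membership function of a normalized trapezoidal fuzzy number A = (a, b, c, d):
# linear ramp up on (a, b), kernel [b, c] at height 1, linear ramp down on (c, d).
def trapezoid(a, b, c, d):
    def mu(x):
        if b <= x <= c:
            return 1.0
        if a < x < b:
            return (x - a) / (b - a)
        if c < x < d:
            return (d - x) / (d - c)
        return 0.0
    return mu

young = trapezoid(0, 18, 30, 45)   # hypothetical label "young"
assert young(25) == 1.0            # inside the kernel [18, 30]
assert young(37.5) == 0.5          # halfway down the right slope
assert young(50) == 0.0            # outside the support
```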
The Extension Principle One of the most important notions in the fuzzy sets theory is the extension principle, proposed by Zadeh (1975). It provides a general method that allows nonfuzzy mathematical concepts to be extended to the treatment of fuzzy quantities. It is used to transform fuzzy quantities, which have the same or different universes, according to a transformation function between those universes. Let A be a fuzzy set, defined in universe of discourse X and f a nonfuzzy transformation function between universes X and Y, so that f: X → Y. The purpose is to extend f so that it can also operate on the fuzzy sets in X. The result must be a fuzzy set B in Y: B = f(A). This transformation is represented in Figure 17. It is achieved with the use of the Sup-Min composition, which will now be described in a general way in the case of the Cartesian product in n universes. Definition 21: Let X be a Cartesian product of n universes such as X = X1 × X2 × ... × Xn, and A1, A2 , …, An are n fuzzy sets in those n universes respectively. Moreover, we have a function f from X to the universe Y, so a fuzzy set B from Y is
defined by the extension principle as B = f(A1, A2, ..., An), where, with x = (x1, x2, ..., xn):

B(y) = sup_{x ∈ X, y = f(x)} min(A1(x1), A2(x2), ..., An(xn))  (58)
Example 4: Let both X and Y be the universe of natural numbers.

• "Sum 4" function: y = f(x) = x + 4; A = 0.1/2 + 0.4/3 + 1/4 + 0.6/5; B = f(A) = 0.1/6 + 0.4/7 + 1/8 + 0.6/9.
We can conclude that the extension principle allows us to extend any function (for example arithmetic) to the field of fuzzy sets, making possible the fuzzy arithmetic.
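Example 4 can be reproduced with a direct sup-min implementation of the extension principle for a discrete fuzzy set:

```python
# Extension principle (sup-min) for a crisp function f applied to a discrete fuzzy set.
def extend(f, A):
    """A: dict {x: membership}. Returns B = f(A) as a dict {y: membership}."""
    B = {}
    for x, mu in A.items():
        y = f(x)
        B[y] = max(B.get(y, 0.0), mu)   # sup over all x with f(x) = y
    return B

A = {2: 0.1, 3: 0.4, 4: 1.0, 5: 0.6}
B = extend(lambda x: x + 4, A)          # the "sum 4" function of Example 4
assert B == {6: 0.1, 7: 0.4, 8: 1.0, 9: 0.6}
```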
Fuzzy Arithmetic

Thanks to the extension principle (Definition 21), it is possible to extend the classic arithmetical operations to the treatment of fuzzy numbers (see Example 4). In this way, the four main operations are extended as follows:

1. Extended sum: Given two fuzzy quantities A1 and A2 in X, the membership function of the sum A1 + A2 is found using the expression:

(A1 + A2)(y) = sup{min(A1(y − x), A2(x)) : x ∈ X}  (59)

In this way, the sum is expressed in terms of the supremum operation. The extended sum is a commutative and associative operation, and the concept of the symmetrical number does not exist.

2. Extended difference: Given two fuzzy quantities A1 and A2 in X, the membership function of the difference A1 − A2 is found using the expression:

(A1 − A2)(y) = sup{min(A1(y + x), A2(x)) : x ∈ X}  (60)

3. Extended product: The product of two fuzzy quantities A1 * A2 is obtained as follows:

(A1 * A2)(y) = sup{min(A1(y/x), A2(x)) : x ∈ X − {0}} if y ≠ 0; (A1 * A2)(0) = max(A1(0), A2(0))  (61)

4. Extended division: The division of two fuzzy quantities A1 ÷ A2 is defined as follows:

(A1 ÷ A2)(y) = sup{min(A1(x·y), A2(x)) : x ∈ X}  (62)
From these definitions, we can easily conclude that if A1 and A2 have discrete universes with finitely many consecutive terms, n and m respectively, then the number of terms of A1 + A2 and of A1 − A2 is (n−1) + (m−1) + 1, that is, n + m − 1. Based on a particular expression of the extension principle, adapted to the use of α-cuts and to a type of numbers similar to those previously described, called LR fuzzy numbers (Dubois & Prade, 1980), rapid calculus formulae for the previous arithmetical operations have been described. It is important to point out that if we have two fuzzy numbers, their sum or difference will be fuzzier (it will have greater cardinality) than the fuzzier of the two (that which has the greatest cardinality). This is logical: if we add two "approximate" values whose exact values we do not know, the result can be as varied as the initial values are. The same thing happens with multiplication and division, but on a larger scale.
Possibility Theory

This theory is based on the idea of linguistic variables and how these are related to fuzzy sets (Dubois & Prade, 1988; Zadeh, 1978). In this way, we can evaluate the possibility of a given variable X being (or belonging to) a given set A as the membership degree of the elements of X in A.

Definition 22: Let there be a fuzzy set A defined in X with membership function mA(x) and a variable x in X (whose value we do not know). Then, the proposition "x is A" defines a possibility distribution, in such a way that it is said that the possibility of x = u is mA(u), ∀u ∈ X.

The concepts of fuzzy sets and membership functions are now interpreted as linguistic labels and possibility distributions. Instead of membership degrees, we have possibility degrees, but all the tools and properties defined for fuzzy sets are also applicable to possibility distributions.
Fuzzy Quantifiers

Fuzzy or linguistic quantifiers (Liu & Kerre, 1998a, 1998b; Yager, 1983; Zadeh, 1983) have been widely applied in many applications, including database applications (Galindo, 1999; Galindo, Medina, Cubero, & García, 2001). Fuzzy quantifiers allow us to express fuzzy quantities or proportions in order to provide an approximate idea of the number of elements of a subset fulfilling a certain condition, or the proportion of this number in relation to the total number of possible elements. Fuzzy quantifiers can be absolute or relative:

• Absolute quantifiers express quantities over the total number of elements of a particular set, stating whether this number is, for example, "much more than 10," "close to 100," "a great number of," and so forth. Generalizing this concept, we can consider fuzzy numbers as absolute fuzzy quantifiers, in order to use expressions like "approximately between 5 and 10," "approximately −8," and so on. Note that the expressed value may be positive or negative. In this case, the truth of the quantifier depends on a single quantity. For this reason, the definition of absolute fuzzy quantifiers is, as we shall see, very similar to that of fuzzy numbers.

• Relative quantifiers express measurements over the number of elements which fulfill a certain condition, depending on the total number of possible elements (the proportion of elements). Consequently, the truth of the quantifier depends on two quantities. This type of quantifier is used in expressions such as "the majority" or "most," "the minority," "little of," "about half of," and so forth. In this case, in order to evaluate the truth of the quantifier, we need to find the number of elements fulfilling the condition and consider this value with respect to the total number of elements which could fulfill it (including those which fulfill it and those which do not).
Some quantifiers such as “many” and “few” can be used in either sense, depending on the context (Liu & Kerre, 1998a). In Zadeh (1983), absolute fuzzy quantifiers are defined as fuzzy sets in positive real numbers and relative quantifiers as fuzzy sets in the interval [0,1]. We have extended the definition of absolute fuzzy quantifiers to all real numbers. Definition 23: A fuzzy quantifier named Q is represented as a function Q, the domain of which depends on whether it is absolute or relative: Qabs : ℜ → [0,1]
(63)
Qrel : [0,1]→ [0,1]
(64)
where the domain of Qrel is [0,1] because the division a/b ∈ [0,1], where a is the number of elements fulfilling a certain condition, and b is the total number of existing elements.
In order to know the fulfillment degree of the quantifier over the elements that fulfill a certain condition, we can apply the function Q of the quantifier to the quantification value Φ, with Φ = a if Q is absolute and Φ = a/b if Q is relative. There are two very important classic quantifiers: the universal quantifier (for all, ∀) and the existential quantifier (exists, ∃). The first of them is relative and the second one is absolute. They are discretely defined as:

Q∀(x) = 1 if x = 1; 0 otherwise  (65)

Q∃(x) = 0 if x = 0; 1 otherwise  (66)
Some quantifiers (absolute or relative) may have arguments, and in these cases, the function is defined using the arguments (Galindo, Urrutia, & Piattini, 2006). A survey of methods for evaluating quantified sentences and some new methods are shown in the literature (Delgado, Sánchez, & Vila, 1999, 2000), and in this volume, see the chapter by Liétard and Rocacher.
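A relative quantifier such as "most" can be modeled with an increasing function on [0,1]; a hypothetical sketch (the 0.3 and 0.8 breakpoints are our assumptions, not from the text):

```python
# Hypothetical relative quantifier "most": truth 0 below 30%, 1 above 80%,
# linear in between; evaluated on the proportion phi = a / b.
def q_most(phi):
    if phi <= 0.3:
        return 0.0
    if phi >= 0.8:
        return 1.0
    return (phi - 0.3) / 0.5

def evaluate_relative(Q, fulfilling, total):
    return Q(fulfilling / total)       # the quantification value is phi = a / b

assert q_most(0.2) == 0.0
assert evaluate_relative(q_most, 16, 20) == 1.0            # 80% of the elements
assert abs(evaluate_relative(q_most, 11, 20) - 0.5) < 1e-9 # 55%: partially "most"
```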
Fuzzy Databases In the Foreword of this handbook, Vila and Delgado refer to a series of five reports, which are a guideline for the development of research in this area. They state that “one of the lines that appears with more continuity and insistence is the treatment of the imprecise and uncertain information in databases.” Imprecision has been studied in order to elaborate systems, databases, and consequently applications which support this kind of information. Most works which studied the imprecision in information have used possibility, similarity, and fuzzy techniques. If a regular or classical database is a structured collection of records or data stored in a computer, a fuzzy database is a database which is able to deal with uncertain or incomplete information
using fuzzy logic. Basically, a fuzzy database is a database with fuzzy attributes, which may be defined as attributes of an item, row, or object in a database that allow fuzzy information to be stored (Bosc, 1999; De Caluwe & De Tré, 2007; Galindo et al., 2006; Petry, 1996). There are many ways of adding flexibility in fuzzy databases. The simplest technique is to add a fuzzy membership degree to each record, that is, an attribute in the range [0,1]. However, there are other kinds of databases allowing fuzzy values to be stored in fuzzy attributes using fuzzy sets, possibility distributions, or fuzzy degrees associated with some attributes and with different meanings (membership degree, importance degree, fulfillment degree, etc.). Of course, fuzzy databases should allow fuzzy queries using fuzzy or nonfuzzy data, and there are some languages based on SQL (ANSI, 1992; Date & Darwen, 1997) that allow this kind of query, like FSQL (Galindo, 2007; Galindo et al., 2006) or SQLf (Bosc & Pivert, 1995; Goncalves & Tineo, 2006). The research on fuzzy databases has been developed for about 20 years and has concentrated mainly on the following areas:

1. Fuzzy querying in classical databases,
2. Fuzzy queries on fuzzy databases,
3. Extending classical data models in order to achieve fuzzy databases (fuzzy relational databases, fuzzy object-oriented databases, etc.),
4. Fuzzy conceptual modeling tools,
5. Fuzzy data mining techniques, and
6. Applications of these advances in real databases.

All of these issues have been studied in different chapters of this volume, except the fourth item because, in general, there is little interest in fuzzy conceptual issues and, besides, these subjects have been studied in some other works in a very exhaustive manner (Chen, 1998; Galindo et al., 2006; Kerre & Chen, 2000; Ma, 2005; Yazici & George, 1999).
The first research area, fuzzy queries in classical databases, is very useful because currently there are many classical databases. The second item includes the first one, but we prefer to separate them because item 2 raises new problems that must be studied, and because it must be framed in a concrete fuzzy database model (third item). Querying with imprecision, contrary to classical querying, allows the users to use fuzzy linguistic labels (also named linguistic terms) and express their preferences to better qualify the data they wish to get. An example of a flexible query, also named in this context a fuzzy query, would be "list of the young employees working in a department with a big budget." This query contains the fuzzy linguistic labels "young" and "big budget." These labels are words, in natural language, that express or identify a fuzzy set. In fact, the flexibility of a query reflects the preferences of the end user. This is manifested by using a fuzzy set representation to express a flexible selection criterion. The extent to which an object in the database satisfies a request then becomes a matter of degree. The end user provides a set of attribute values (fuzzy labels) which are fully acceptable to the user, and a list of minimum thresholds for each of these attributes. With these elements, a fuzzy condition is built, and the fuzzy querying system ranks the answered items according to their fulfillment degree. Some approaches, the so-called bipolar queries, need both the fuzzy condition (or fuzzy constraint) and the less-compulsory positive preferences or wishes. A very interesting work about bipolar queries may be found in this volume in the chapter by Dubois and Prade. In another chapter, Urrutia, Tineo, and Gonzalez study the two best-known fuzzy querying languages, FSQL and SQLf. Of course, we must reference the interesting and general review of fuzzy querying proposals written by Zadrożny, de Tré, de Caluwe, and Kacprzyk.
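The flexible query above can also be evaluated outside the database; a hypothetical sketch of ranking rows by the degrees of "young" and "big budget" (the labels and the data are invented for illustration):

```python
# Hypothetical evaluation of the fuzzy query "young employees working in a
# department with a big budget"; labels and data are illustrative assumptions.
def young(age):                        # assumed trapezoidal label "young"
    if age <= 30:
        return 1.0
    if age >= 45:
        return 0.0
    return (45 - age) / 15

def big_budget(k_euros):               # assumed label "big budget" (in thousands)
    if k_euros >= 500:
        return 1.0
    if k_euros <= 100:
        return 0.0
    return (k_euros - 100) / 400

employees = [("Ann", 28, 600), ("Bob", 40, 300), ("Eve", 50, 900)]

# Fuzzy AND modeled with the minimum t-norm; rank by fulfillment degree.
ranked = sorted(
    ((name, min(young(age), big_budget(budget))) for name, age, budget in employees),
    key=lambda t: -t[1],
)
assert ranked[0] == ("Ann", 1.0)       # fully satisfies both labels
assert ranked[-1] == ("Eve", 0.0)      # "young" degree 0 rules Eve out
```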
Other chapters about fuzzy queries study different aspects, such as evaluation strategies or quantified statements, for example. About fuzzy data mining issues, this handbook includes a complete review chapter by Feil and
Abonyi, studying the main fuzzy data mining methods. Perhaps the most interesting and useful tools are fuzzy clustering and fuzzy dependencies, and both of them are also studied in other chapters of this handbook. The last item, applications, is also studied in some chapters. These chapters apply different theoretical issues, like data mining, to real contexts with different goals. About the third item, extending classical data models in order to achieve fuzzy databases, this handbook also includes interesting chapters. Ben Hassine et al. study in their chapter how to achieve fuzzy relational databases, giving different methods and addressing their explanations mainly to database administrators and enterprises, in order to facilitate the migration to fuzzy databases. Takači and Škrbić present their fuzzy relational model and propose a fuzzy query language with the possibility of specifying priorities for fuzzy statements. Barranco et al. present a good approach to a fuzzy object-relational database model, whereas some other interesting fuzzy object-oriented database models are presented and summarized elsewhere (De Caluwe, 1997; Galindo et al., 2006). This last book includes fuzzy time data types, and in the book you have in your hands, Schneider defines fuzzy spatial data types. In another chapter, Belohlavek presents an overview of the foundations of formal concept analysis of data with graded attributes, which provides elaborated mathematical foundations for relational data in some fuzzy databases. In this section, we want to give a wide historical point of view, summarizing the main published models aiming at solving the problem of representation and treatment of imprecise information in relational databases. This problem is not trivial because it requires modification of the relation structure, and actually, the operations on these relations also need to be modified.
To allow the storage of imprecise information and inaccurate querying of such information, a wide variety of cases must be handled which do not occur in the classic model without imprecision. The first approaches, which do not utilize fuzzy logic, were proposed by Codd (1979, 1986,
1987, 1990). Then, some basic models were proposed, like the Buckles-Petry model (1982a, 1982b, 1984), the Prade-Testemale model (1984, 1987a, 1987b; Prade, 1984), the Umano-Fukami model (Umano, 1982, 1983; Umano & Fukami, 1994), and the GEFRED model of Medina-Pons-Vila (1994; Galindo, Medina, & Aranda, 1999; Galindo et al., 2001; Medina, 1994).

Imprecision without Fuzzy Logic

In this section, some ideas allowing for imprecise information treatment are summarized, without utilizing either fuzzy set theory or possibility theory. In the bibliography, these models are dealt with globally in the section on imprecision in conventional databases, although some of the ideas discussed here have not been implemented in any of the models. The first attempt to represent imprecise information in databases was the introduction of NULL values by Codd (1979), which was further expanded (Codd, 1986, 1987, 1990). This model did not use fuzzy set theory. A NULL value in an attribute indicates that such a value may be any value of the domain of that attribute. Any comparison with a NULL value yields an outcome that is neither True (T) nor False (F), called "maybe" (m) (or unknown, in the SQL of Oracle). The truth tables of the classical connectives NOT, AND, and OR for this trivalued logic can be seen in Table 4. Later on, another nuance was added, differentiating the NULL value into two marks: the "A-mark," representing an absent or unknown value that is nevertheless applicable, and the "I-mark," representing the absence of the value because it is not applicable (undefined). An I-mark may be situated, for instance, in the car plate attribute of someone who does not have a car. This yields a tetravalued logic, where the A value, having a similar meaning to that of the m in the trivalued logic mentioned above, is generated by comparing any value containing an A-mark, and a new I value is added as the result of comparing any value containing an I-mark. The tetravalued logic is shown in Table 5. In Galindo et al. (2006), some other approaches are summarized, like the "default values" approach by Date (1986), similar to the DEFAULT clause in SQL; the "interval values" approach by Grant (1980), who expands the relational model so that a range/interval of possible values can be stored in one attribute; and statistical and probabilistic databases.
Basic Model of Fuzzy Databases The simplest model of fuzzy relational databases consists of adding a grade, normally in the [0,1] interval, to each instance (or tuple). This keeps database data homogeneity. Nevertheless, the semantic assigned to this grade will determine its usefulness, and this meaning will be utilized in the query processes. This grade may have the meaning of membership degree of each tuple to the relation (Giardina, 1979; Mouaddib, 1994), but it may mean something different, like the dependence strength level between two attributes, thus representing the relation between them (Baldwin, 1983), the fulfillment degree of a condition or the importance degree (Bosc, Dubois, Pivert, & Prade, 1997) of each tuple in the relation, among others. The main problem with these fuzzy models is that they do not allow the representation of imprecise information about a certain attribute of
Table 4. Truth tables for the trivalued logic: true (T), false (F), and maybe (m)

NOT:  ¬T = F   ¬m = m   ¬F = T

AND | T  m  F        OR | T  m  F
----+---------      ----+---------
 T  | T  m  F         T | T  T  T
 m  | m  m  F         m | T  m  m
 F  | F  F  F         F | T  m  F
Table 5. Truth tables for the tetravalued logic: true (T), applicable but unknown (A), inapplicable (I), and false (F)

NOT:  ¬T = F   ¬A = A   ¬I = I   ¬F = T

AND | T  A  I  F        OR | T  A  I  F
----+------------      ----+------------
 T  | T  A  I  F         T | T  T  T  T
 A  | A  A  I  F         A | T  A  A  A
 I  | I  I  I  F         I | T  A  I  I
 F  | F  F  F  F         F | T  A  I  F
a specific entity (like the "tall" or "short" values for a "height" attribute). Besides, the fuzzy character is assigned globally to each instance (tuple), making it impossible to determine the specific fuzzy contribution of each constituting attribute. These problems are solved in the model presented in Galindo et al. (2006), and you can learn about this model in the chapter by Ben Hassine et al. in this handbook.
Similarity Relations Model: Buckles-Petry Model

This is the first model that utilizes similarity relations (Zadeh, 1971) in the relational model. It was proposed by Buckles and Petry (1982a, 1982b, 1984). In this model, a fuzzy relation is defined as a subset of the Cartesian product P(D1) × ... × P(Dm), where P(Di) represents the power set of the domain Di, including all the subsets that can be formed from values of Di (with any number of elements). The data types permitted by this model are finite sets of scalars (labels), finite sets of numbers, and sets of fuzzy numbers. The meaning of these sets is disjunctive; that is, the real value is one of those belonging to the set. Equivalence classes on a domain are constructed from a similarity function or relation, in which the values taken by such a relation are provided by the user. Typically, these similarity values are standardized in the [0,1] interval, where 0 corresponds to "totally different" and 1 to "totally similar." A similarity threshold can be established with a value between 0 and 1, in order to treat the values whose similarity is greater than the threshold as indistinguishable.
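The thresholding idea can be sketched as follows; this is an illustrative example (the domain labels and similarity degrees are invented, not from the Buckles-Petry papers), which merges domain values transitively whenever their similarity reaches the threshold α:

```python
# Illustrative sketch: an alpha-cut of a user-supplied similarity relation
# over a scalar domain, in the spirit of the Buckles-Petry model.
# Domain values and similarity degrees below are invented for the example.

from itertools import combinations

domain = ["excellent", "good", "fair", "poor"]

# Symmetric similarity relation s(x, y) in [0,1]; s(x, x) = 1 by definition.
sim = {
    ("excellent", "good"): 0.8,
    ("excellent", "fair"): 0.4,
    ("excellent", "poor"): 0.0,
    ("good", "fair"): 0.6,
    ("good", "poor"): 0.2,
    ("fair", "poor"): 0.7,
}

def s(x, y):
    return 1.0 if x == y else sim.get((x, y), sim.get((y, x), 0.0))

def alpha_classes(alpha):
    """Merge values whose similarity reaches the threshold alpha
    (transitive merging via a simple union-find over the alpha-cut)."""
    parent = {v: v for v in domain}
    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v
    for x, y in combinations(domain, 2):
        if s(x, y) >= alpha:
            parent[find(x)] = find(y)
    classes = {}
    for v in domain:
        classes.setdefault(find(v), set()).add(v)
    return sorted(map(sorted, classes.values()))

print(alpha_classes(0.75))  # → [['excellent', 'good'], ['fair'], ['poor']]
print(alpha_classes(0.5))   # → [['excellent', 'fair', 'good', 'poor']]
```

Note that the transitive merging makes the α-cut an equivalence relation even when the raw similarity relation is not max-min transitive.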
Possibilistic Models

Under this denomination, models using possibility theory to represent imprecision are included. The most important models in this group are the Prade-Testemale model, the Umano-Fukami model, and the GEFRED model. Another important model is the Zemankova-Kandel model (1984, 1985), which is briefly summarized in Galindo et al. (2006).
Prade-Testemale Model

Prade and Testemale published a fuzzy relational database (FRDB) model that allows the integration of what they call incomplete or uncertain data in the sphere of possibility theory (Prade, 1984; Prade & Testemale, 1984, 1987a, 1987b). Consider an attribute A with domain D. All the available knowledge about the value taken by A for an object x can be represented by a possibility distribution πA(x) over D ∪ {e}, where e is a special element denoting the case in which A does not apply to x. In other words, πA(x) is a mapping from D ∪ {e} to the [0,1] interval. From this formulation, all the value types adopted by this model can be represented. In every possibilistic model, one must take into account that, for a value d ∈ D, πA(x)(d) = 1 just indicates that the value d is totally possible for A(x), not that d is certainly the value of A(x),
Table 6. Representation of information in two possibilistic models

Information: The precise data is known and crisp: c
  Prade-Testemale: πA(x)(e) = 0; πA(x)(c) = 1; πA(x)(d) = 0, ∀ d ∈ D, d ≠ c
  Umano-Fukami: πA(x) = {1/c}

Information: Unknown but applicable
  Prade-Testemale: πA(x)(e) = 0; πA(x)(d) = 1, ∀ d ∈ D
  Umano-Fukami: Unknown (Equation 67)

Information: Not applicable or nonsense
  Prade-Testemale: πA(x)(e) = 1; πA(x)(d) = 0, ∀ d ∈ D
  Umano-Fukami: Undefined (Equation 68)

Information: Total ignorance
  Prade-Testemale: πA(x)(d) = 1, ∀ d ∈ D ∪ {e}
  Umano-Fukami: Null (Equation 69)

Information: Range [m, n]
  Prade-Testemale: πA(x)(e) = 0; πA(x)(d) = 1 if d ∈ [m, n] ⊆ D; πA(x)(d) = 0 otherwise
  Umano-Fukami: πA(x)(d) = 1 if d ∈ [m, n] ⊆ D; πA(x)(d) = 0 otherwise

Information: The available information is a possibility distribution µa
  Prade-Testemale: πA(x)(e) = 0; πA(x)(d) = µa(d), ∀ d ∈ D
  Umano-Fukami: πA(x)(d) = µa(d), ∀ d ∈ D

Information: The possibility that A is not applicable is λ and, in case it is applicable, the data is µa
  Prade-Testemale: πA(x)(e) = λ; πA(x)(d) = µa(d), ∀ d ∈ D
  Umano-Fukami: Without representation
unless this is the only possible value, that is, πA(x)(d′) = 0, ∀ d′ ≠ d. Both the information and its representation in this model are shown in Table 6. How two possibility distributions can be compared was discussed in the Comparison Operations on Fuzzy Sets section earlier in this chapter. In general, the most commonly used measurements are possibility and necessity.
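The representations of Table 6 and the possibility/necessity comparison can be sketched as follows. This is an illustrative example under our own assumptions: the toy domain, the attribute, and the fuzzy condition "middle-aged" are invented, and the necessity measure shown is the standard possibilistic one, not a formula taken from the Prade-Testemale papers:

```python
# Sketch: Prade-Testemale attribute values as possibility distributions over
# D ∪ {e}, compared against a fuzzy query condition with the possibility
# and necessity measures. Domain, attribute, and condition are invented.

E = "e"                      # special element: "attribute not applicable"
D = list(range(20, 71))      # toy numeric domain, e.g., ages 20..70

def crisp(c):
    """Precisely known value c (first row of Table 6)."""
    return {d: (1.0 if d == c else 0.0) for d in D} | {E: 0.0}

def unknown():
    """Unknown but applicable."""
    return {d: 1.0 for d in D} | {E: 0.0}

def undefined():
    """Not applicable / nonsense."""
    return {d: 0.0 for d in D} | {E: 1.0}

def possibility(pi, mu):
    """Poss = sup_d min(pi(d), mu(d))."""
    return max(min(pi[d], mu(d)) for d in D)

def necessity(pi, mu):
    """Nec = inf_d max(mu(d), 1 - pi(d)), also penalized by pi(e)."""
    return min(min(max(mu(d), 1.0 - pi[d]) for d in D), 1.0 - pi[E])

def middle_aged(x):
    # A trapezoidal fuzzy condition on [35, 40, 50, 55].
    if 40 <= x <= 50: return 1.0
    if 35 <= x < 40:  return (x - 35) / 5
    if 50 < x <= 55:  return (55 - x) / 5
    return 0.0

print(possibility(crisp(45), middle_aged))    # → 1.0
print(possibility(unknown(), middle_aged))    # → 1.0 (possible, not certain)
print(necessity(unknown(), middle_aged))      # → 0.0
print(possibility(undefined(), middle_aged))  # → 0.0
```

The unknown-but-applicable value illustrates the asymmetry: the condition is fully possible yet not at all necessary.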
Umano-Fukami Model

This proposal (Umano, 1982, 1983; Umano & Fukami, 1994) also utilizes possibility distributions in order to model imprecise knowledge. In this model, if D is the discourse universe of A(x), then πA(x)(d) represents the possibility that A(x) takes the value d ∈ D. The following kinds of knowledge may be modeled: unknown but applicable information, non-applicable information (undefined), and total ignorance (we do not know whether it is applicable or not):
Unknown: πA(x)(d) = 1, ∀ d ∈ D    (67)

Undefined: πA(x)(d) = 0, ∀ d ∈ D    (68)

Null = {1/Unknown, 1/Undefined}    (69)
For the remaining cases of imprecise information, a model similar to the one above is adopted. The kinds of fuzzy information and their representation in this model are shown in Table 6. Besides, every instance of a relation in this model has a possibility distribution in the [0,1] interval associated with it, indicating the membership degree of that particular instance to the relation. In other words, a fuzzy relation R, with m attributes, is defined by the following membership function:

µR: P(U1) × P(U2) × … × P(Um) → P([0,1])    (70)
where the × symbol denotes the Cartesian product and P(Uj), with j = 1, 2, ..., m, is the collection of all the possibility distributions on the discourse universe Uj of the j-th attribute of R. The function µR associates with every instance of the relation R a value of P([0,1]), which corresponds to all the possibility distributions on the [0,1] interval; this is considered the membership degree of such an instance to R. Finally, in the query process, expressed either in fuzzy or precise terms, the model solves the query problem by dividing the set of instances involved in the relation into three subsets: the first subset contains the instances that completely satisfy the query; the second subset groups those instances that might satisfy the query; and the third subset consists of those instances that do not satisfy the query.
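The three-subset query answer can be sketched as follows. This is not the chapter's exact algorithm but a minimal illustration under our own assumptions (a crisp equality query and invented data), using necessity for "surely satisfies" and possibility for "might satisfy":

```python
# Sketch: dividing instances into the three answer subsets of the
# Umano-Fukami query process for a crisp condition A = query_value,
# using possibility/necessity. Domain and data are invented.

def classify(pi, query_value, domain):
    """Return 'sure', 'maybe', or 'no' for the condition A = query_value."""
    poss = max(min(pi.get(d, 0.0), 1.0 if d == query_value else 0.0)
               for d in domain)
    nec = min(max(1.0 - pi.get(d, 0.0),
                  1.0 if d == query_value else 0.0) for d in domain)
    if nec == 1.0:
        return "sure"    # completely satisfies the query
    if poss > 0.0:
        return "maybe"   # might satisfy the query
    return "no"          # does not satisfy the query

domain = ["red", "green", "blue"]
crisp_red = {"red": 1.0}               # value known to be red
unknown = {d: 1.0 for d in domain}     # unknown but applicable (Eq. 67)
crisp_blue = {"blue": 1.0}

print(classify(crisp_red, "red", domain))    # → sure
print(classify(unknown, "red", domain))      # → maybe
print(classify(crisp_blue, "red", domain))   # → no
```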
The GEFRED Model by Medina-Pons-Vila

The GEFRED model dates back to 1994 and has experienced subsequent expansions (Medina, 1994; Medina et al., 1994; Galindo et al., 1999, 2001, 2006). This model is an eclectic synthesis of some of the previously discussed models. One of its major advantages is that it consists of a general abstraction that allows for the use of various approaches, regardless of how different they might look. As a possibilistic model, it refers particularly to generalized fuzzy domains, thus admitting possibility distributions in the domains, but it also includes the case where the underlying domain is not numeric but consists of scalars of any type. It includes the UNKNOWN, UNDEFINED, and NULL values as well, with the same sense as in the Umano-Fukami model. The GEFRED model is based on the definitions of the Generalized Fuzzy Domain (D) and the Generalized Fuzzy Relation (R), which include classic domains and classic relations, respectively. Basically, a Generalized Fuzzy Domain is the basic domain together with the possibility distributions defined on this domain and the NULL value. All the data types that can be represented are shown in Table 1 in the chapter by Ben Hassine et al.
On the other hand, the Generalized Fuzzy Relations of the GEFRED model are relations whose attributes have a Generalized Fuzzy Domain, and each attribute may be associated with a "compatibility attribute" in which a compatibility degree can be stored. The compatibility degree for an attribute value is obtained by manipulation processes (such as queries) performed on that relation, and it indicates the degree to which that value has satisfied the operation performed on it. The GEFRED model defines fuzzy comparators, which are general comparators based on any existing classical comparator.

Ṽ_A^{C_new} = {(v, µ(v)) | v ∈ V_A^{C_new} ∧ µ(v) > 0}    (20)
Taking into account the corresponding ranges, the fuzzy set for the numeric attribute 'VictimAge: year(Registration) − year(Birthdate)' becomes:

Ṽ_VictimAge^{C_new} = {(31, 0.33), (32, 0.66), (33, 1), (34, 0.66), (35, 0.33)}.
Furthermore, the set V_A^{C_j} is replaced by its fuzzy counterpart:

Ṽ_A^{C_j} = {(x, 1) | x ∈ V_A^{C_j}}    (21)
sim(V_A^{C_new}, V_A^{C_j}) = |Ṽ_A^{C_new} ∩ Ṽ_A^{C_j}| / |Ṽ_A^{C_new} ∪ Ṽ_A^{C_j}|
= Σ_{x ∈ dom_e} min(µ_{Ṽ_A^{C_new}}(x), µ_{Ṽ_A^{C_j}}(x)) / Σ_{x ∈ dom_e} max(µ_{Ṽ_A^{C_new}}(x), µ_{Ṽ_A^{C_j}}(x))    (22)
This calculation is based on the fuzzification of the Jaccard similarity measure as described in (Miyamoto, 2000).
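Equation (22) can be sketched directly; the example below reuses the VictimAge fuzzy set given in the text, while the stored case's value set is invented and fuzzified with membership 1 as in Equation (21):

```python
# Minimal sketch of the fuzzified Jaccard similarity of Equation (22):
# sum of min-memberships over sum of max-memberships.

def fuzzy_jaccard(a, b):
    """a, b: dicts mapping domain elements to membership grades in [0,1]."""
    support = set(a) | set(b)
    inter = sum(min(a.get(x, 0.0), b.get(x, 0.0)) for x in support)
    union = sum(max(a.get(x, 0.0), b.get(x, 0.0)) for x in support)
    return inter / union if union else 0.0

v_new = {31: 0.33, 32: 0.66, 33: 1.0, 34: 0.66, 35: 0.33}  # from the text
v_j = {33: 1.0, 34: 1.0}  # invented stored values, fuzzified per Eq. (21)

print(round(fuzzy_jaccard(v_new, v_j), 3))  # → 0.5
```

Here the numerator is 1 + 0.66 = 1.66 and the denominator 0.33 + 0.66 + 1 + 1 + 0.33 = 3.32, giving 0.5.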
Similarity Between Two Cases

The similarity between two cases C_new and C_j is obtained by the weighted aggregation of the similarities of all 'descriptive' attributes, taking into account the weights that are associated with the attributes. In this chapter, we only consider conjunctive aggregation. Therefore, we can apply an implicator operator for the modeling of the impact of the weights (de Tré, de Caluwe, Tourné, & Matthé, 2003):

f_im∧ : [0,1] × [0,1] → [0,1]
(w_{A:e}, sim(V_A^{C_new}, V_A^{C_j})) ↦ max(1 − w_{A:e}, sim(V_A^{C_new}, V_A^{C_j}))    (23)
Note that with this definition, the semantic conditions for weights as proposed in Dubois et al. (1997) are satisfied. The similarity between the two cases is then obtained by:

sim(C_new, C_j) = min_{A:e ∈ D} f_im∧(w_{A:e}, sim(V_A^{C_new}, V_A^{C_j}))    (24)
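Equations (23)-(24) can be sketched as follows; the attribute names, similarities, and weights are invented for the example:

```python
# Sketch of Equations (23)-(24): a Kleene-Dienes-style implicator models
# the impact of the weights, and the case similarity is the minimum
# (conjunctive aggregation) over all descriptive attributes.

def f_im_and(weight, similarity):
    """Equation (23): max(1 - w, sim). A weight of 0 makes the attribute
    irrelevant (result 1); a weight of 1 passes the similarity through."""
    return max(1.0 - weight, similarity)

def case_similarity(sims, weights):
    """Equation (24): min over attributes of the weighted similarities."""
    return min(f_im_and(weights[a], sims[a]) for a in sims)

sims = {"VictimAge": 0.5, "Region": 0.9, "CrimeType": 0.7}    # invented
weights = {"VictimAge": 1.0, "Region": 0.4, "CrimeType": 0.8}

print(case_similarity(sims, weights))  # → 0.5
```

With these numbers, the weighted similarities are 0.5, 0.9, and 0.7, so the fully weighted VictimAge attribute determines the result.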
Flexible Querying Techniques Based on CBR
Retrieval of Similar Cases

The cases in the database that are similar to the case C_new are retrieved by calculating the similarity:

sim(C_new, C_j), 1 ≤ j ≤ l    (25)

for all stored cases C_j. In order to retrieve only the most similar cases, the user can provide a threshold value τ: only cases for which the similarity is not lower than τ are provided in the result. Finally, the fuzzy set S̃_{C_new} of cases that are similar to the case C_new is obtained by:

S̃_{C_new} = {(C_j, sim(C_new, C_j)) | 1 ≤ j ≤ l ∧ sim(C_new, C_j) ≥ τ}    (26)
If this fuzzy set is empty, a message will be sent to the revision process.
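The retrieval step of Equations (25)-(26), including the empty-result signal to the revision process, can be sketched as follows; the per-case similarities are invented stand-ins for the output of Equation (24):

```python
# Sketch of Equations (25)-(26): retrieve the fuzzy set of stored cases
# whose similarity to the new case reaches the threshold tau.

def retrieve_similar(case_sims, tau):
    """case_sims: {case_id: sim(C_new, C_j)}; returns the fuzzy set (26)."""
    return {cid: s for cid, s in case_sims.items() if s >= tau}

case_sims = {"C01": 0.9, "C02": 0.35, "C03": 0.7, "C04": 0.1}  # invented

similar = retrieve_similar(case_sims, tau=0.5)
print(similar)  # → {'C01': 0.9, 'C03': 0.7}

if not retrieve_similar(case_sims, tau=0.95):
    print("no similar cases: notify the revision process")
```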
Prediction

The prediction process aims to predict the values of the 'predictable' attributes:

P = {A′1 : e′1, A′2 : e′2, ..., A′p : e′p}

of the new case. For the same reason as with 'descriptive' attributes, each 'predictable' attribute A′ has an associated set of actual values for each case C_j, 1 ≤ j ≤ l, stored in the database (cf. Equations [17]-[18]):

V_{A′}^{C_j}    (27)

Furthermore, the prediction process will associate a fuzzy set of predicted values:

Ṽ_{A′}^{C_new}    (28)
with the attribute. Possibility theory is used to determine the possible elements of this set, and thus defines the semantics of the membership grades of Ṽ_{A′}^{C_new} as degrees of uncertainty. Here, the CBR hypothesis is interpreted as (Dubois et al., 1998, 2000): 'the more similar two cases are, the more possible it is that their corresponding "predictable" attribute values are similar.'

In a straightforward approach, Ṽ_{A′}^{C_new} can be obtained by:

Ṽ_{A′}^{C_new} = {(x, µ_{Ṽ_{A′}^{C_new}}(x)) | x ∈ dom_e′ ∧ µ_{Ṽ_{A′}^{C_new}}(x) > 0}    (29)

where

µ_{Ṽ_{A′}^{C_new}}(x) = max_{C : µ_{S̃_{C_new}}(C) > 0 ∧ C[A′] = x} µ_{S̃_{C_new}}(C)
Hereby, C[A′] denotes the actual value of attribute A′ for case C. By using Equation (29), each distinct value of attribute A′ that occurs in a case C belonging to the fuzzy set S̃_{C_new} of similar cases is considered to be a possible value for A′ in C_new. Its degree of possibility is obtained as the maximum of the membership grades µ_{S̃_{C_new}}(C) of all cases C in S̃_{C_new} that have that value as the attribute value for A′. More advanced techniques, which also deal with the similarities between the values of A′ in the similar cases of S̃_{C_new}, can be used here (Dubois et al., 2000). These topics are outside the scope of this chapter.
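Equation (29) can be sketched as follows; the retrieved membership grades and the stored attribute values are invented for the example:

```python
# Sketch of Equation (29): the possibility of each candidate value x for a
# 'predictable' attribute is the maximum membership grade, in the fuzzy set
# of similar cases, over the cases whose actual value equals x.

def predict(similar, actual_values):
    """similar: {case_id: membership grade in the fuzzy set of similar cases};
    actual_values: {case_id: value of attribute A' in that case}.
    Returns the fuzzy set of predicted values as {value: possibility}."""
    prediction = {}
    for cid, grade in similar.items():
        if grade > 0 and cid in actual_values:
            x = actual_values[cid]
            prediction[x] = max(prediction.get(x, 0.0), grade)
    return prediction

similar = {"C01": 0.9, "C03": 0.7, "C06": 0.6}       # invented retrieval result
durations = {"C01": 400, "C03": 500, "C06": 400}     # invented actual values

print(predict(similar, durations))  # → {400: 0.9, 500: 0.7}
```

Value 400 occurs in two similar cases, so its possibility is the maximum of their grades (0.9).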
Revision

The revision process gets input from both the case comparison and the prediction processes. On the one hand, it can occur that the fuzzy set S̃_{C_new} is empty. This means that no cases with similar characteristics are found in the database when considering the given similarity ranges, weights, and threshold value τ. Because too stringent conditions might have caused the empty query result, feedback from the user is necessary. On the other hand, it might also be the case that the predicted values prove to be incorrect, which might be caused by conditions that are too
soft. Therefore, as soon as the actual values of the ‘predictable’ attributes in P become available and are entered in the system, the new values are compared with the predicted values by calculating their similarity. Hereby, the same techniques as in the case comparison process can be applied, but now using the following ranges and weights:
Range_{A′1:e′1}, Range_{A′2:e′2}, ..., Range_{A′p:e′p}

and

w_{A′1:e′1}, w_{A′2:e′2}, ..., w_{A′p:e′p}.

The similarity between the completed 'predicted' case C_completed^new and the case C_new as originally entered in the database is obtained by the following counterpart of Equation (24):

sim(C_completed^new, C_new) = min_{A′:e′ ∈ P} f_im∧(w_{A′:e′}, sim(V_{A′}^{C_completed^new}, V_{A′}^{C_new}))    (30)

If this similarity is lower than the threshold value τ, then the prediction is considered inadequate, and again feedback from the user is necessary. If user feedback is necessary, the process will interact with the user and provide all the information that is available. More specifically, the cause of the interaction will be communicated, together with information about all attribute range values, weights, and (intermediate) results in the calculation of the similarities. By comparing the intermediate results, the process can determine for which descriptive attribute(s) the values have the highest and the lowest similarity. This information, together with the feedback of the user, can help the user decide to adapt (some of) the parameters of the case description process. Alternatively, the parameters can also be adapted automatically by the process. This can be done, for example, by proportionally decreasing or increasing the weight and/or the relative distances in the definition of the similarity range of the attribute that performs best or worst. More details on this are outside the scope of this chapter.

Enhancing Flexible Database Querying

In this section, we describe how the CBR approach presented in the previous section can be applied to enhance flexible querying of (conventional) relational databases.

Flexible Querying

For many years, an emphasis has been put on research that aims to make database systems more flexible and better accessible. An important aspect of flexibility is the ability to deal with imperfections of information, like imprecision, vagueness, uncertainty, or incompleteness. Imperfection of information can be dealt with at the level of data modeling, the level of database querying, or both. The key idea in flexible querying is to introduce preferences inside database queries (Bosc, Kraft, & Petry, 2005). This can be done at two levels: inside elementary query conditions and between query conditions. Preferences inside query conditions allow for expressing that some values are more adequate than others, whereas preferences between query conditions are used to associate different levels of importance with the conditions. To support preferences, query languages like SQL and OQL and their underlying algebraic frameworks have been generalized. Hereby, the possible extensions and flexible counterparts of the algebraic data manipulation operators have been studied (Bosc & Pivert, 1992, 1995; de Tré, Verstraete, Hallez, Matthé, & de Caluwe, 2006; Galindo, Medina, Pons, & Cubero, 1998; Galindo, Urrutia, & Piattini, 2006; Umano & Fukami, 1994; Zadrozny & Kacprzyk, 1996). As the main objective of flexible querying is to refine Boolean conditions, which are either completely true or completely false, it is sufficient that the underlying logical framework supports some notion of 'degree of satisfaction.' Alternatively, an underlying logical framework based on possibility and necessity measures can be used to express certainty about query satisfaction. This approach, as originally presented in Prade and Testemale (1984), does not provide any discussion of the inapplicability of information at the logical level, nor does it offer a formal framework for coping with inapplicability together with other null values, despite the fact that inapplicability is handled with a special domain value (⊥) in the data model. As illustrated in de Tré and de Caluwe (2003), extended possibilistic truth values (EPTVs) can be used to express (un)certainty about query satisfaction in flexible database querying: the EPTV representing the extent to which it is (un)certain that a given database record belongs to the result of a flexible query can be obtained by aggregating the calculated EPTVs that denote the extents to which it is (un)certain that the record satisfies the different criteria imposed by the query. Moreover, the logical framework based on EPTVs extends the approach presented in Prade and Testemale (1984) and explicitly deals with the inapplicability of information during the evaluation of the query conditions: if some part of the query conditions is inapplicable, this will be reflected in the resulting EPTV. An extension of SQL that copes with EPTVs has been described in de Tré et al. (2006).
Extending Flexible Querying Systems with Extra CBR Facilities

In a first approach, a flexible querying system could be extended with a CBR system for instance-based prediction (Dubois et al., 2000). Such an extra facility additionally allows users to examine the database for predicted values for a set of given attributes. Of course, in order to be usable, the underlying CBR hypothesis, 'the more two database entities are similar, the more possible the similarity of associated attribute values,' must hold. By using the facility, a CBR technique as described in the section titled A Flexible CBR Approach for Information Retrieval will be applied. After having initialised the CBR system, the user has to enter the relevant attribute values describing the case under consideration. The unknown attribute values will then be predicted by the prediction process and returned to the user by the CBR system.
Embedding CBR Facilities in a Flexible Querying Language

Rather than being provided as an extra stand-alone facility, CBR can also be embedded in existing (flexible) querying systems. To do this, the query language must be extended with an extra facility, 'PREDICT,' that allows the prediction of the unknown values of specified attributes. Without such a facility, those unknown values will in most systems be represented by the pseudodescription null (Codd, 1979; Vassiliou, 1979). In the next subsections, we describe such a predict facility that could be embedded in a flexible querying language for conventional, relational databases supported by a logical framework of EPTVs.
Flexible Querying Using a Logical Framework of EPTVs

In order to use EPTVs for expressing query satisfaction in flexible querying of regular relational databases, the relational model and relational algebra (Codd, 1972) must be extended with some additional facilities. To start with, the definition of a relation R is extended so that it contains an extra attribute 'Contains : T_EPTV' with a corresponding data type T_EPTV that has EPTVs as allowed values. As such, each relation R_i, 1 ≤ i ≤ r, in a database schema has the following schema:

R_i (A_1 : T_1, A_2 : T_2, ..., Contains : T_EPTV)    (31)
The extra attribute with name ‘Contains’ is used to express the extent to which the tuples of the relation belong to the relation. Hereby, it is implicitly assumed that the schema of a relation corresponds to a predicate and all tuples that belong to the relation are propositions that should not evaluate to false, that is, that have an associated EPTV that differs from {(F, 1)}. If no more information is available, for simplification, it can be assumed that all tuples initially have the value {(T, 1)} as associated with
EPTV. The value {(T, 1)} could, for example, be the default value that is assigned to the tuple on insertion. In a more general approach, users might be allowed to assign their own truth values, hereby expressing that the tuple belongs to the relation, only to the given extent. In order to guarantee the relational closure property of the set of relational algebra operators (Codd, 1972), the definitions of the operators must also be extended such that the results of the queries are also extended relations that have an extra attribute ‘Contains.’ The value of the extra attribute then expresses the extent to which a tuple belongs to the answer set of the query. In fact, the EPTV expresses the certainty about the compatibility of the tuple with the results expected by the user. This certainty is calculated during query processing, as is presented below for the selection, projection, and join operators (de Tré et al., 2006). Illustrative database. The relational database used to illustrate the proposed flexible querying approach is a simplification of the one introduced in the section titled A Flexible CBR Approach for Information Retrieval. It consists of two relations named ‘Victim’ and ‘Complaint,’ as shown in Figure 3. Each tuple in Victim represents information about a victim of some crime for which an official complaint is registered in the database and is characterized by a unique victim ID (VID), which is the primary key attribute, and an age attribute (Age). Each tuple in Complaint represents information
Figure 3. Example of the relations Complaint (CID, VID, Duration, Contains) and Victim (VID, Age, Contains); each tuple initially has the EPTV {(T, 1)} as the value of its Contains attribute, and complaint C04 has the undefined Duration value ⊥Integer.
about a juridical complaint and is characterized by a unique complaint ID (CID), which is the primary key attribute; the corresponding victim ID of the victim (VID), which is a foreign key that refers to the relation Victim; and the total duration of the complaint handling (Duration). The associated domains domT of the considered attributes contain a domain-specific 'undefined' element ⊥T that is used to model cases where a regular domain value is not applicable (cf. Prade & Testemale, 1984). In this way, the attribute domain of Duration contains an element ⊥Integer, which denotes that a regular value for the duration is not applicable; this could be due to the fact that the complaint has been withdrawn, as is the case, for example, for complaint C04.

The selection operation. In relational algebra (Codd, 1972), the selection operation, also called the restriction operation, is written in the following general format:

a WHERE e    (32)
where a denotes a database relation and e is a truth-valued function, also called the restriction condition, whose parameters are some subset of the attributes of a. The selection operation restricts relation a by discarding all tuples of a that do not satisfy e at all, that is, that have a calculated truth value that differs from false. The resulting relation contains the same attributes as relation a. In the proposed extension, the truth-valued function e is further generalised to a function that evaluates to an EPTV. Examples of such functions are the 'IS' function and the generalisations of the comparison operators ('=,' '≠,' etc.). As an illustration, only the definition of the 'IS' function is described below. Definitions for the comparison operators are given in de Tré and de Caluwe (2004). With the understanding that v is the crisp stored value of attribute A and µL is the membership function of the fuzzy set L that represents the values desired by the user, the EPTV of the proposition 'A IS L' is defined by:
{(T, µT / max(µT, µF, µ⊥)), (F, µF / max(µT, µF, µ⊥)), (⊥, µ⊥ / max(µT, µF, µ⊥))}    (33)

where

µT = µL(v)
µF = 1 − µL(v) if v ≠ ⊥T, and µF = 0 if v = ⊥T
µ⊥ = 1 − µL(⊥T) if v = ⊥T, and µ⊥ = 0 if v ≠ ⊥T
With Equation (33), the following is reflected:

• The resulting EPTV must be normalized. This is guaranteed by the division by max(µT, µF, µ⊥).
• If v is inapplicable (v = ⊥T) and the fuzzy set L refers to the value ⊥T, the truth value T is possible to the calculated extent.
• The possibility of the truth value ⊥ is 1 if the attribute does not apply (v = ⊥T) and the fuzzy set label does not refer to the value ⊥T.

In practice, the fuzzy set L can be labeled by a linguistic term. The result set of the selection operation is obtained by evaluating e and by calculating the EPTVs that are associated with the resulting tuples. Hereby, it is important and necessary that the EPTVs of the original relation a are appropriately dealt with. Therefore, the conjunction operator ∧̃, presented in the preliminaries, is applied to the original EPTV and the EPTV that is obtained from the evaluation of e. Only tuples with a resulting EPTV that differs from {(F, 1)} belong to the resulting relation. As an example, consider the following query:

Complaint WHERE Duration IS Long

This query selects all 'Complaint' tuples with a Duration that is compatible with the fuzzy set labeled with the linguistic term 'Long.' The membership function of this fuzzy set is given by:

µLong : domInteger → [0,1]
µLong(x) = 0 if x = ⊥Integer
µLong(x) = 0 if x < 300
µLong(x) = (x − 300) / 300 if 300 ≤ x ≤ 600
µLong(x) = 1 if x > 600

The tuples belonging to the result set of the query are given in Figure 4(a). For every tuple in this result set, the corresponding EPTV is calculated as the conjunction of the EPTV associated with the tuple in the original relation and the EPTV that is obtained by applying the 'IS' function as defined above in Equation (33).

Figure 4. Resulting relations of the considered queries: (a) Complaint WHERE Duration IS Long; (b) Complaint {CID, Duration}; (c) (Complaint WHERE Duration IS Long) JOIN Victim.
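Equation (33) can be sketched as follows; the membership function follows the text, while the sample duration values and the string stand-in for ⊥Integer are our own assumptions:

```python
# Sketch of Equation (33): the EPTV of 'Duration IS Long' for a stored
# (possibly inapplicable) duration value.

UNDEF = "undefined"  # stands for the domain element ⊥Integer

def mu_long(x):
    """Membership function of the linguistic term 'Long' from the text."""
    if x == UNDEF or x < 300:
        return 0.0
    if x <= 600:
        return (x - 300) / 300
    return 1.0

def eptv_is(v, mu, undef=UNDEF):
    """EPTV of 'A IS L' for stored value v and membership function mu."""
    m_t = mu(v)
    m_f = 1.0 - mu(v) if v != undef else 0.0
    m_b = 1.0 - mu(undef) if v == undef else 0.0   # possibility of ⊥
    norm = max(m_t, m_f, m_b)                       # normalization
    return {"T": m_t / norm, "F": m_f / norm, "⊥": m_b / norm}

print(eptv_is(700, mu_long))    # long for sure: {'T': 1.0, 'F': 0.0, '⊥': 0.0}
print(eptv_is(450, mu_long))    # partially long: T and F both fully possible
print(eptv_is(UNDEF, mu_long))  # withdrawn: {'T': 0.0, 'F': 0.0, '⊥': 1.0}
```

Note how the normalization makes at least one truth value fully possible, and how an inapplicable value yields the EPTV {(⊥, 1)}, as for the withdrawn complaint C04.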
EPTVs allow the modeling of the partial satisfaction of a flexible query condition (tuples 'C02' and 'C03'). It might also be the case that the flexible condition is completely satisfied (tuple 'C01') or not satisfied at all (tuple 'C05'); the latter tuple is not in the result set of the query. If some part of the data is not defined, for example because the complaint has been withdrawn (tuple 'C04'), this is explicitly reflected in the associated EPTV.

The projection operation. The algebraic projection operation is written in the following general format (Codd, 1972):

a {X, Y, …, Z}    (34)
where a denotes a database relation and X, Y, …, Z are regular attributes of a. The result of the projection operation is a relation with a heading derived from the heading of a by removing all attributes not mentioned in the set {X, Y, …, Z}, and a body consisting of all tuples of a restricted to the values of the attributes {X, Y, …, Z}. Hereby, repeated tuples are deleted. In the proposed extension, the extra attribute Contains is also added to the resulting relation. For each tuple t in the body of the resulting relation, the corresponding EPTV is calculated as the disjunction of all EPTVs that are associated with the tuples t′ of a whose restriction to {X, Y, …, Z} equals t.
In this way, the relational closure property is guaranteed with respect to the projection operator. As an example, consider the query:

Complaint {CID, Duration}

This query selects all 'Complaint' tuples, but restricts their tuple values to the values of the attributes CID and Duration. The corresponding EPTVs of the original Complaint relation are copied to the resulting relation. The tuples belonging to the result set of the query are given in Figure 4(b).

The join operation. Consider two relations a and b with respective attribute sets {X, Y, {Contains_a}} and {Y, Z, {Contains_b}}, where X = {X1, X2, ..., Xm}, Y = {Y1, Y2, ..., Yn}, and Z = {Z1, Z2, ..., Zp}.
This means that the attributes Y1, Y2, …, Yn are common to the two relations, X1, X2, …, Xm, Contains_a are the other attributes of a and Z1, Z2, …, Zp, Contains_b are the other attributes of b. Contains_a is the extra attribute for the associated EPTVs in a, whereas Contains_b is the extra attribute for the associated EPTVs in b. The algebraic (natural) join operation is written in the following format (Codd, 1972): a JOIN b
(35)
The resulting relation is a relation with heading {X Y Z {Contains}} and a body consisting of all tuples that can be obtained by ‘combining’ tuples of a and b, which have the same values for all attributes in
Flexible Querying Techniques Based on CBR
Y in common. Within the extended approach, the associated EPTV in Contains is calculated by aggregating (combining) the corresponding EPTVs in Contains_a and Contains_b using the conjunction operator ∧̃ that is presented in the preliminaries. As an example, consider the query:

(Complaint WHERE Duration IS Long) JOIN Victim
This query joins the relations (Complaint WHERE Duration IS Long), presented in Figure 4(a), and Victim; the resulting EPTV of each tuple in the result is calculated by applying the conjunction operator ∧̃ to the EPTVs of both tuples that are combined to obtain the resulting tuple. The tuples belonging to the result set of the query are given in Figure 4(c).
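The conjunction ∧̃ (used by the extended join) and the disjunction (used by the extended projection) are defined in the chapter’s preliminaries, which are not reproduced here. The sketch below is an illustrative assumption, not the chapter’s code: it represents an EPTV as a dict of possibility grades over the truth values T, F, and ⊥ (written 'U' below), and combines two EPTVs by applying Zadeh’s extension principle (max–min) to Kleene’s strong three-valued connectives.

```python
# Assumption: EPTVs as dicts {'T': ..., 'F': ..., 'U': ...} of possibility
# grades; conjunction/disjunction obtained via Zadeh's extension principle
# over Kleene's strong three-valued logic. Illustrative sketch only.

def kleene_and(x, y):
    if 'F' in (x, y):
        return 'F'
    if 'U' in (x, y):
        return 'U'
    return 'T'

def kleene_or(x, y):
    if 'T' in (x, y):
        return 'T'
    if 'U' in (x, y):
        return 'U'
    return 'F'

def extend(op, p, q):
    """Combine two EPTVs by the extension principle (max over pairs, min of grades)."""
    out = {}
    for x, mx in p.items():
        for y, my in q.items():
            z = op(x, y)
            out[z] = max(out.get(z, 0.0), min(mx, my))
    return out

# Conjunction, as used by the extended join to merge Contains_a and Contains_b:
print(extend(kleene_and, {'T': 1.0, 'F': 0.25}, {'T': 1.0}))  # {'T': 1.0, 'F': 0.25}

# Disjunction, as used by the extended projection to merge the EPTVs of
# source tuples that collapse onto the same result tuple:
print(extend(kleene_or, {'T': 1.0, 'F': 0.4}, {'T': 1.0}))    # {'T': 1.0}
```

Note that conjunction keeps the larger possibility of F, while disjunction lets a fully true operand absorb the doubt of the other, matching the intuitive behavior of the extended operators.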
The Predict Operation

In order to illustrate the embedding of CBR facilities in a flexible querying system, the approach presented in the previous subsection is extended with an extra operation ‘PREDICT.’ This operation can only be meaningfully applied if the underlying CBR hypothesis “The more two database entities are similar, the more possible the similarity of associated attribute values” holds. In its simplest form, the format of this operation is as follows:

a PREDICT X
(36)
where a denotes a database relation and X is an attribute of a. The result of the predict operation is a new relation that contains the same attributes as relation a. The tuples of the result set are obtained from the tuples of a by replacing any null value that occurs for the attribute X by a predicted value (if this value can be calculated). These predicted values are obtained by applying a CBR technique as described in the section titled A Flexible CBR Approach for Information Retrieval. Because a regular relational database is considered and because the data type of the attribute X does not change, only domain values of the data type of X are allowed as predicted values. Consequently, only one predicted value out of the fuzzy set Ṽ_X^t of predicted values (if not empty) can be completed in the tuple t. In the presented approach, the most possible value is chosen. This is the value x with the maximum associated membership grade in Ṽ_X^t:

Ṽ_X^t(x) = max_{y ∈ dom_X, Ṽ_X^t(y) > 0} Ṽ_X^t(y)   (37)
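In code, the selection expressed by (37) is simply an argmax over the fuzzy set of candidate values. A minimal sketch, assuming the fuzzy set is represented as a dict from domain values to membership grades:

```python
def most_possible(fuzzy_set):
    """Return the value with the maximal membership grade in the fuzzy set
    of predicted values, or None when the set is empty (nothing to predict).
    Ties are broken arbitrarily, as described in the text."""
    if not fuzzy_set:
        return None
    return max(fuzzy_set, key=fuzzy_set.get)

# Example with the fuzzy set of predicted Duration values used for tuple 'C03':
print(most_possible({550: 0.8, 520: 0.6, 583: 0.4}))  # 550
```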
If more than one value has the maximum membership grade, then one of these values is chosen arbitrarily as approximation. Of course, because of the lost information, this is not an ideal situation. When working with a fuzzy database, the fuzzy set Ṽ_X^t could be stored as the value for X, hereby representing the predicted value as adequately as possible, without a loss of information. For each tuple in the result set for which a null value has been replaced by a predicted value x, the associated EPTV is calculated by the conjunction of the EPTV

{(T, Ṽ_X^t(x) / max(Ṽ_X^t(x), 1 − Ṽ_X^t(x))), (F, (1 − Ṽ_X^t(x)) / max(Ṽ_X^t(x), 1 − Ṽ_X^t(x)))}   (38)
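Formula (38) can be transcribed directly; the normalization by max(Ṽ_X^t(x), 1 − Ṽ_X^t(x)) guarantees that at least one of the two possibility grades equals 1. A minimal sketch:

```python
def eptv_for_prediction(mu):
    """EPTV of formula (38) for a predicted value with membership grade mu
    (0 < mu <= 1): possibility grades for T and F, normalized so that the
    larger of the two equals 1."""
    denom = max(mu, 1.0 - mu)
    return {'T': mu / denom, 'F': (1.0 - mu) / denom}

eptv = eptv_for_prediction(0.8)
print(eptv['T'])            # 1.0
print(round(eptv['F'], 2))  # 0.25
```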
and the EPTV that was originally associated with the tuple. Again, the conjunction operator ∧̃ presented in the preliminaries is used for this purpose. For all other tuples, the associated EPTV remains unchanged. As an example, consider the following query as applied on the relation depicted in Figure 5(a):

Complaint PREDICT Duration

This query predicts all null values that occur in the Duration attribute of the relation Complaint. The result of the query is presented in Figure 5(b). All tuples, except the tuple with CID ‘C03,’ remain the same. For tuple ‘C03,’ the original Duration value was null. With the assumption that the fuzzy set of predicted values for ‘C03,’ returned by the CBR approach, is:
Figure 5. An illustration of the PREDICT operation: (a) the relation Complaint; (b) the result of Complaint PREDICT Duration (columns CID, VID, Duration, and Contains).
Ṽ_Duration^C03 = {(550, 0.8), (520, 0.6), (583, 0.4)}
the predicted value of Duration in ‘C03’ becomes 550. With

C03[Duration] = {(T, 0.8/0.8), (F, 0.2/0.8)} = {(T, 1), (F, 0.25)}

the associated EPTV becomes

C03[Duration] ∧̃ {(T, 1)} = {(T, 1), (F, 0.25)} ∧̃ {(T, 1)} = {(T, 1), (F, 0.25)}
A Real-World Application: The Gender Claim Database

Within a juridical context, the availability of and easy access to information regarding similar cases is useful with a view to treating complaints. Such cases can help jurists to detect potential pitfalls in time or to make assessments about future developments. The CBR approach and querying techniques presented in the previous sections can be used to predict future developments with respect to the handling of new complaints entered in a database for gender claim handling. The approach allows one to accommodate future attribute values, like the total duration of the complaint handling and the potential (intermediate) results of the actions undertaken by the jurists. Predictions are obtained by comparing new complaints with similar complaints that are stored in the database. Such a database application, called the ‘gender claim database,’ has been developed for the Belgian
Federal Institute of Equality of Women and Men and is meant to register, to preserve, and to process complaints about direct or indirect discrimination on the basis of gender, harassment (if it relates to the sex of the victim), and unwanted sexual behaviour, which fall under the authority of the institute. The database system is intended to support the way in which the complaints are dealt with as well as the way in which they will be reported to the authorities. A team of jurists is responsible for the complaint handling. The complaint handling system would offer jurists an important surplus value if it could support their database searches for similar cases and could help them make assessments about future developments in the handling of a newly entered complaint. For example, from similar cases, the jurist can learn more about the most likely options for that sort of complaint. Such facilities would be useful because these could help jurists to detect potential pitfalls in time and provide a means for a better exploitation of the database. The underlying idea is that the retrieved information is not intended to replace the knowledge of the jurist, but is additional to it. For that reason, interaction with the jurist is supported and encouraged, especially within the revision process.
Conclusion and Future Trends

In a CBR approach for information retrieval, four main processes can be identified: case description, case comparison, prediction, and revision. In order
to be applicable, the CBR hypothesis that “similar problems have similar solutions” must hold. In the first part of this chapter, it has been described how fuzzy set theory and its related possibility theory can be applied to efficiently deal with the imperfections that are inherent to these processes. For the sake of argumentation, a conventional relational case database has been considered. More specifically, it has been illustrated how the case description process in the case of a relational case database can be made more flexible by providing similarity ranges and weights for the considered attributes. These similarity ranges define the acceptable values for the attributes, whereas the weights denote the relative importance of the attributes within the similarity determination process. It has also been illustrated how a flexible similarity measure for the comparison of two cases can be set up in the case comparison process. This makes sense because case comparisons will seldom result in an exact similarity matching of cases, and fuzzy set theory allows modeling a gradation of similarity of cases. Furthermore, it has also been illustrated how the inevitable uncertainty that occurs when predictions are made can be handled using possibility theory and how the revision process can help to fine-tune the system. Because of the added flexibility, the resulting approach is called a flexible CBR approach.

In the second part of the chapter, it has been explained how a flexible CBR approach can be used to enhance flexible database querying. Two approaches have been distinguished. In the first approach, a flexible querying system is extended with a CBR system for instance-based prediction. In the second approach, CBR is embedded in an existing flexible querying system. For the sake of illustration, such a flexible querying approach for regular relational databases has been presented.
The approach uses a logical framework based on EPTVs and has an embedded CBR-based prediction facility that allows predicting unknown data. To illustrate the practical usefulness of the approach, a real-world application for information retrieval in a juridical database for gender-claim handling has been briefly introduced.
Future work will focus on the further enhancement and development of the presented techniques. Among others, the incorporation of text retrieval mechanisms, alternative aggregation techniques for the comparison process, more advanced techniques for value prediction, and the further (semi-)automation of the revision process will be studied. Another field of ongoing research is the generalization of the approach towards ‘fuzzy’ databases, that is, databases that can contain imperfect (imprecise, vague, incomplete, or uncertain) data.
References

Aamodt, A., & Plaza, E. (1994). Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7(1), 39-59.

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Essex, UK: ACM Press/Addison-Wesley.

Bosc, P., Kraft, D., & Petry, F. E. (2005). Fuzzy sets in database and information systems: Status and opportunities. Fuzzy Sets and Systems, 153(3), 418-426.

Bosc, P., & Pivert, O. (1992). Some approaches for relational databases flexible querying. International Journal on Intelligent Information Systems, 1, 323-354.

Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17.

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6).

Codd, E. F. (1972). Relational completeness of data base sublanguages. In R. J. Rustin (Ed.), Data base systems. Englewood Cliffs, NJ: Prentice Hall.

Codd, E. F. (1979). RM/T: Extending the relational model to capture more meaning. ACM Transactions on Database Systems, 4(4).
de Calmès, M., Dubois, D., Hüllermeier, E., Prade, H., & Sedes, F. (2003). Flexibility and fuzzy case-based evaluation in querying: An illustration in an experimental setting. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 11(1), 43-66.

de Cooman, G. (1995). Towards a possibilistic logic. In D. Ruan (Ed.), Fuzzy set theory and advanced mathematical applications (pp. 89-133). Boston: Kluwer Academic Publishers.

de Cooman, G. (1999). From possibilistic information to Kleene’s strong multi-valued logics. In D. Dubois, E. P. Klement, & H. Prade (Eds.), Fuzzy sets, logics and reasoning about knowledge (pp. 315-323). Boston: Kluwer Academic Publishers.

de Tré, G. (2002). Extended possibilistic truth values. International Journal of Intelligent Systems, 17, 427-446.

de Tré, G., & de Caluwe, R. (2003). Modeling uncertainty in multimedia database systems: An extended possibilistic approach. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 11(1), 5-22.

de Tré, G., & de Caluwe, R. (2004). Towards more flexible database systems: A logical framework based on extended possibilistic truth values. In Proceedings of the 15th International Workshop on Database and Expert Systems Applications DEXA 2004 (pp. 900-904), Zaragoza, Spain.

de Tré, G., de Caluwe, R., Tourné, K., & Matthé, T. (2003). Theoretical considerations ensuing from experiments with flexible querying. In Proceedings of the 10th International Fuzzy Systems Association (IFSA) World Congress (pp. 388-391), Istanbul, Turkey.

de Tré, G., Verstraete, J., Hallez, A., Matthé, T., & de Caluwe, R. (2006). The handling of select-project-join operations in a relational framework supported by possibilistic logic. In Proceedings of the 11th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU) (pp. 2181-2188), Paris, France.
Dubois, D., Esteva, F., Garcia, P., Godo, L., Lopez de Mantaras, R., & Prade, H. (1998). Fuzzy set modeling in case-based reasoning. International Journal of Intelligent Systems, 13, 345-373.

Dubois, D., Fargier, H., & Prade, H. (1997). Beyond min aggregation in multicriteria decision: (Ordered) weighted min, discri-min and leximin. In R. R. Yager & J. Kacprzyk (Eds.), The ordered weighted averaging operators: Theory and applications (pp. 181-192). Boston: Kluwer Academic Publishers.

Dubois, D., Hüllermeier, E., & Prade, H. (2000). Flexible control of case-based prediction in the framework of possibility theory. Lecture Notes in Artificial Intelligence, 1898, 61-73. Berlin/Heidelberg: Springer-Verlag.

Dubois, D., & Prade, H. (1988). Possibility theory. New York: Plenum Press.

Dubois, D., & Prade, H. (Eds.). (2000). Fundamentals of fuzzy sets. Dordrecht, The Netherlands: Kluwer Academic Publishers Group.

Ellman, J. (1995). An application of case based reasoning to object-oriented database retrieval. In Proceedings of the 1st UK Case Based Reasoning Workshop, Salford, UK.

Faltings, B. (1997). Probabilistic indexing for case-based prediction. Lecture Notes in Artificial Intelligence, 1266, 611-622. Berlin/Heidelberg: Springer-Verlag.

Galindo, J., Medina, J., Pons, O., & Cubero, J. (1998). A server for fuzzy SQL queries. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible querying and answering systems (pp. 164-174). Dordrecht: Kluwer Academic Publishers.

Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing.

Miyamoto, S. (2000). Fuzzy multisets and their generalizations. Lecture Notes in Computer Science, 2235, 225-236. Berlin/Heidelberg: Springer-Verlag.
Pedrycz, W., & Gomide, F. (1998). An introduction to fuzzy sets: Analysis and design. The MIT Press.

Plaza, E., Esteva, F., Garcia, P., Godo, L., & López de Màntaras, R. (1996). A logical approach to case-based reasoning using fuzzy similarity relations. Information Sciences, 106, 105-122.

Prade, H. (1982). Possibility sets, fuzzy sets and their relation to Lukasiewicz logic. In Proceedings of the 12th International Symposium on Multiple-Valued Logic (pp. 223-227).

Prade, H., & Testemale, C. (1984). Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences, 34, 115-143.

Rescher, N. (1969). Many-valued logic. New York: McGraw-Hill.

Richter, M. M. (1995). On the notion of similarity in case-based reasoning. In G. della Riccia, R. Kruse, & R. Viertl (Eds.), Mathematical and statistical methods in artificial intelligence (pp. 171-184). Heidelberg: Springer-Verlag.

Richter, M. M. (2006). Modeling uncertainty and similarity-based reasoning: Challenges. In Workshop Proceedings of the 8th European Conference on Case-Based Reasoning ECCBR 2006 (pp. 191-199), Ölüdeniz/Fethiye, Turkey.

Shimazu, H., Kitano, H., & Shibata, A. (1993). Retrieving cases from relational databases: Another stride towards corporate-wide case-base systems. In Proceedings of the 13th International Joint Conference on Artificial Intelligence IJCAI (pp. 909-915), Chambéry, France.

Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-27.

Vassiliou, Y. (1979). Null values in data base management: A denotational semantics approach. In Proceedings of the Special Interest Group on Management of Data SIGMOD Conference (pp. 162-169).

Yager, R. R. (1997). Case-based reasoning, fuzzy systems modeling and solution composition. In Proceedings of the Case-Based Reasoning Research and Development Second International Conference ICCBR-97 (pp. 633-643), Rhode Island, USA.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338-353.

Zadeh, L. A. (1975). The concept of linguistic variable and its application to approximate reasoning (parts I, II, and III). Information Sciences, 8, 199-251, 301-357; 9, 43-80.

Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1), 3-28.

Zadrozny, S., & Kacprzyk, J. (1996). FQUERY for Access: Towards human consistent querying user interface. In Proceedings of the 1996 ACM Symposium on Applied Computing (SAC) (pp. 532-536), Philadelphia, PA.
Key Terms

Case Based Reasoning: Case based reasoning (CBR) is a methodology where new problems are solved by investigating, adapting, and reusing solutions to a previously solved, similar problem. Hereby, knowledge is deduced from the characteristics of a collection of past cases, rather than induced from a set of knowledge rules that are stored in a knowledge base.

Case Comparison: This CBR process is responsible for the retrieval of cases in the database that are adequately similar to the case for which some values need to be predicted. Hereby, the fuzzy preferences and/or fuzzy conditions provided in the case description process are applied.

Case Description: In this CBR process, it is identified how the cases are structured and can
be extracted from the database. Furthermore, the parameters required for case comparison purposes are set. For example, in case of a fuzzy case comparison, the fuzzy preferences and/or fuzzy conditions are specified.

Database: A collection of persistent data. In a database, data are modeled in accordance with a database model. This model defines the structure of the data, the constraints for integrity and security, and the behavior of the data.

Flexible Querying: Searching for data in a database is called querying. Modern database systems provide a query language to support querying. Relational databases are usually queried using SQL (Structured Query Language). Regular database querying can be made more user friendly by applying techniques for self-correction of syntax and semantic errors, database navigation, or “indirect” answers like summaries, conditional answers, and contextual background information for (empty) results. This is called flexible querying. A special subcategory of flexible querying techniques is based on the introduction of fuzzy preferences and/or fuzzy conditions in queries. This is sometimes called fuzzy querying.

Flexible Querying Techniques Based on CBR: CBR techniques can be used for flexible database querying purposes. More specifically, CBR techniques can be used for instance-based
prediction with which unknown data values can be approximated. Hereby, four main processes can be identified: case description, case comparison, prediction, and revision.

Prediction: Based on the data in similar cases, a prediction model is built for each of the unknown data values that must be predicted. Each prediction model represents the predicted approximation of the unknown value. These models are forwarded to the user and to the revision process.

Relational Database: A relational database is a database that is modeled in accordance with the relational database model. In the relational database model, the data are structured in relations that are represented by tables. The behavior of the data is defined in terms of the relational algebra, which originally consists of eight operators (union, intersection, difference, cross product, join, selection, projection, and division), or in terms of the relational calculus, which is of a declarative nature.

Revision: This process is activated when no similar cases are found in the case comparison or when the actual values for the attributes involved in the prediction process become available. The latter typically occurs when the case has been further processed by the users and the new data have been entered in the database. All extra information is processed. Eventually, a request to modify the parameter settings is generated and sent to the case description process.
Chapter VIII

Customizable Flexible Querying in Classical Relational Databases

Gloria Bordogna
CNR IDPA, Italy

Giuseppe Psaila
University of Bergamo, Italy
Abstract

In this chapter, we present the Soft-SQL project whose goal is to define a rich extension of SQL aimed at effectively exploiting flexibility offered by fuzzy sets theory to solve practical issues when querying classic relational databases. The Soft-SQL language is based on previous approaches that introduced soft conditions on tuples in the classical relational database model. We retain the main features of these approaches and focus on the need to provide tools allowing users to directly specify the context dependent semantics of soft conditions. To this end, Soft-SQL provides a command (named CREATE TERM-SET) to define the semantics of linguistic values with respect to a context represented by a linguistic variable (Zadeh, 1975); the SELECT command is extended in order to support soft predicates based on the user defined term sets, the semantics of grouping and aggregation can be modified, and finally, the clauses in the SELECT command can be combined effectively.
Introduction

The need to flexibly query relational databases has been widely recognized as a means to improve the effectiveness of the retrieval in current systems using SQL for expressing information needs. The main inadequacy of the SQL language is caused by the crisp algebra on which it is founded, which does not support the ranking of the results with respect to their relevance to user needs. In this book, the chapter by Kacprzyk et al. provides an extensive survey on flexible querying approaches.
For many categories of users, the possibility to express tolerant conditions and to retrieve discriminated answers in decreasing order of relevance can greatly simplify users’ tasks that generally are performed through a sequence of trial and error phases. The problem of false drops when querying databases by specifying crisp selection conditions is well known. Several approaches have been proposed either based on preference specifications or on soft conditions tolerating degrees of undersatisfaction to overcome this drawback of SQL language use (Bosc & Pivert, 1992; Dubois
& Prade, 1997; Eduardo, Goncalves, & Tineo, 2004; Kießling, 2002, 2003; Petry, 1996; Rosado, Ribeiro, Zadrozny, & Kacprzyk, 2006; Tineo, 2000). The foundations of Kießling for preferences in databases are the basis for an intuitive valuation of search results in which it is assumed that people naturally express their requests in terms like “I like A better than B.” All these preferences can be formulated as strict partial orders. Based on this formulation, the Preference SQL language (Kießling, 2002, 2003) has been defined as an extension of SQL. Several built-in base preference types, combined with the adherence to declarative SQL programming style, guarantees great programming productivity. Further, the Preference SQL optimizer does an efficient rewriting into standard SQL. Another approach for the specification of preferences in queries is based on soft constraints, that is, tolerant selection conditions formalized within fuzzy set theory (Zadeh, 1965). Several extensions of SQL to allow the specification of soft selection conditions in queries have been proposed. A rich taxonomy that helps in understanding the various proposals of extension of SQL by fuzzy set theory is outlined in Rosado et al. (2006). In Dubois and Prade (1997), two reasons for using fuzzy set theory (Zadeh, 1965) to make querying more flexible are discussed. First, fuzzy sets provide a better representation of the user’s preferences.
One reason is that users feel much more comfortable using linguistic terms instead of precisely specified numerical constraints when expressing in a query some condition such as when asking for some hotel “not too expensive and not too far from the beach.” Furthermore, the semantics of these linguistic terms can be exactly “precisiated” (i.e., after Zadeh, 1999, defined as a function on the basic domain of a variable) by fuzzy sets (Zadeh, 1965) so that we can have a price definitely matching or definitely not matching the user’s request, but also a price that matches to a certain degree. The second reason is that a direct consequence of having a matching degree is that answers can be ranked according to users’ requirements. Furthermore,
the possibility to “precisiate” the semantics of the linguistic terms makes it possible to implement mechanisms that offer users a full control on the semantics of their flexible queries (Bordogna & Psaila, 2005). According to many authors (Bosc & Pivert, 1992, 1995; Kacprzyk & Zadrozny, 1995; Medina, Pons, & Vila, 1994; Petry, 1996), there are two main lines of research in the use of fuzzy set theory in the database management system (DBMS) context. The first one assumes a conventional database and, essentially, develops a flexible querying interface using fuzzy sets, possibility theory, fuzzy logic, and so forth (Bosc & Pivert, 1992, 1995; Bosc, Buckles, Petry, & Pivert, 1999; Dubois & Prade, 1997; Galindo, Medina, Cubero, & García, 2000; Goncalves & Tineo, 2003, 2005; Kacprzyk, Zadrozny, & Ziolkowski, 1989; Ribeiro & Moreira, 1999; Tahani, 1977; Takahashi, 1991, 1995; Tineo, 2000). In the chapter of this book by Urrutia et al., a review of two extensions of SQL, namely FSQL (Galindo, Urrutia, & Piattini, 2006) and SQLf (Bosc & Pivert, 1995) is presented. The second line of research uses fuzzy or possibilistic elements for developing a fuzzy database model to manage imprecise and vague data (Bosc & Prade, 1994; Umano & Fukami, 1994). Also in this case querying constitutes an important element of the model (Baldwin, Coyne, & Martin, 1993; Bosc & Pivert, 1997a, 1997b; Buckles & Petry, 1985; Buckles, Petry, & Sachar, 1986; Galindo, Medina, & Aranda, 1999; Galindo, Medina, Pons, & Cubero, 1998; Galindo et al., 2006; Prade & Testemale, 1984, 1987; Shenoi, Melton, & Fan, 1990). For a description of a fuzzy extension of SQL that works on crisp and fuzzy relations, see Galindo et al. (2006). However, these proposals missed addressing some practical and not negligible aspects related to the effective usage of the flexible query language.
Mainly, they do not exploit one of the main features offered by fuzzy set modeling, that is, the possibility to “precisiate” the semantics of linguistic terms (Zadeh, 1999) used in the flexible queries, thus not making users capable of having
full control of the semantics of the soft conditions they use. In Bordogna and Psaila (2004, 2005) and Galindo et al. (2006), attempts have been made in this direction. However, these proposals do not focus on the need for specific tools to define context dependent linguistic predicate semantics, so as to adapt the query language to the application context. Let us think about the many interpretations of the term close when used in the context of a spatial database: it can vary depending on the scale of the map (close on a map with a scale 1:1000 vs. close on a scale 1:10.000), on the entities to which it is applied (close between countries vs. close between cities), or on the database itself (close in a cadastral database vs. close in an astronomical database). This consideration is valid for many terms, such as cheap, that has a different meaning when buying a ticket for a theater performance or a ticket for a cinema. It may also depend on the intention of the query: cheap for a house to buy is different with respect to cheap for a house to rent. The context of usage of a linguistic term heavily determines its meaning. The proposals defined so far usually assume that linguistic predicates are somehow defined “a priori” outside the query language (usually an extension of the classical SQL SELECT command). Even when user defined fuzzy predicates can be specified, like in SQLf, there are no specific commands in the query language itself to customize the meaning of the terms. Further, the meaning of the fuzzy predicates is fixed once and for all and cannot be modified depending on the context of its usage. This leads to SQL extensions that are hardly useful from a practical point of view since they do not provide direct means to the users for explicitly changing or customizing the semantics of linguistic predicates according to the user needs and indications. The proposal in this chapter is about flexible querying in conventional relational databases.
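The context dependence of a term like close can be modeled by parameterizing its membership function. The hypothetical sketch below scales a single definition of close by a context-supplied reference distance; the function name, shape, and all numbers are illustrative assumptions, not part of Soft-SQL:

```python
def close(distance_km, reference_km):
    """Degree to which distance_km counts as 'close' in a context whose
    reference distance is reference_km: fully close up to the reference,
    linearly decreasing, and not close at all beyond twice the reference.
    Illustrative membership function only."""
    if distance_km <= reference_km:
        return 1.0
    if distance_km >= 2 * reference_km:
        return 0.0
    return (2 * reference_km - distance_km) / reference_km

# The same linguistic term, two contexts: cities vs. countries.
print(close(30, reference_km=20))   # cities: 0.5
print(close(30, reference_km=500))  # countries: 1.0
```

The point of commands like CREATE TERM-SET is precisely to let the user bind such context-dependent definitions inside the query language instead of hard-coding them in the application.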
We addressed practical issues to define an effective flexible query language, specifically:

• The problem to extend SQL so as to allow users to define the semantics of their linguistic predicates;
• To allow a contextualization of the meaning of the defined linguistic terms so as to automatically modify their semantics depending on the context;
• Finally, to allow the specification of flexible queries by extending the SQL language, drawing on the experience of previous proposals and in particular of SQLf (Bosc & Pivert, 1995).

These objectives are achieved by defining Soft-SQL, an extension of the SQL language for customized flexible querying in classic relational databases. Its main and distinguishing features are the following:

• Queries operate on standard relations and produce standard relations as a result. The attribute “membership degree” of a tuple, which can be used to rank the items reflecting their degree of satisfaction of the query conditions, is dealt with as any other attribute; this allows closure to be achieved;
• Specific commands are provided to define sets of linguistic values (the command named CREATE TERM-SET), sets of linguistic quantifiers for groups of tuples, and complex selection conditions (the command named CREATE LINGUISTIC QUANTIFIER) and their semantics;
• Furthermore, mechanisms for allowing the easy and transparent contextualization of the linguistic values are defined. The SELECT command is extended in order to support context dependent soft predicates based on the user defined term sets, to modify the semantics of grouping and aggregation based on basic and user defined quantifiers and, finally, to effectively combine some or all the clauses in the SELECT command to achieve a high degree of flexibility and effectiveness.
Examples of usage of these commands will be provided.
Background of the Proposed Soft-SQL

The starting point of our Soft-SQL is the approach defined for the extension of SQL in the conventional relational data model within fuzzy set theory, named SQLf, based on soft conditions on attribute values (Bosc & Pivert, 1995). In SQLf, a soft condition is expressed by means of a linguistic predicate represented by a fuzzy set. SQLf has been defined by extending the relational algebra so as to operate on fuzzy relations. The introduction of soft conditions in SQL and the relational database model is achieved by generalizing a relation r, defined as a subset of D = D1 × D2 × ... × Dn, to a fuzzy relation rf, defined as a fuzzy subset of D; that is, each tuple d of rf is associated with a membership degree µrf(d) in [0,1]. µrf(d) is interpreted as the degree of satisfaction of the linguistic predicates in the query. In order to satisfy the closure property, a regular relation can be seen as a kind of fuzzy relation in which all tuples have the same membership degree, equal to 1. For the formal definitions, refer to Bosc and Pivert (1995). This is the first difference with respect to our proposal: Soft-SQL works on regular relations. A basic block SQLf query first allows specifying a calibration mechanism for the query result in order to control the number of desired items (tuples). This can be done by specifying the maximum desired number N of tuples, a minimum threshold T that each tuple's membership degree must exceed to be included in the result, or both of these values. Further, the basic SQLf (Bosc & Pivert, 1995) query allows specifying soft conditions in the WHERE clause as follows:

select [N | T | N, T] (attributes)
from (relations)
where (fuzzy condition);
in which (fuzzy condition) can involve fuzzy and Boolean basic conditions at the same time, linked by connectors (AND, OR, or even a linguistic
quantifier such as most). The use of fuzzy quantifiers to define compound selection conditions has been proposed also in other fuzzy extensions of the SQL language (Galindo et al., 2000; Kacprzyk & Zadrozny, 1997; Kacprzyk & Ziolkowski, 1986; Tineo, 2000). Also, fuzzy joins are possible, allowing multiblock queries. In most applications, the membership functions of linguistic predicates such as big and cheap are defined by trapezoidal functions. These definitions are coded in the application. Soft compound conditions can be expressed in the form of logical expressions of elementary conditions, for example, "big AND cheap" or "cheap AND close to the center," represented by fuzzy set operations, or by elementary conditions aggregated by a linguistic quantifier, such as "most of (cheap, close to the center)." To illustrate, consider the table FLAT, which describes flats in cities and their properties:

FLAT(Id: Integer, NumberOfRooms: Integer, City: String, Inhabitants: Integer, DistanceFromCenter: Float, Price: Float)
An example of an SQLf query on the relation FLAT is:

SELECT 5, 0.6 C.Id
FROM FLAT AS C
WHERE most of (C.Price IS cheap, C.NumberOfRooms IS big, C.DistanceFromCenter IS close to the center);
in which the soft conditions are expressed by "IS linguistic term." This imposes the evaluation of the degree of satisfaction of the linguistic predicates (for example, IS cheap) against the values of the corresponding attributes, such as C.Price of relation FLAT. The intermediate fuzzy relation resulting from the evaluation of the soft condition is projected on the attribute C.Id, and the best five tuples above the threshold 0.6 are returned to the user as the result. The linguistic quantifier most of is evaluated once all the
single soft conditions in parentheses have been evaluated; it aggregates their degrees of satisfaction to produce an overall degree that is returned as the ranking degree of the tuple. Notice that in this fuzzy extension of SQL, linguistic values such as cheap are specified independently of the context to which they are applied. This can create problems, since the semantics of some linguistic values may depend on the context: for example, the notion of a cheap flat in Milan is not the same as in Paris or New York. Further, users cannot explicitly define the semantics of the linguistic values, since no SQLf command is defined specifically for this purpose. This is a severe limitation of SQLf that makes it inadequate from a practical point of view: users may want to control and modify the semantics of linguistic predicates when querying a database, as will be discussed in the next section. In SQLf, it is also possible to express queries that work on sets of tuples using the GROUP BY and HAVING clauses. This type of query allows expressing queries involving aggregate functions (MIN, MAX, SUM, etc.). In SQL, each partition groups tuples that have the same value on the grouping attribute(s). This functionality is retained in SQLf, where the HAVING clause can be used along with a fuzzy set condition aimed at the selection of partitions. In this respect, various conditions can be formulated, from simple conditions involving aggregate functions to more complex ones involving fuzzy quantifiers (Tineo, 2000). So, for instance, the following SQLf query looks for the cities with few inhabitants such that most of the flats in the city are cheap:

SELECT C.City
FROM FLAT AS C
WHERE C.Inhabitants IS few
GROUP BY C.City
HAVING most of C.Price IS cheap;
Also in this case, the semantics of the fuzzy quantifier most of is defined "a priori" in SQLf. This query selects the small cities in which most of the flats have cheap prices. One could also ask for the cities whose flats have an average price of around 100000 euros, for example, by replacing the last two rows of the previous query with the following ones:

GROUP BY C.City
HAVING avg(C.Price) ≈ 100000
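Following Zadeh's relative quantifier semantics, the degree to which "most of the flats in a city are cheap" can be computed by applying the quantifier's membership function to the average cheapness degree of the group. The following Python sketch is purely illustrative (the helper names are ours, not SQLf's), assuming the trapezoidal definition of most, (0.45, 0.65, 1, 1), used later in the chapter:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function with corners a <= b <= c <= d."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)       # rising edge
    return (d - x) / (d - c)           # falling edge

def most(x):
    # the relative quantifier "most", trapezoid (0.45, 0.65, 1, 1)
    return trapezoid(x, 0.45, 0.65, 1.0, 1.0)

# degrees to which each flat of one city satisfies "C.Price IS cheap"
cheap_degrees = [1.0, 0.8, 0.6, 1.0]
relative_cardinality = sum(cheap_degrees) / len(cheap_degrees)  # 0.85
print(most(relative_cardinality))  # degree to which "most flats are cheap"
```

With an average cheapness of 0.85, the group fully satisfies "most of the flats are cheap."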
The basic block SQLf query can also be nested so as to generate queries with an arbitrary level of nesting.
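To make the N/T calibration mechanism of a basic block SQLf query concrete, here is a sketch in Python. The representation of a fuzzy relation as (tuple, degree) pairs and the `calibrate` helper are hypothetical models of the behavior, not part of SQLf itself:

```python
# Hypothetical model of an SQLf fuzzy relation: a list of (tuple, degree) pairs.
def calibrate(fuzzy_rel, n=None, t=None):
    """SQLf-style result calibration 'select [N | T | N, T]': keep tuples with
    degree > 0, apply the minimum threshold t (if given), then the n best."""
    result = [(row, mu) for row, mu in fuzzy_rel if mu > 0]
    if t is not None:
        result = [(row, mu) for row, mu in result if mu >= t]
    result.sort(key=lambda pair: pair[1], reverse=True)
    return result[:n] if n is not None else result

rf = [(("flat1",), 0.9), (("flat2",), 0.4), (("flat3",), 0.7), (("flat4",), 0.0)]
print(calibrate(rf, n=2, t=0.6))   # the two best tuples with degree >= 0.6
```
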
Main Focus of the Chapter

Distinguishing Characteristics of Soft-SQL

In this section, we present our ideas for the definition of a flexible and customizable query language for a conventional relational database management system. We incorporate flexibility in SQL by assuming the extensions of Bosc and Pivert (1995), at the basis of the definition of SQLf, and by taking into account issues related to practical aspects of its usage, specifically the need to provide users with full control over the semantics of their queries, depending on the application and the query context. This last characteristic is mandatory when considering a database in which linguistic terms such as cheap, high, far, big, and small refer to different attributes of distinct relations. To clarify, let us consider the terms cheap and close, and the different situations in which their meanings can change.
• Different databases. The meaning of cheap can change if we change the database, because the application context for which the database has been defined is different. For example, cheap buildings has a different semantics when referring to buildings in a cadastral database and when referring to buildings in an estate agency database.
• Within the same database. Within the same database, the same linguistic term can have different meanings, depending on the semantics of the tuples to select. For example, cheap for flats has a different meaning than cheap for villas, although both data sets are stored in the same database, in distinct relations.
• Current selected tuples. The semantics of a linguistic term can be influenced by the currently selected tuples. If one has selected flats in a small city like Bergamo (northern Italy) and specifies a further soft selection condition, close to the city center or cheap, the interpretation of the linguistic constraints is likely to be different than if one formulated the same selection on flats located in a big and expensive city like Milan. A flat 4 km from the center of Milan can be considered close to the center, while 4 km from the center of Bergamo can be considered not completely close. A flat that is cheap in Milan is likely to be considered very expensive in Bergamo. Further, one user may have in mind to drive from the flat to the center while another may consider walking, and these subjective settings also influence the interpretation of closeness. Consequently, close and cheap can be interpreted as relative soft conditions, whose interpretation varies depending on the scope for which the query is formulated. We represent this concept of "query scope" by a parameter that we hereafter name the zooming factor. This notion is intuitive in geographic information systems (GIS), where it indicates the scaling factor of a map visualized on the screen: the higher the zooming factor, the stricter the interpretation of closeness. We can generalize this concept to any linguistic term by taking care that the zooming factor affects the semantics of the linguistic value so as to make it stricter as the zooming factor increases. So, if we want to modify the interpretation of cheap to reflect the fact that in a small city like Bergamo it is stricter than in a big city like Milan, we can associate Bergamo with a higher zooming factor than Milan. The idea is to derive the zooming factor automatically from the actual values of another attribute of the tuples. In the example, we could compute the zooming factor by applying a function (for example, the average) to the values of the attribute Inhabitants of the city, so that a small city would have a lower average population than a big city.

While the first two situations are modeled in FSQL (Galindo et al., 2006), the third one is not considered. In Soft-SQL, preferences on selection conditions are represented by soft conditions as in SQLf, but in order to support customizable context-dependent soft conditions, we designed the language with the following guidelines in mind:
• The semantics of the soft selection conditions must be formalized depending on the context.
• Soft selection conditions should be easily customizable; the language must provide some way to define the semantics of linguistic predicates.
• The closure property of the SQL SELECT command must be strictly preserved; in other words, a SELECT statement takes relations as input and generates a relation as output. No special meaning is attributed to the membership degree of a tuple; it is dealt with just as any other attribute.
Soft-SQL allows the user to specify the context-dependent semantics of linguistic predicates by means of two commands: CREATE TERM-SET,
for defining the semantics of linguistic values used to specify simple soft conditions, and CREATE LINGUISTIC QUANTIFIER, for defining the semantics of linguistic quantifiers used to specify compound soft conditions and also fuzzy aggregation functions. These user-defined linguistic predicates can be specified at distinct levels in Soft-SQL queries on classic relational databases: in the extended basic SQL SELECT command, in the Soft-SQL GROUP BY clause, in the extended SQL HAVING clause, and in the extended aggregate functions (such as the Soft-SQL COUNT). This way, the user has full control of the flexible query language, being able to fully customize the query; in addition, the user can use a linguistic value with distinct meanings in the same application, depending on the chosen reference attribute and query scope. In Soft-SQL, as in SQLf, soft conditions are expressed through linguistic predicates identifying fuzzy subsets of the attribute domains and are specified in the WHERE clause of the extended SQL query. Unlike SQLf, we do not produce fuzzy relations as results of queries, but ordinary relations; this way, the membership degree of a tuple is dealt with as any other attribute of a tuple. Moreover, the soft condition also specifies the context of the linguistic predicates, so that it is possible to choose the proper interpretation of the linguistic value. In the following, we first introduce the command to define linguistic values and customize their semantics. Then, we introduce the command to
define linguistic quantifiers. Finally, we define the SELECT query command.
Customized Linguistic Predicates

Suppose the user wishes to query the database based on a linguistic concept, for example, "the price is cheap." The main problem that arises is: how can the semantics of the linguistic concept "cheap" for prices be defined? Here, the key to flexibility is the possibility of defining linguistic concepts appropriate for the specific application context. Consider the case of storing data about flats for sale in the database, and suppose the user wishes to define a set of linguistic terms for price levels, such as "expensive" and "cheap." The semantics of a linguistic term might be defined by a trapezoidal function, normalized within the range [0, 1] (see Figure 1a). When defining the linguistic terms, we can consider that (in Europe and North America) price levels might range between 0 and 1 million euros. Soft-SQL provides commands for defining term sets and linguistic predicates. Following these considerations, the term set named PriceLevels can be defined as follows:

CREATE TERM-SET PriceLevels
NORMALIZED WITHIN (0, 1000000)
EVALUATE Price WITH PARAMS Price AS FLOAT,
VALUES
(‘expensive’, (0.6, 0.7, 0.85, 1), *)
(‘cheap’, (0, 0, 0.2, 0.4), *);
Figure 1. Trapezoidal membership function of the linguistic term cheap: (a) cheap defined on the unit interval; (b) cheap rescaled on the absolute domain [0, 1 million €].
The NORMALIZED WITHIN clause defines the normalization range; in this term set, the evaluation range is normalized between 0 and 1000000 euros; values less than 0 are treated as 0, while values greater than 1000000 are treated as 1000000. The EVALUATE clause specifies the type of the parameter that, after normalization, is subjected to the soft condition specified by the defined linguistic terms. In this case, the argument of the soft condition is a floating point value named Price. Finally, the two linguistic terms "expensive" and "cheap" are defined: the name is followed by four values (in the range [0, 1]) that are the x-coordinates of, respectively, the bottom-left, top-left, top-right, and bottom-right corners of the trapezoidal function (see the trapezoidal function associated with cheap in Figure 1 and the section titled Definition of the CREATE TERM-SET Command for more details). A linguistic term is exploited to query tables by specifying a soft predicate in the conditions of the SELECT command (in the WHERE and HAVING clauses). For instance, suppose the user wishes to query table FLAT in order to find cheap flats. The WHERE clause might be the following:

WHERE Price IS ‘cheap’ IN PriceLevels
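The evaluation behind such a soft predicate can be sketched as follows: the attribute value is clamped and mapped onto [0, 1] according to the NORMALIZED WITHIN range, and the result is fed to the trapezoidal function of the chosen term. The Python helpers below are illustrative models of this behavior, not part of Soft-SQL:

```python
def trapezoid(x, a, b, c, d):
    """Membership of x in a trapezoid with corners a <= b <= c <= d."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)   # rising edge
    return (d - x) / (d - c)       # falling edge

def normalize(value, lo, hi):
    """NORMALIZED WITHIN (lo, hi): clamp the value, then map it onto [0, 1]."""
    return (min(max(value, lo), hi) - lo) / (hi - lo)

def cheap(price):
    # 'cheap' from the PriceLevels term set: corners (0, 0, 0.2, 0.4)
    return trapezoid(normalize(price, 0, 1_000_000), 0.0, 0.0, 0.2, 0.4)

print(cheap(700_000), cheap(100_000), cheap(300_000))  # 0.0, 1.0, ~0.5
```
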
The new IS ... IN operator allows specifying soft predicates. The condition reported above says that the values of attribute Price are checked against the trapezoidal function defined for the linguistic value ‘cheap’ in the term set PriceLevels. Suppose a flat's price is 700000 euros: based on the definition of the linguistic term (the function is depicted in Figure 1) and on the normalization range, the degree of satisfaction is 0, so the flat is not cheap. If the price is 100000 euros, the satisfaction degree is 1, so the flat is truly cheap. If the price is 300000 euros, the satisfaction degree is 0.5, so the flat is partially cheap. However, the same linguistic terms might be exploited by a more sophisticated evaluation mechanism. For instance, suppose the user looking for flats wants to compare
the difference between the price and a base level, in order to know whether the difference in price is cheap or expensive. In practice, a function f2(price, base) = price − base can be defined; the value provided by this function is normalized and then checked against the trapezoidal function associated with the linguistic values. Thus, the EVALUATE clause actually defines one or more evaluation functions, as in the following enriched definition of the term set PriceLevels:

CREATE TERM-SET PriceLevels
NORMALIZED WITHIN (0, 1000000)
EVALUATE Price WITH PARAMS Price AS FLOAT
EVALUATE (price - base) WITH PARAMS price AS FLOAT, base AS FLOAT,
VALUES
(‘expensive’, (0.6, 0.8, 1, 1), *)
(‘cheap’, (0, 0, 0.2, 0.4), *);
Two evaluation functions are defined: the first one is simple; the value of the parameter Price is evaluated as it is against the trapezoidal function. The second evaluation function is based on two parameters (price and base), and the difference between them is normalized and evaluated. When the term set is exploited in queries (within the SELECT command by means of the IS ... IN operator), the system checks the number and types of the parameters, determining which evaluation function to apply. The "*" in the definition of the linguistic value semantics is associated with a modifier function (in this specific case, a product; two other modifier functions are possible, "-" and "+", which define a left or right translation of the trapezoidal function on its domain). It is used to specify, in the basic block Soft-SQL query, that the semantics of the linguistic value must be made dependent on the context, that is, on the zooming factor (a detailed description of modifiers is in the section titled Definition of the CREATE TERM-SET Command).
For example, the following WHERE clause exploits the second evaluation function in order to obtain flats for which it is necessary to add a cheap amount of money with respect to the base price of 300000 euros:

WHERE (Price, 300000) IS ‘cheap’ IN PriceLevels
The system matches the pair (Price, 300000) with the evaluation functions defined in the term set and applies the one (if defined) compatible with the pair.
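A sketch of how such matching might work, assuming a hypothetical evaluator that selects the evaluation function by the number of supplied arguments (the function names are ours, for illustration only):

```python
def cheap(x):
    """'cheap' trapezoid (0, 0, 0.2, 0.4) on the normalized domain [0, 1]."""
    if x <= 0.2:
        return 1.0
    if x >= 0.4:
        return 0.0
    return (0.4 - x) / 0.2

def evaluate_price_levels(term, *args):
    """Hypothetical dispatcher for the two EVALUATE clauses of PriceLevels."""
    if len(args) == 1:            # EVALUATE Price
        value = args[0]
    elif len(args) == 2:          # EVALUATE (price - base)
        value = args[0] - args[1]
    else:
        raise TypeError("no evaluation function matches these parameters")
    # normalization range (0, 1000000): clamp, then map onto [0, 1]
    x = min(max(value, 0), 1_000_000) / 1_000_000
    return term(x)

# "(Price, 300000) IS 'cheap' IN PriceLevels" for a flat priced 500000 euros:
print(evaluate_price_levels(cheap, 500_000, 300_000))  # 200000 over base -> 1.0
```
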
Soft-SQL Basic Queries and Term Sets

Consider now a simple, but complete, query written by means of the extended SELECT command. We want to select cheap flats in Rome:

SELECT Id, Price
FROM FLAT
WHERE City=’Rome’ AND Price IS ‘Cheap’ IN PriceLevels;
The WHERE clause now specifies a compound soft condition in which a crisp predicate, taking values in {0,1}, is conjoined with the soft predicate, taking values in [0,1]. Based on the fuzzy AND, evaluated as the minimum of the two membership degrees (the maximum, in case of disjunction), we compute the membership value of the tuples; then, only tuples having a membership degree greater than 0 are selected; finally, these tuples are projected on the attributes Id and Price. Thus, the result of the query is the set of flats that are in Rome and are cheap (fully or partially). This way, the query is flexible: the user obtains not only flats with price less than or equal to 200000 euros, but also flats with prices such as 250000 or 350000 euros: they are not exactly cheap, but their price is still close to being cheap, as far as the concept of a cheap flat in a city is concerned, and the user might find them interesting. However, the membership degree of the selected tuples, which might
add useful information, is not produced by the previous query, and tuples are not ordered with respect to their membership degree as occurs in SQLf. In Soft-SQL, we followed the approach that queries generate classical relational tables. So, if the user wants to obtain the membership degree of tuples, the user obtains this value as a classical attribute, by using a special keyword, as shown in the following query:

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City=’Rome’ AND Price IS ‘Cheap’ IN PriceLevels;
The keyword DEGREE refers to the membership degree of tuples. Consequently, the generated table contains three attributes: the flat identifier, the flat price, and the degree of satisfaction (attribute D) of the selection condition. For example, a flat with price 300000 euros has 0.5 as its membership degree, meaning that the price is not exactly cheap, but still close to being considered cheap. The membership degree is an important measure and can be exploited to better select tuples: in effect, it can be used to rank tuples, denoting how much they satisfy the selection conditions. To this end, the ORDER BY clause can be exploited, as in the following query, which takes, for instance, the five best results:

SELECT TOP 5 Id, Price, DEGREE AS D
FROM FLAT
WHERE City=’Rome’ AND Price IS ‘Cheap’ IN PriceLevels
ORDER BY DEGREE DESC;
Observe that, with respect to the syntax of SQLf, the query is based on a specific keyword, TOP. When TOP n is specified, the query takes the first n sorted tuples in the result table. The TOP keyword is general and not specifically designed to deal with membership degrees; therefore, no special parameters must be added to the query, since ORDER BY and TOP operate on crisp relations as well.
Let us now formulate a query in which we want to modify the semantics of cheap given the context of "Rome," a big city: we want to dilate the definition of cheap with respect to its standard definition, so as to also consider cheap those prices of flats that are commonly not considered cheap. This can be done by specifying the optional parameter ZOOMING as follows:

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City=’Rome’ AND Price IS ‘Cheap’ IN PriceLevels ZOOMING 0.5;
Having specified a factor ZOOMING=0.5, and given the modifier "*" in the definition of the linguistic value semantics, we indicate that the actual price must be multiplied by 0.5 before evaluating its satisfaction of the soft condition cheap. So, if in the common case a price of 300000 euros is cheap to a degree of 0.5, with ZOOMING=0.5 we can say that a price of 600000 euros is still cheap to the degree 0.5. Conversely, by specifying ZOOMING=2, we restrict the concept of cheap, so that a price of 150000 euros is cheap only to the degree 0.5. Finally, it is possible to select only tuples having a membership degree greater than or equal to a specified threshold by means of the DEGREE THRESHOLD subclause, as shown in the following query:

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City=’Rome’ AND Price IS ‘Cheap’ IN PriceLevels ZOOMING 0.5
DEGREE THRESHOLD 0.8;
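A sketch of how the "*" modifier combines with ZOOMING: the normalized value is multiplied by the zooming factor before the membership function is evaluated. The helpers are hypothetical illustrations of this behavior:

```python
def cheap(x):
    """'cheap' trapezoid (0, 0, 0.2, 0.4) on the normalized domain [0, 1]."""
    if x <= 0.2:
        return 1.0
    if x >= 0.4:
        return 0.0
    return (0.4 - x) / 0.2

def with_zooming(term, value, zooming=1.0):
    # modifier '*': scale the normalized value by the zooming factor
    return term(min(value * zooming, 1.0))

# A price of 600000 euros normalizes to 0.6 over (0, 1000000):
print(with_zooming(cheap, 0.6, zooming=0.5))   # dilated: degree ~0.5
print(with_zooming(cheap, 0.15, zooming=2.0))  # restricted: 150000 euros -> ~0.5
```
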
After the selection condition is evaluated (and a membership degree is associated with the tuples), only tuples having a membership degree greater than or equal to 0.8 are selected. For the reader's information, FSQL (Galindo et al., 2006) allows specifying threshold degrees for single conditions or groups
of conditions. We prefer to apply the threshold degree to a tuple when the whole condition has been evaluated, in order to obtain a more natural and intuitive extension of the classical SQL.
Linguistic Quantifiers

Suppose now the user wants to select tuples by evaluating their membership degree in a more flexible way; for instance, the user might want to select flats for which most of the following three conditions C1, C2, and C3 are satisfied:

C1: The flat is cheap;
C2: The flat is big (in terms of number of rooms);
C3: The flat is close to the center.

The quantified condition most of a set of conditions are satisfied is also possible in SQLf and is based on the concept of linguistic quantifier. Thus, to improve flexibility, the user should be provided with a command to define linguistic quantifiers. We therefore introduced the CREATE LINGUISTIC QUANTIFIER command, which allows defining relative linguistic quantifiers as defined by Zadeh (1983):

CREATE LINGUISTIC QUANTIFIER most
VALUES (0.45, 0.65, 1, 1);
CREATE LINGUISTIC QUANTIFIER almost_all
VALUES (0.9, 1, 1, 1);
The two above instructions define two quantifiers, named most and almost_all, respectively. The tuples following the VALUES keyword define, as for linguistic values in term sets, a trapezoidal membership function µquantifier normalized within the range [0, 1] that represents the semantics of the quantifier (for instance, the function µalmost_all is the membership function for the quantifier almost_all). We rely on the OWA semantics introduced by Yager (1988) for the evaluation of quantified soft conditions in the WHERE clause. This choice is motivated by the fact that in this context the quantifier is used to aggregate conditions, and thus this definition is more adequate than Zadeh's definition. To derive the weighting vector W = [w1, ..., wn] of the OWA operator, given the membership function µQ of the relative nondecreasing quantifier defined by the CREATE LINGUISTIC QUANTIFIER command and n, the number of soft conditions to aggregate, we compute the following (Yager, 1994):

wi = µQ(i/n) − µQ((i−1)/n), with i = 1, …, n

Then we apply the OWA operator defined by the weighting vector W to the degrees of satisfaction µc1(t), ..., µcn(t) of the soft conditions c1, ..., cn by each tuple t:

OWAQ(µc1(t), …, µcn(t)) = ∑i=1,..,n (wi * bi)

with bi being the i-th greatest value among µc1(t), ..., µcn(t). As in SQLf, quantifiers can be exploited in the query. Then, the question on which we based the above example is expressed by means of the following query:

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City=’Rome’ AND QUANTIFIED most (
  Price IS ‘Cheap’ IN PriceLevels ZOOMING 0.5,
  NumberOfRooms IS ‘Big’ IN RoomNumbers ZOOMING 2,
  DistanceFromCenter IS ‘close’ IN CityDistances ZOOMING 0.5)
ORDER BY DEGREE DESC;
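The OWA evaluation can be sketched directly from these formulas. The Python below is illustrative (helper names are ours), using the trapezoidal definition of most from the CREATE LINGUISTIC QUANTIFIER example:

```python
def mu_most(x):
    """Trapezoid (0.45, 0.65, 1, 1) of the quantifier most."""
    if x <= 0.45:
        return 0.0
    if x >= 0.65:
        return 1.0
    return (x - 0.45) / 0.2

def owa(mu_q, degrees):
    """Yager's OWA: w_i = muQ(i/n) - muQ((i-1)/n), applied to the degrees
    sorted in decreasing order (b_i is the i-th greatest degree)."""
    n = len(degrees)
    weights = [mu_q(i / n) - mu_q((i - 1) / n) for i in range(1, n + 1)]
    ordered = sorted(degrees, reverse=True)
    return sum(w * b for w, b in zip(weights, ordered))

# degrees of satisfaction of the three soft conditions for one tuple:
print(owa(mu_most, [0.9, 0.5, 0.7]))  # -> 0.7: with n=3, "most" weights the 2nd best
```

With n = 3, the weights derived from most are [0, 1, 0], so the operator returns the second-greatest degree, a median-like aggregation.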
The QUANTIFIED predicate evaluates the most quantifier on the three conditions listed in parentheses. Further, the QUANTIFIED predicate can be freely composed with other predicates. Observe that, since we are interested in flats in a big and expensive city such as Rome, we have modified the standard definition of the linguistic values to fit the specific context: we have dilated the semantics of cheap for prices, made stricter the definition of big for flats, and dilated the definition of close.
Groups

Soft-SQL extends the GROUP BY clause in order to cope with the membership degrees of tuples. The question is: what happens when tuples are grouped together? What is the membership degree of the overall group? Soft-SQL provides different semantics, named SAFE, OPTIMISTIC, and AVERAGE. The SAFE semantics assigns the group the minimum membership degree of its tuples; the OPTIMISTIC semantics assigns the group the maximum membership degree of its tuples; the AVERAGE semantics assigns the group the average of the membership degrees of its tuples. The different semantics give different relevance to groups. The SAFE semantics behaves as a conjunction and can be used to obtain a strict evaluation of groups, based on the worst representative. For instance, consider the following query (notice the WITH SAFE DEGREE option):

SELECT City, DEGREE AS D
FROM FLAT
WHERE Price IS ‘Cheap’ IN PriceLevels DEGREE THRESHOLD 0.8
GROUP BY City WITH SAFE DEGREE;
This can also be expressed in SQLf. Given a city, the minimum degree among the tuples describing flats in that city is taken as the degree of the overall group; consequently, if the degree of a city (let us denote it as city1) is 0.85 and the degree of a second city (city2) is 0.95, this means that all flats in city2 have a membership degree of at least 0.95; the user might find city2 more interesting than city1, since flats available in city2 are generally cheaper than flats available in city1. If we change the GROUP BY clause in the previous query to:

GROUP BY City WITH OPTIMISTIC DEGREE
we adopt the optimistic semantics: in this case, the membership degree of the group is the membership degree of the best representative of the group. Consider again the two sample cities city1 and city2: with the optimistic semantics we might obtain, for instance, a membership degree of 1 for city1 and of 0.98 for city2. This means that city1 has at least one fully cheap flat, while city2 does not. Finally, the AVERAGE semantics takes the average membership degree of the tuples in a group as the membership degree of the overall group. This is very useful to evaluate the average strength of a group; observe that this semantics corresponds to the notion of relative cardinality of a fuzzy set. Nevertheless, one could also exploit a linguistic quantifier such as most to compute the semantics of the GROUP BY clause based on a trade-off between a risk-taking and a risk-averse attitude, such as:

GROUP BY City WITH most DEGREE
in which most is the linguistic quantifier previously defined. The semantics of these Soft-SQL queries can also be expressed in SQLf. The difference with respect to SQLf is that, in the context of the GROUP BY, we do not evaluate the linguistic quantifier by means of the OWA operator, since this is too costly, given that each group can generally contain a large number of tuples. We adopt the OWA definition of linguistic quantifiers in the context of the QUANTIFIED predicate, where the number of soft condition satisfaction degrees to be aggregated by the OWA operator is limited. In contrast, in the GROUP BY it is more intuitive to directly use the definition of linguistic quantifiers given by Zadeh (1983) and to evaluate the trapezoidal membership function µQ associated with the quantifier Q on the fuzzy cardinality of each group: the average membership degree of the tuples in the group (the relative cardinality of the fuzzy set) is computed, and the overall membership degree of the group is then given by applying µQ to it.
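The four group semantics can be summarized with a small sketch (the function names are illustrative, not part of Soft-SQL):

```python
def group_degree(degrees, semantics="SAFE", quantifier=None):
    """Membership degree of a group of tuples under the Soft-SQL semantics."""
    if semantics == "SAFE":          # worst representative (min)
        return min(degrees)
    if semantics == "OPTIMISTIC":    # best representative (max)
        return max(degrees)
    avg = sum(degrees) / len(degrees)  # relative cardinality of the fuzzy set
    if semantics == "AVERAGE":
        return avg
    # user-defined quantifier Q: Zadeh's evaluation, muQ applied to the average
    return quantifier(avg)

# the quantifier "most", trapezoid (0.45, 0.65, 1, 1)
most = lambda x: 0.0 if x <= 0.45 else 1.0 if x >= 0.65 else (x - 0.45) / 0.2

mu = [0.85, 0.95, 1.0]  # degrees of the flats of one city
print(group_degree(mu, "SAFE"), group_degree(mu, "OPTIMISTIC"))  # 0.85 1.0
print(group_degree(mu, "Q", most))  # muQ(average) = 1.0
```
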
HAVING Clause

Soft-SQL redefines the semantics of the HAVING clause along the lines of SQLf. Similarly to the WHERE clause, we might have soft predicates based on linguistic predicates (by means of the IS ... IN operator); furthermore, the membership degree of a group before group selection might not be 1. In this case, the membership degree of a group after group selection is the minimum between the original membership degree and the one obtained by evaluating the HAVING condition. In fact, the HAVING clause plays the role of a further selection of groups: after grouping, groups are further evaluated and selected based on the clause; consequently, it is intuitive to take the minimum of the original membership degree and the one obtained by the condition (like an AND). Again, as a straightforward extension, we allow the user to specify a membership degree threshold for groups (similarly to the FROM/WHERE clauses): if no minimum threshold is specified, groups with a membership degree greater than 0 are selected; otherwise, groups with a membership degree greater than or equal to the specified threshold are selected. This way, the HAVING clause is coherent with its role: it is a group selection condition. Consequently, since groups have a membership degree, both predicates based on the IS ... IN operator and on aggregate functions (see next section) can be expressed, and groups can be selected based on the specified minimum threshold for group membership degrees. The result is that the WHERE and HAVING clauses are fully orthogonal and can be freely composed to write complex queries.
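The min-based combination and thresholding just described amount to the following (a hypothetical helper, for illustration):

```python
def select_group(group_degree, having_degree, threshold=None):
    """Combine a group's degree with its HAVING condition like an AND (min);
    keep the group only if the result passes the (optional) threshold."""
    combined = min(group_degree, having_degree)
    if threshold is None:
        return combined if combined > 0 else None   # default: degree > 0
    return combined if combined >= threshold else None

print(select_group(0.9, 0.6))                 # group kept with degree 0.6
print(select_group(0.9, 0.6, threshold=0.8))  # group discarded -> None
```
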
Flexible Aggregate Functions

When defining Soft-SQL, we considered aggregate functions as well. What happens when aggregate functions are applied to a set of tuples with membership degrees? What is the membership degree of the aggregation? We found the answer in the different semantics introduced for groups:
Customizable Flexible Querying
it is possible to choose whether to evaluate the membership degree by means of the SAFE, OPTIMISTIC, AVERAGE, or whatever linguistic quantifier Q defined by the user. To illustrate, consider the following basic query, which selects cheap flats in Rome.

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels;

If we want to count the number of retrieved flats, we may transform the query as follows by means of the COUNT aggregate function.

SELECT COUNT(*) AS C
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels;

But counted tuples may have a membership degree less than 1; thus, what is the membership degree of the overall set of tuples? By applying the SAFE semantics, the degree is the degree of the worst representative: in the specific case, a degree of 0.5 might say that the set of flats is not satisfactory, since at least one flat is not very cheap. By applying the OPTIMISTIC semantics, the degree is the degree of the best representative: in the specific case, the same set of tuples might have an optimistic degree of 1, meaning that the user is fully satisfied because there is at least one fully cheap flat in Rome. The AVERAGE semantics considers the average of the membership degrees as the degree of the aggregate set of tuples, that is, an average measure of the relevance of the set of tuples: in the specific case, the same set of tuples might have an average degree of 0.8, meaning that on average the found flats are quite cheap, or even fully cheap. Finally, in the most general case of trade-off semantics defined by a user-defined quantifier Q, the degree of the counted tuples is computed as for the case of the group described in the previous section: in the specific case, the same set of tuples might have a degree of 0.6, meaning that Q of the found flats are quite cheap, or even fully cheap.

To denote the chosen semantics, the syntax of aggregate functions has been extended. Then, the following aggregate functions:

COUNT(* WITH SAFE DEGREE)
COUNT(* WITH OPTIMISTIC DEGREE)
COUNT(* WITH AVERAGE DEGREE)
COUNT(* WITH Q DEGREE)

obtain the number of tuples in the set of tuples with associated SAFE, OPTIMISTIC, AVERAGE, and Q-quantified membership degree, respectively. The behavior of the other aggregate functions is straightforward; however, when aggregate functions consider specific values, only the degrees of tuples having those values are considered. For example, the functions

COUNT(Price WITH SAFE DEGREE)
SUM(Price WITH SAFE DEGREE)
AVG(Price WITH SAFE DEGREE)

consider, for computing the overall membership degree, the degrees of tuples with a not null value for attribute Price. Furthermore, for computing the overall membership degree, the functions

MIN(Price WITH SAFE DEGREE)
MAX(Price WITH SAFE DEGREE)

consider only the degrees of the tuples having the minimum (respectively, maximum) value for attribute Price. This choice is coherent with the notion of aggregate function: because the overall set of tuples is represented by one single value that corresponds to the minimum (respectively, maximum) value, only the degrees of the representative tuples are considered. This characteristic of Soft-SQL is novel and not present in previous extensions of SQL.
Table 1. Example of a relation

Id      DistanceFromCenter      DEGREE
101     2.5                     0.9
102     5.2                     0.8
103     2.5                     1.0
Examples
Suppose we have the set of tuples shown in Table 1, with membership degrees DEGREE, obtained after the application of a soft selection condition. Table 2 shows a set of aggregate functions and the returned values with membership degree µ. Observe that the degree for the MIN aggregate function is computed considering only the tuples having the minimum value, while for the COUNT function all tuples are considered. Also notice the meaning of the different degrees: the SAFE quantifier summarizes the worst satisfaction of the selection conditions, the OPTIMISTIC quantifier summarizes the best satisfaction (there is at least one tuple fully satisfying the selection conditions), and the AVERAGE quantifier shows the average behavior of the tuples w.r.t. the selection conditions. We could even specify a user-defined degree through a linguistic quantifier Q.
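The values of Table 2 can be reproduced from the Table 1 tuples with a short sketch (the tuple layout (Id, DistanceFromCenter, DEGREE) is an assumption of this illustration):

```python
# Tuples from Table 1: (Id, DistanceFromCenter, DEGREE)
rows = [(101, 2.5, 0.9), (102, 5.2, 0.8), (103, 2.5, 1.0)]

def avg(ds):
    return sum(ds) / len(ds)

def min_with_degree(rows, quantifier):
    """MIN(DistanceFromCenter WITH <quantifier> DEGREE): only the
    degrees of the tuples carrying the minimum value are aggregated."""
    m = min(r[1] for r in rows)
    return m, quantifier([r[2] for r in rows if r[1] == m])

def count_with_degree(rows, quantifier):
    """COUNT(* WITH <quantifier> DEGREE): all tuple degrees count."""
    return len(rows), quantifier([r[2] for r in rows])
```

For example, `min_with_degree(rows, min)` yields the pair (2.5, 0.9) of the table's first line: the minimum distance is shared by tuples 101 and 103, so only their degrees (0.9 and 1.0) are aggregated.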
Examples
Suppose the user wants to know how many quite cheap flats are in Rome (note the degree threshold 0.8 that captures the idea of a "quite cheap flat").
SELECT COUNT(* WITH OPTIMISTIC DEGREE) AS items, DEGREE
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels
ZOOMING 2 DEGREE THRESHOLD 0.8;
Since we are interested in understanding the goodness of the selected items, we specify the OPTIMISTIC quantifier: we obtain the maximum membership degree, which denotes the degree of the best selected item w.r.t. the selection condition. Suppose now we want to know the number of selected flats and the minimum distance from the center among the tuples satisfying the query with a membership degree of at least 0.8.

SELECT COUNT(* WITH OPTIMISTIC DEGREE) AS items,
       MIN(DistanceFromCenter WITH SAFE DEGREE) AS MinDist,
       DEGREE WITH AVERAGE DEGREE
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels
DEGREE THRESHOLD 0.8;
W.r.t. the previous query, we added the aggregate function that computes the minimum distance from the center with the SAFE quantifier. This means that we consider that the distance from the center is an important parameter, thus the strength of the overall set of selected tuples depends on
Table 2. Example of membership degrees for aggregate functions

Function                                          Value    µ
MIN(DistanceFromCenter WITH SAFE DEGREE)          2.5      0.9
MIN(DistanceFromCenter WITH AVERAGE DEGREE)       2.5      0.95
MIN(DistanceFromCenter WITH OPTIMISTIC DEGREE)    2.5      1.0
COUNT(* WITH SAFE DEGREE)                         3        0.8
COUNT(* WITH AVERAGE DEGREE)                      3        0.9
COUNT(* WITH OPTIMISTIC DEGREE)                   3        1.0
the minimum degree of the tuples having the lowest distance. Then, we have to choose the final degree. In these situations, we can again decide which semantics to apply, that is, SAFE, OPTIMISTIC, or AVERAGE, because this situation can be seen as an aggregation as well: the chosen semantics is applied to the degrees of the aggregate functions appearing in the SELECT clause. In the example, we choose the AVERAGE degree, because we evaluate the strength of the set of selected tuples by combining both parameters. If the selected set of tuples were the one shown in Table 1, the final membership degree would be 0.95, that is, the average between 1.0 (the COUNT function) and 0.9 (the MIN function).
Flexible aggregation semantics can be exploited when the GROUP BY clause appears in the query as well. Consider the following query.

SELECT City
FROM FLAT
WHERE Price IS 'Cheap' IN PriceLevels AND DistanceFromCenter IS 'Close' IN Distances
GROUP BY City WITH SAFE DEGREE
HAVING COUNT(* WITH AVERAGE DEGREE) 0}
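The final-degree computation just illustrated (the average of 1.0 from the COUNT and 0.9 from the MIN) can be sketched as:

```python
def select_degree(agg_degrees, quantifier="SAFE"):
    """Final DEGREE of the single result tuple when aggregate
    functions appear in the SELECT clause: the specified quantifier
    combines the degrees produced by the aggregate functions."""
    combine = {"SAFE": min,
               "OPTIMISTIC": max,
               "AVERAGE": lambda ds: sum(ds) / len(ds)}[quantifier]
    return combine(agg_degrees)

# Degrees from the example: COUNT(* WITH OPTIMISTIC DEGREE) gives 1.0,
# MIN(DistanceFromCenter WITH SAFE DEGREE) gives 0.9; their average is 0.95.
final = select_degree([1.0, 0.9], "AVERAGE")
```

Omitting the quantifier falls back to SAFE, matching the chapter's default.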
Flexible Aggregate Functions
The syntax of aggregate functions is as follows:

COUNT( * [ WITH quantifier-spec DEGREE ] )
COUNT( attr-name [ WITH quantifier-spec DEGREE ] )
COUNT( DISTINCT attr-name [ WITH quantifier-spec DEGREE ] )
MIN( attr-name [ WITH quantifier-spec DEGREE ] )
MAX( attr-name [ WITH quantifier-spec DEGREE ] )
AVG( attr-name [ WITH quantifier-spec DEGREE ] )
SUM( attr-name [ WITH quantifier-spec DEGREE ] )
where quantifier-spec is one of SAFE, AVERAGE, and OPTIMISTIC, or a user-defined linguistic quantifier; if the WITH option is not specified, the SAFE quantifier is adopted by default. Given a set of tuples T' used to compute the aggregate value av by an aggregate function af, the membership degree associated with av, denoted as µ(av), is the minimum, the average, or the maximum membership degree associated with the tuples in T', depending on the specified basic quantifier; in the case of a user-defined quantifier, the trapezoidal membership function µQ associated with the quantifier is evaluated on the average membership degree of the tuples, as previously described. In particular, function
COUNT(*) operates on all selected tuples, while the other functions operate only on tuples having a not null value for the specified attribute.
Semantics
Consider the set T of tuples on which the aggregate function af is applied. If af is the COUNT(*) function, T'=T; otherwise an attribute attr is specified and T'={t∈T | t.attr is not null}. Given the aggregate value av = af(T') generated by the aggregate function af, its membership degree is:

µ(av) = Min t' { µ(t') }, ∀t'∈T', if the specified quantifier is SAFE;
µ(av) = Avg t' { µ(t') }, ∀t'∈T', if the specified quantifier is AVERAGE;
µ(av) = Max t' { µ(t') }, ∀t'∈T', if the specified quantifier is OPTIMISTIC;
µ(av) = µQ( Avg t' { µ(t') } ), ∀t'∈T', if the specified quantifier is a user-defined quantifier Q.
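A direct transcription of this semantics, assuming tuples are represented as (attribute-dictionary, degree) pairs and using Python's None for SQL null:

```python
def aggregate_degree(tuples, attr=None, quantifier=min):
    """Membership degree mu(av) of an aggregate value.
    tuples: list of (values_dict, mu) pairs. For COUNT(*) pass
    attr=None, so that T' = T; otherwise T' keeps only the tuples
    whose attr is not null (None). SAFE (= min) is the default,
    matching the chapter."""
    ds = [mu for values, mu in tuples
          if attr is None or values.get(attr) is not None]
    return quantifier(ds)
```

Passing `max` or an averaging function as `quantifier` yields the OPTIMISTIC and AVERAGE variants; a user-defined Q would wrap the average in its trapezoidal membership function.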
The SELECT-ORDER BY Block
Consider now the SELECT and ORDER BY clauses. Their syntax is the following:

SELECT [ TOP n ] result-schema [ WITH quantifier-name DEGREE ]
FROM-WHERE-Block
[ ORDER BY list-of-ordering-features ]
SELECT Clause. As in the standard SQL SELECT command, the attribute list appearing in the SELECT clause (i.e., result-schema) defines the schema of the table generated by the SELECT statement; here, we extend SQL by allowing the use of the special keyword DEGREE as an attribute name, whose value is the membership degree of tuples (if the GROUP BY clause is not present) or of groups (if the GROUP BY clause is present). This special attribute is motivated by the fact that the resulting table is a classical relational table with no membership degree associated with tuples: if the user wishes to know the relevance of tuples w.r.t. the specified selections, this attribute can be used to add a column with the membership degree to the output table. The extended semantics for aggregate functions (see the section titled Flexible Aggregate Functions) that can be exploited in the result-schema requires the introduction of a mechanism to choose the final membership degree of tuples in the result. For this reason, the optional subclause WITH quantifier-name DEGREE is introduced for the SELECT clause as well. It allows the specification of a quantifier (SAFE, AVERAGE, OPTIMISTIC, or user-defined); by means of this subclause, it is possible to choose the degree of a tuple in the presence of aggregate functions. In the case of a user-defined quantifier quantifier-name = Q, the Zadeh evaluation will be applied, which means that the AVERAGE of the degrees is computed first and then the value µQ(AVERAGE) of the trapezoidal function defined by (lb,lt,rt,rb).
ORDER BY Clause. As in the standard SELECT command, the ORDER BY clause sorts the tuples in the result table; we allow the user to specify the DEGREE special attribute as a sort key. The TOP subclause in the SELECT clause inserts only the first n sorted tuples into the result table.
Semantics
We define the semantics of the SELECT clause as far as the generation of the schema for the result table is concerned. Consider first the case in which the GROUP BY-HAVING block is not specified. The clause operates on the set of tuples TFW.

• If no aggregate functions are specified in the SELECT clause, the membership degree of each tuple t ∈ TFW is µ(t) (and the reserved attribute DEGREE assumes this value).
• If aggregate functions are specified in the SELECT clause, no plain attributes are allowed in the clause (as per the usual SQL constraint); in this case, the membership degree depends on the quantifier specified for the clause (if it is not specified, the SAFE quantifier is applied by default). Thus, one single tuple summarizing the entire set of tuples is generated, and its membership degree is defined as follows.
With af i, we denote the i-th aggregate function in the SELECT clause; with avi = af i(TFW), we denote the value returned by the i-th aggregate function; with µ(avi), we denote the membership degree obtained by the i-th aggregate function.

If the quantifier is SAFE, µ(TFW) = Min i=1…n { µ(avi) }, ∀ avi = af i(TFW), with af i in the clause
If the quantifier is AVERAGE, µ(TFW) = Avg i=1…n { µ(avi) }, ∀ avi = af i(TFW), with af i in the clause
If the quantifier is OPTIMISTIC, µ(TFW) = Max i=1…n { µ(avi) }, ∀ avi = af i(TFW), with af i in the clause
If the quantifier Q is defined by the user with (lb,lt,rt,rb), µ(TFW) = µQ( Avg i=1…n { µ(avi) } ), ∀ avi = af i(TFW), with af i in the clause

where µQ denotes the trapezoidal function specified by the quadruple (lb,lt,rt,rb). If the GROUP BY-HAVING block is specified, the semantics of the SELECT clause is slightly different.
The GROUP BY-HAVING Block
Consider now the GROUP BY-HAVING block of clauses. Its syntax is the following:

GROUP BY list-of-grouping-attributes [ WITH quantifier-spec DEGREE ]
[ HAVING soft-group-selection-condition ]
[ DEGREE THRESHOLD dtg ]
GROUP BY Clause. The GROUP BY clause behaves similarly to standard SQL, but each group also has a membership degree, which is obtained by applying a quantifier (basic or user-defined) by means of the optional subclause WITH quantifier-spec DEGREE. If this subclause is not specified, the default SAFE quantifier is applied, which computes the minimum of the tuples' membership degrees as the membership degree of the group. The subclause specifying the application of a quantifier is:

WITH quantifier-spec DEGREE

where quantifier-spec specifies the quantifier to apply, that is, either SAFE, AVERAGE, or OPTIMISTIC (or a user-defined quantifier). For example, if the membership degree of a group is to be computed in an optimistic way (i.e., as the maximum of the tuples' membership degrees), it is specified as:

GROUP BY attr WITH OPTIMISTIC DEGREE
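A sketch of the group-degree computation just described, covering both the predefined quantifiers and a user-defined Q given by its trapezoid corners:

```python
def trapezoid_mu(x, lb, lt, rt, rb):
    """Trapezoidal membership function of a user-defined quantifier."""
    if x < lb or x > rb:
        return 0.0
    if lt <= x <= rt:
        return 1.0
    return (x - lb) / (lt - lb) if x < lt else (rb - x) / (rb - rt)

def group_membership(degrees, quantifier="SAFE", corners=None):
    """GroupMembership(g, quantifier): SAFE/AVERAGE/OPTIMISTIC take
    the min/avg/max of the tuple degrees; any other quantifier name
    is treated as user-defined and evaluates its trapezoidal mu_Q
    (given by corners) on the average tuple degree."""
    if quantifier == "SAFE":
        return min(degrees)
    if quantifier == "AVERAGE":
        return sum(degrees) / len(degrees)
    if quantifier == "OPTIMISTIC":
        return max(degrees)
    return trapezoid_mu(sum(degrees) / len(degrees), *corners)
```

The `corners` quadruple corresponds to the (left-bottom, left-top, right-top, right-bottom) corners used in the formal semantics later in this section.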
In the HAVING clause, soft-group-selection-condition allows predicates over aggregate functions and grouping attributes, as well as the specification of soft conditions by means of the IS .. IN operator. Notice that the semantics of aggregate functions has been extended as well, in order to cope with membership degrees. The optional clause DEGREE THRESHOLD dtg that follows the HAVING clause allows the user to specify a filtering threshold on the membership degree of each group: if present, only groups with a membership degree greater than or equal to dtg are selected; otherwise, only groups with a membership degree greater than 0 are selected. dtg is greater than 0 and less than 1.
The Complete SELECT Command
The complete syntax for the command is then the following:

SELECT [ TOP n ] result-schema [ WITH quantifier-spec DEGREE ]
FROM source-relation
[ WHERE soft-selection-condition ] [ DEGREE THRESHOLD dtf ]
[ GROUP BY list-of-grouping-attributes [ WITH quantifier-spec DEGREE ]
  [ HAVING soft-group-selection-condition ] [ DEGREE THRESHOLD dtg ] ]
[ ORDER BY list-of-ordering-features ]
Semantics
We can now define the semantics for the clauses in the GROUP BY-HAVING block.
The GROUP BY Clause. Consider a group g of tuples (grouped together based on the values of the grouping attributes) and a function GroupMembership(g, quantifier) that computes the membership degree for the group, depending on the specified quantifier. The membership degree of the group is µ(g) = GroupMembership(g, quantifier). For the predefined quantifiers:

GroupMembership(g, SAFE) = Min(µ(t)), ∀t∈g
GroupMembership(g, AVERAGE) = Avg(µ(t)), ∀t∈g
GroupMembership(g, OPTIMISTIC) = Max(µ(t)), ∀t∈g

If a user-defined linguistic quantifier Q is specified, the quantifier has an associated quadruple (left-bottom-corner, left-top-corner, right-top-corner, right-bottom-corner) that defines the trapezoidal function µQ. In this case, it is GroupMembership(g, Q) = µQ( Avg t' { µ(t') } ), ∀t'∈g.
The HAVING Clause. The HAVING clause can in turn be a soft condition based on the IS .. IN operator, applied only to grouping attributes. Thus, it is evaluated by computing a membership degree. Similarly to what is defined for the WHERE clause, if we denote the group selection condition by φ, the membership degree for group g is:
µ'(g) = Min( µ(g), µφ(g) )

We adopted the Min semantics because the group selection condition is a further selection applied to the group, and it can be seen as a conjunction with the previous evaluations that gave rise to the group membership degree.
The HAVING clause allows the specification of aggregate functions in comparison expressions. Thus, if we denote a comparison expression with aggregate functions as AggrCompExpr, its membership degree is:

µ(AggrCompExpr) = Min i=1…n { µ(avi) } ∀ avi = af i(g), with af i ∈ AggrCompExpr

Again, we used the Min semantics because, in a comparison expression containing two aggregate functions, it is natural to imagine the one with the lower membership degree determining the membership degree of the comparison. In case an aggregate function is specified in the expression on which the IS .. IN operator is evaluated, the membership degree obtained by the IS .. IN operator is considered. As an example, consider the condition:

MAX(Price) IS 'cheap' IN PriceLevels

That is not allowed in the WHERE clause, because an aggregate function is used. In the HAVING clause it is allowed, but both the aggregate function and the IS .. IN operator give a membership degree; we choose to consider the membership degree returned by the IS .. IN operator as the overall membership degree of the expression.
The DEGREE THRESHOLD Subclause. Consider the set of groups GGH produced by the GROUP BY clause and possibly filtered by the HAVING clause, whose membership degrees µ(g) (with g ∈ GGH) have been computed as previously discussed. If the subclause DEGREE THRESHOLD dtg is specified, the final set of groups G'GH produced by the GROUP BY-HAVING block is defined as follows:

G'GH = { g ∈ GGH | µ(g) ≥ dtg }

while, in case the DEGREE THRESHOLD subclause is not specified, it is defined as follows:

G'GH = { g ∈ GGH | µ(g) > 0 }

The SELECT Clause. The semantics of the SELECT clause changes when the GROUP BY-HAVING block is specified. It generates a tuple for each group g ∈ G'GH. If no aggregate functions are specified, the membership degree is µ(g), as previously defined. If aggregate functions are specified, the semantics depends on the quantifier specified for the overall clause (if not specified, the SAFE quantifier is assumed by default):

If the quantifier is SAFE, the degree is Min 0≤i≤n { µi }
If the quantifier is AVERAGE, the degree is Avg 0≤i≤n { µi }
If the quantifier is OPTIMISTIC, the degree is Max 0≤i≤n { µi }

where µ0 = µ(g), while for 1≤i≤n it is µi = µ(avi), with avi = af i(g), where af i is the i-th aggregate function in the SELECT clause.

Conclusion

In this chapter, we presented the current results of the Soft-SQL project, whose goal is to define and implement a flexible query language as an extension of classical SQL within fuzzy set theory. The work takes ideas from previous well-known approaches such as SQLf (Bosc & Pivert, 1995) and is motivated by practical issues, mainly the intent of defining a user-customizable, context-dependent, and fully controllable query language, exploiting features of classic SQL as far as possible in order to allow the expression of flexible queries on classical relational databases. The proposal introduces several novel concepts. First, it works on regular relations and no special meaning is attributed to the membership degree. The user must explicitly use the SQL ORDER BY clause to rank the tuples of a relation with respect
to their membership degree attribute values. This way, Soft-SQL is really an extension of SQL: it completely subsumes it, satisfying the closure property. By means of the notion of user-defined term sets, and through the use of the new command named CREATE TERM-SET, the user can define and customize the semantics of sets of linguistic values with respect to a given context. Thus, linguistic terms used to express soft conditions can assume a different semantics according to the attribute to which they are applied, as occurs when using terms in natural language. To this end, the SELECT command has been redefined, in order to let the user specify flexible queries based on context-dependent soft selection conditions. Furthermore, a rich set of options has been introduced in the SELECT command to allow the precise definition of the query semantics and to adapt it to the specific context of the query, in order to obtain results that really meet users' needs. This way, we achieve a level of flexibility of the language not possible in previous extensions of SQL by fuzzy sets. Following the same approach, the user is allowed to define linguistic quantifiers: the new command CREATE LINGUISTIC QUANTIFIER allows specifying and customizing the semantics of newly created relative linguistic quantifiers, which can be exploited both in the WHERE clause and in the GROUP BY clause.
References

Baldwin, J. F., Coyne, M. R., & Martin, T. P. (1993). Querying a database with fuzzy attribute values by iterative updating of the selection criteria. In International Joint Conference on Artificial Intelligence (IJCAI'93).

Bosc, P., Buckles, B., Petry, F. E., & Pivert, O. (1999). Fuzzy databases. In J. C. Bezdek, D. Dubois, & H. Prade (Eds.), Fuzzy sets in approximate reasoning and information systems: The handbook of fuzzy set series (pp. 404-468). Kluwer Academic Publishers.
Bosc, P., & Pivert, O. (1992). Fuzzy querying in conventional databases. In L. A. Zadeh & J. Kacprzyk (Eds.), Fuzzy logic for the management of uncertainty. John Wiley & Sons.

Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17.

Bosc, P., & Pivert, O. (1997a). Fuzzy queries against regular and fuzzy databases. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems (pp. 187-208). Kluwer Academic Publishers.

Bosc, P., & Pivert, O. (1997b). On representation-based querying of databases containing ill-known values. In Proceedings of the International Symposium on Methodologies for Intelligent Systems (ISMIS'97) (pp. 477-486).

Bosc, P., & Prade, H. (1994). An introduction to the fuzzy set and possibility theory-based treatment of flexible queries and uncertain and imprecise databases. In A. Motro & P. Smets (Eds.), Uncertainty management in information systems: From needs to solutions. Kluwer Academic Publisher.

Bordogna, G., & Psaila, G. (2004, June 24-26). Fuzzy spatial SQL. In Proceedings of Flexible Querying Answering Systems (FQAS04) (LNAI 3055), Lyon, France. Springer-Verlag.

Bordogna, G., & Psaila, G. (2005, March 15-16). Extending SQL with customizable soft selection conditions. In Proceedings of the ACM-SAC Track on Information Access, Santa Fe, NM.

Buckles, B. P., & Petry, F. E. (1985). Query languages for fuzzy databases. In J. Kacprzyk & R. R. Yager (Eds.), Management decision support systems using fuzzy sets and possibility theory (pp. 241-251). Verlag TÜV Rheinland.

Buckles, B. P., Petry, F. E., & Sachar, H. S. (1986). Design of similarity-based relational databases. In H. Prade & C. V. Negoita (Eds.), Fuzzy logic in knowledge engineering (pp. 3-7). Verlag TÜV Rheinland.
Dubois, D., & Prade, H. (1997). Using fuzzy sets in flexible querying: Why and how? In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems (pp. 45-60). Kluwer Academic Publishers.

Eduardo, J., Goncalves, M., & Tineo, L. (2004, September 27-October 1). A fuzzy querying system based on SQLf2 and SQLf3. In Proceedings of the XXX Conferencia Latinoamericana de Informática (CLEI 2004), Arequipa, Peru.

Galindo, J., Medina, J. M., & Aranda, G. M. C. (1999). Querying fuzzy relational databases through fuzzy domain calculus. International Journal of Intelligent Systems, 14, 375-411.

Galindo, J., Medina, J. M., Cubero, J. C., & García, M. T. (2000). Fuzzy quantifiers in fuzzy domain calculus. In Proceedings of the 8th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU'2000) (pp. 1697-1702), Spain.

Galindo, J., Medina, J. M., Pons, O., & Cubero, J. C. (1998). A server for fuzzy SQL queries. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems. Springer-Verlag.

Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Idea Group Publishing.

Goncalves, M., & Tineo, L. (2003, March 1-4). Derivation principle in SQLf2 algebra operators. In Proceedings of the 1st International Conference on Fuzzy Information Processing Theories and Application (FIP-2003), Beijing, China.

Goncalves, M., & Tineo, L. (2005, May 22-25). Derivation principle in advanced fuzzy queries. In Proceedings of the 14th IEEE International Conference on Fuzzy Systems (Fuzz-IEEE 2005), Reno, NV.

Kacprzyk, J., & Zadrozny, S. (1995). FQUERY for access: Fuzzy querying for windows-based DBMS.
In P. Bosc & J. Kacprzyk (Eds.), Fuzziness in database management systems. Physica-Verlag.

Kacprzyk, J., & Zadrozny, S. (1997). Implementation of OWA operators in fuzzy querying for Microsoft Access. In R. R. Yager & J. Kacprzyk (Eds.), The ordered weighted averaging operators: Theory and applications (pp. 293-306). Boston: Kluwer.

Kacprzyk, J., Zadrozny, S., & Ziolkowski, A. (1989). FQUERY III+: A "human-consistent" database querying system based on fuzzy logic with linguistic quantifiers. Information Systems, 6, 443-453.

Kacprzyk, J., & Ziolkowski, A. (1986). Database queries with fuzzy linguistic quantifiers. IEEE Transactions on Systems, Man and Cybernetics, 16, 474-479.

Kießling, W. (2002). Foundations of preferences in database systems. In Proceedings of the 28th International Conference on Very Large Databases.

Kießling, W. (2003). Preference queries with sv-semantics. In Proceedings of the 11th International Conference on Management of Data (COMAD 2005).

Medina, J. M., Pons, O., & Vila, M. A. (1994). Gefred: A generalized model of fuzzy relational databases. Information Sciences, 76, 87-109.

Petry, F. E. (1996). Fuzzy databases. Kluwer Academic Publisher.

Prade, H., & Testemale, C. (1984). Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences, 34, 115-143.

Prade, H., & Testemale, C. (1987). Representation of soft constraints and fuzzy attribute values by means of possibility distributions in databases. In J. C. Bezdek (Ed.), Analysis of fuzzy information (vol. II, pp. 213-229). CRC Press.

Ribeiro, R. A., & Moreira, A. M. (1999). Intelligent query model for business characteristics. In
Proceedings of the IEEE/WSES/IMACS CSCC’99 Conference.
Rosado, A., Ribeiro, R. A., Zadrozny, S., & Kacprzyk, J. (2006). Flexible query languages for relational databases: An overview. In G. Bordogna & G. Psaila (Eds.), Flexible databases supporting imprecision and uncertainty. Springer-Verlag.
Shenoi, S., Melton, A., & Fan, L. T. (1990). An equivalence classes model of fuzzy relational databases. Fuzzy Sets and Systems, 38, 153-170.

Tahani, V. (1977). A conceptual framework for fuzzy query processing: A step toward very intelligent database systems. Information Processing and Management, 13, 289-303.

Takahashi, Y. (1991). A fuzzy query language for relational databases. IEEE Transactions on Systems, Man and Cybernetics, 21, 1576-1579.

Tineo, L. (2000). Extending RDBMS for allowing fuzzy quantified queries. In M. Kung (Ed.), DEXA Proceedings (LNCS 1873, pp. 407-416). Springer-Verlag.

Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-27.

Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multicriteria decision making. IEEE Transactions on Systems, Man and Cybernetics, 18, 183-190.

Yager, R. R. (1994). Interpreting linguistically quantified propositions. International Journal of Intelligent Systems, 9, 541-569.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.

Zadeh, L. A. (1975). The concept of a linguistic variable and its application to approximate reasoning (I-II). Information Sciences, 8, 199-249, 301-357.

Zadeh, L. A. (1983). A computational approach to fuzzy quantifiers in natural languages. Computers & Mathematics with Applications, 9, 149-184.

Zadeh, L. A. (1999). From computing with numbers to computing with words: From manipulation of measurements to manipulation of perceptions. IEEE Transactions on Circuits and Systems, 45(1), 105-119.

Key Terms
Aggregate Function: A function of the SQL language working on sets of tuples instead of single tuples and returning one single value as the result of its evaluation.

Flexible Query: A query allowing the specification of some kind of preferences on selection conditions and/or priorities among conditions. Within the fuzzy context, flexible queries are also named fuzzy queries: preferences in fuzzy queries are defined by soft conditions expressed by linguistic predicates such as young, while priorities among conditions are expressed by numeric values representing the degrees of priority.

Fulfillment Degree: A value in [0,1] that expresses the satisfaction degree of a tuple of a relation subjected to a flexible query in a relational database. When it is zero, the tuple does not satisfy the flexible query at all; when it is one, the tuple fully satisfies the query. Intermediate values in (0,1) indicate partial satisfaction of the query by the tuple. The fulfillment degree is also named the membership degree of the tuple in SQLf queries.

Linguistic Quantifier: Linguistic quantifiers extend the set of quantifiers of classical logic. They can be either crisp (such as all, at least 1, at least k, half) or fuzzy quantifiers (such as most, several, some, approximately k). Formally, Zadeh (1983) first defined fuzzy quantifiers as fuzzy subsets and identified two types of quantifiers: absolute and relative. Absolute quantifiers, such as about 7, almost 6, and so forth, are defined as fuzzy sets with membership
function on a subset of positive integers. Relative quantifiers are defined as fuzzy sets with membership function defined on [0,1].

OWA Operator: Ordered weighted averaging operators, defined by Yager (1988), are a family of mean-like operators that allow the realization of aggregations between the two extremes of AND and OR, corresponding to the minimum and the maximum of the operands, respectively.

Soft Condition: A tolerant selection condition admitting degrees of satisfaction, defined by a fuzzy set on the domain of a linguistic variable, such as Age, and specified by linguistic terms, such as young, old, and so forth.
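The OWA operator from the key terms above can be sketched as follows; note how special weight vectors recover the maximum, the minimum, and the arithmetic mean:

```python
def owa(values, weights):
    """OWA (Yager, 1988): sort the operands in descending order,
    then take the weighted sum; the weights must sum to 1."""
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

# weights (1, 0, ..., 0) give the maximum (OR-like behavior),
# (0, ..., 0, 1) give the minimum (AND-like behavior),
# and (1/n, ..., 1/n) give the arithmetic mean.
```

Intermediate weight vectors realize the aggregations "in between" AND and OR mentioned in the definition.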
Soft-SQL: Indicates the fuzzy extension of SQL defined in this chapter.

SQL: Structured Query Language, used in relational databases.

SQLf: Indicates a fuzzy extension of the SQL language (Bosc & Pivert, 1995). Another extension is named FSQL (Galindo et al., 2006).

Term Set: Name of the set of values that a linguistic variable can assume.
Chapter IX
Qualifying Objects in Classical Relational Database Querying
Cornelia Tudorie
University Dunarea de Jos, Galati, Romania
Abstract
The topic presented in this chapter refers to qualifying objects in some kinds of vague queries sent to relational databases. We want to compute a fulfillment degree in order to measure the quality of objects when we search for them in databases. After a discussion of various kinds of linguistic qualification of objects, with different kinds of fuzzy conditions in a fuzzy query, a new particular situation is proposed for inclusion in this subject: relative object qualification as a query selection criterion, that is, queries with two conditions in which the first one depends on the results of the second one. It is another way to express the user's preferences in a flexible query. In connection with this, a new fuzzy aggregation operator, AMONG, is defined. We also propose an algorithm to evaluate this kind of query and some definitions that make it applicable and efficient (dynamic modeling of the linguistic values and a unified model of the context). We demonstrate these ideas with software already implemented in our lab.
Introduction
Database querying by various selection criteria often confronts a major limitation: the difficulty of formulating and expressing precise criteria for locating information. This happens because people do not always think and speak in precise terms, or because they do not know the range of the data. The research community has recently proposed a new way to query databases, more expressive and flexible than the classical one: vague queries, for example, "retrieve the well-paid persons who live not too far from the office", formulated, of course, in an adequate query language. The main reason to use the vague predicates well paid and not too far is to express the user's preferences more flexibly and, at the same time, to rank the selected tuples by a degree of criteria satisfaction. A precise criterion, like "salary > 500 and distance home-office < 200", may return an empty list, even if there are many persons whose attribute values are very close to the specified ones. Likewise, the same precise criterion may return a complete list of all persons, without any helpful ordering. So, it would be useful to provide intelligent database interfaces able to interpret and evaluate imprecise criteria in queries.
Some important advantages resulting from including vague criteria in a database query are:

• Queries that are easy to express
• The possibility to classify database objects by selecting them based on a linguistic qualification
• The possibility to refine the result by assigning to each tuple the corresponding fulfillment degree (the degree of criteria satisfaction); in other words, to provide an answer ranked according to the user's preferences
Under these circumstances, when vague queries are accepted, fuzzy set membership functions are convenient tools for modeling the user's preferences in many respects. Fuzzy set theory (Bouchon-Meunier, 1995; Dubois, Ostasiewicz, & Prade, 1999; Yager & Zadeh, 1992; Zadeh, 1965) is accepted as one of the most adequate formal frameworks to model and manage vague expressions. Two research areas are important for fuzzy theory applied to the database field: fuzzy querying of regular databases and storing fuzzy information in databases. There are many scientific works regarding database fuzzy querying: general reference books (e.g., Galindo, Urrutia, & Piattini, 2006), but also many journal articles and conference communications. Some of them propose fuzzy extensions of the standard query language for relational databases (SQL), able to interpret and evaluate fuzzy selection criteria; others propose intelligent interfaces for fuzzy querying of classical databases. The most important of those include:

• SQLf (Bosc & Pivert, 1995; Goncalves & Tineo, 2001a, 2001b; Projet BADINS, 1995, 1997) and FSQL (Galindo et al., 2006), extensions of the SQL language allowing flexible querying.
• FQUERY (Kacprzyk & Zadrozny, 1995, 2001) and FuzzyBase (Gazzotti, Piancastelli, Sartori, & Beneventano, 1995), fuzzy querying engines for relational databases.
In this book, the reader can find a chapter by Urrutia, Tineo, and González studying the SQLf and FSQL languages. There is also another chapter, written by Kacprzyk, Zadrożny, de Tré, and de Caluwe, that includes a review of flexible querying. Other works have developed new data models able to take imperfect information into account. Fundamental contributions have been made by Buckles and Petry (1982); Medina, Pons, and Vila (1994); and Prade and Testemale (1984). Galindo et al. (2006) also define a running fuzzy database, which stores imperfect data represented by fuzzy possibilistic distributions, fuzzy degrees, and so forth. From the beginning, it is important to remark that this chapter deals with relational database fuzzy querying (fuzzy queries on crisp data), not fuzzy database querying (queries on fuzzy data). The first presented items are already discussed in Projet BADINS (1995, 1997); Bosc and Pivert (1992); Bosc and Prade (1997); Dubois and Prade (1996); Kacprzyk and Zadrozny (2001); and many others, but we consider it useful to rediscuss them in order to propose an original classification of the various kinds of linguistic object qualifications. In this context, relative object qualification will be proposed as a new kind of selection criterion and included in this classification. Some practical examples inspired us to develop this study. Let us compare the queries:

Retrieve the cars having the speed greater than 240
Retrieve the cars having high speed
Retrieve the inexpensive and high speed cars
Retrieve the inexpensive cars among the high speed ones

They are increasingly complex: they start with a classical crisp query and go to more complex queries, including vague terms in the selection
criteria. They correspond to different kinds of linguistic object qualifications, to different ways to model their semantics in fuzzy set style (i.e., fuzzy models), and to different ways to compute the fulfillment degree of the selection criterion. All of these will be discussed in the following sections. A great part of the chapter (the section titled Relative Object Qualification) is devoted to presenting a new kind of fuzzy query with two conditions, in which the first one depends on the results of the second one. In connection with this, a new fuzzy aggregation operator, AMONG, is defined. Some variants, particular cases of queries based on relative qualification, are analyzed. The section titled Dynamic Modeling of the Linguistic Values proposes a procedure to define the linguistic terms by automatically extracting their fuzzy models from the actual content of the database. This procedure is generally useful, but it is mandatory in the relative qualification case. In order to make the query evaluation process more efficient, we propose, in the section titled The Unified Model of the Context, a solution to incorporate the knowledge base (containing the fuzzy models of the linguistic terms, or at least their metadescriptions) into the target database. Both kinds of knowledge are discussed: effective models of the linguistic values as static definitions, but also metadescriptions of the linguistic values for a dynamic modeling process. Finally, we present several laboratory implementations, the conclusions, and future trends.
Absolute Object Qualification

Querying a relational database in a classical system means selecting data (table rows) satisfying Boolean criteria. For example, the following crisp query is sent to a database including Table 1:

Retrieve the cars having the speed greater than 240
The answer is a table containing the database rows that satisfy the Boolean formula: the criterion max speed > 240 is evaluated and the answer is shown in Table 2. The classical query searches for database objects having a certain property, expressed by a Boolean predicate: if "B Coupe" is selected, that means it has the property of having a speed above 240. A fuzzy predicate is an affirmation that may be more or less true, depending on the argument
Table 1. A relational database table (car)

Name      ...  Max Speed  Price
AA             236        46000
AA4            221        28450
B3             226        31562
B7             243        57200
B Coupe        250        39000
C 300M         230        32000
IO             145        24000
LRD            130        28000
MBS            240        69154
MC             190        18200
M L200         145        19095
NV             132        15883
OA             186        16042
OCS            120        26259
OF             192        43615
OV             208        20669
OZ             178        18364
P 206          170        10466
P 607          222        31268
P 806          177        20633
P 911 C        280        65000
RC             186        12138
Table 2. The cars having speed greater than 240

Name      ...  Max Speed  Price
P 911 C        280        65000
B Coupe        250        39000
B7             243        57200
value: for example, Max Speed("B Coupe", high). It is an extension of the classical logical predicate, which can be either definitely true or definitely false. The truth-value of the fuzzy predicate may be expressed as a number in [0,1], with 1 standing for absolutely true and 0 for absolutely false. In a context related to query selection criteria, a fuzzy predicate is useful for modeling a gradual property: if "B Coupe" is selected, that means it has the property of having a high maximum speed. Moreover, accepting a certain meaning of the term "high" (for example, a fuzzy set as semantic model), the fuzzy query evaluation consists in computing a corresponding fulfillment degree of the "high speed" property. Including a gradual property in a vague database query, like "x are P", gives a qualification to the objects; that means selecting a number of objects (x) from the database that satisfy the gradual property (P) to a certain degree. More examples of fuzzy selection criteria like "x P" are: high speed cars, inexpensive cars, expensive cars, good students, big salary, young people, and so forth. When a query selection criterion is expressed by a gradual property, a fulfillment degree for each tuple is computed, starting from the definition of the fuzzy predicate. It is equal to the value of the membership function at the attribute value in the current tuple.

Definition 1. Let R[A1, A2, …, An] be a table of a relational database, that is, a set of tuples t:
R ⊂ { t | t ∈ D1 × D2 × … × Dn }
where Di are the domains of the attributes Ai, accepted (within this chapter) as intervals [ai, bi]. Then the fulfillment degree of a vague criterion referring to an attribute A with domain D = [a,b], or, in other words, the fulfillment degree of a gradual property P referring to an attribute A, is defined by the membership function of the fuzzy predicate:
µ P : D → [0,1] or µ P : [a,b] → [0,1], v ↦ µ P(v)
The fulfillment degree of the gradual property P associated with the attribute A may be considered a characteristic of each tuple, so that it may also be defined on the table R:
µ P : R → [0,1], t ↦ µ P(t) = µ P(t.A)
where t.A is the crisp value of the attribute A for the tuple t, t.A ∈ [a,b].

Definition 2. If a crisp query on a database table R, based on a condition P (Boolean predicate) referring to an attribute A, is an application:
Q : 2^(D1 × D2 × … × Dn) → 2^(D1 × D2 × … × Dn)
R ↦ { t ∈ R | P(t) }
then a vague query based on a gradual property P associated with the attribute A is the application:
Q P : 2^(D1 × D2 × … × Dn) → 2^(D1 × D2 × … × Dn × (0,1])
R ↦ { (t, µ P(t)) | t ∈ R ∧ µ P(t) > 0 }
where µ P(t) = µ P(t.A).

Usually, several such gradual properties, expressed by linguistic labels, can be linked to the same database attribute. They are named linguistic values, and the set of these labels may be the definition set of a linguistic variable. The definition of linguistic variable can be found in Zadeh (1975) and in another chapter of this book, written by Xexeo. The set of linguistic values makes up the linguistic domain of the database attribute (more details in Tudorie & Dumitriu, 2004). In order to
evaluate any vague query sent to a database, it is necessary to define both the crisp domain and the linguistic one for each attribute frequently used in search operations. For example, [120, 280] is the crisp domain and {low, medium, high} is the linguistic domain for the Max Speed attribute of the car table. Each linguistic value can be considered a gradual property and modeled as a fuzzy predicate defined on the attribute's crisp domain as referential set (as in Figure 1). For example, according to the definitions in Figure 1, the answer to the vague query:

Retrieve the cars having high speed

applied to the car table is in Table 3. In other words, we found the cars with the property of having high speed. The fulfillment degree (µ) expresses the intensity of the property, between degree 0 (a not-high-speed car) and degree 1 (an absolutely high speed car).
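The evaluation just described can be sketched as follows; the ramp for high (degree 0 at 220, degree 1 at 240) approximates Figure 1, and the sample rows are a subset of Table 1:

```python
def mu_high(speed):
    """Fuzzy predicate "high" on Max Speed: a linear ramp from 220
    (degree 0) to 240 (degree 1), approximating Figure 1."""
    return max(0.0, min(1.0, (speed - 220) / 20.0))

# A few rows of the car table (Table 1): (name, max_speed, price)
cars = [("P 911 C", 280, 65000), ("B Coupe", 250, 39000),
        ("C 300M", 230, 32000), ("AA4", 221, 28450), ("IO", 145, 24000)]

# Vague query "high speed cars": keep tuples with mu > 0, ranked by mu
answer = sorted(((name, mu_high(s)) for name, s, _ in cars if mu_high(s) > 0),
                key=lambda x: -x[1])
print(answer)
# [('P 911 C', 1.0), ('B Coupe', 1.0), ('C 300M', 0.5), ('AA4', 0.05)]
```

Note how the crisp cutoff of Table 2 is replaced by a ranking: "IO" is excluded (degree 0), while borderline cars like "AA4" survive with a small degree.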
The AND Operator: Multiqualification

Figure 1. Linguistic values defined on the Max Speed and Price attribute domains (low, medium, and high are defined on the Max Speed domain [120, 280], with breakpoints at 160, 180, 200, 220, and 240; inexpensive, medium, and expensive are defined on the Price domain [10466, 69154], with breakpoints at 25138, 32474, 39810, 47146, and 54482)

In the classical (precise) context, a compound selection criterion is a Boolean expression containing comparisons and logical operators. In a vague context, the operators AND, OR, and NOT are extended to fuzzy aggregation connectives. They are able to compute a global fulfillment degree for each database tuple, starting with the fulfillment
degrees of each fuzzy condition and observing certain models for the fuzzy connectives. Usually, the Min and Max functions stand for the fuzzy conjunctive and disjunctive connectives, and the complement stands for the fuzzy negation connective. But there are many other proposals in the literature for defining aggregation connectives (Grabisch, Orlovski, & Yager, 1998; Yager, 1991). Let us take, for example, a query based on a complex fuzzy selection criterion applied to the car table:

Retrieve the inexpensive and high speed cars.

The evaluation of this query, according to the definitions in Figure 1 and to the content of the car table (Table 1), generates the answer in Table 4. For each table row, the fulfillment degree of each linguistic value is computed, and the arithmetical min function is used to implement the fuzzy conjunction between them. The answer contains the table rows (cars) having a significant global fulfillment degree.

Definition 3. The fuzzy model of the conjunction, AND(P, S), of two gradual properties, P and S, associated with two attributes, A1 and A2, is defined by the mapping:
µ P AND S : D1 × D2 → [0,1] or µ P AND S : [a1,b1] × [a2,b2] → [0,1],
(v1, v2) ↦ min( µ P(v1), µ S(v2) )
The same fulfillment degree defined on a database table R is:
µ P AND S : R → [0,1],
t ↦ min( µ P(t), µ S(t) ) = min( µ P(t.A1), µ S(t.A2) )
where t is a tuple and µ P and µ S are the membership functions defining each of the two gradual properties. Any conjunctive operator is a triangular norm (or t-norm), as defined in fuzzy set theory (Yager, 1991), with the following properties:
Table 3. The "high speed cars" table

Name     ...  Max Speed  Price   µ
P 911 C       280        65000   1
B Coupe       250        39000   1
B7            243        57200   1
MBS           240        69154   1
AA            236        46000   0.80
C 300M        230        32000   0.50
B3            226        31562   0.30
P 607         222        31268   0.10
AA4           221        28450   0.05

The objects selected by the query are defined by a double qualification: to be "inexpensive" and at the same time to have "high speed." It is important to remark that the two qualifications are independent of each other, and they have the same significance for the user's preferences.

Table 4. The "high speed and inexpensive cars" table

Name     Max Speed  Price   µ high  µ inexpensive  µ
B3       226        31562   0.3     0.12           0.12
P 607    222        31268   0.1     0.16           0.1
C 300M   230        32000   0.5     0.06           0.06
AA4      221        28450   0.05    0.54           0.05
1. commutativity: AND(P, S) = AND(S, P)
2. associativity: AND(P, AND(S, T)) = AND(AND(P, S), T)
3. monotonicity: AND(P, S) ≤ AND(P', S') if P ≤ P' and S ≤ S'
4. unit element: AND(P, 1) = P

It is obvious that the min operator is a t-norm. A list of various t-norm functions is presented in Dubois and Prade (1996) and Galindo et al. (2006, p. 20). When these functions are used as the AND connective in database querying, they model different linguistic expressions and, of course, different logical meanings of the selection criterion. Queries like the above-mentioned one include two (or more) gradual properties in the fuzzy selection criterion.
Definition 4. The vague query based on a double qualification is an application:
Q P,S : 2^(D1 × D2 × … × Dn) → 2^(D1 × D2 × … × Dn × (0,1])
R ↦ { (t, min( µ P(t), µ S(t) )) | t ∈ R ∧ µ P(t) > 0 ∧ µ S(t) > 0 }
Similarly, a multiqualification (multiple conjunction) can be expressed as a criterion in database vague queries.
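The min-based conjunction of Definition 4 can be sketched as follows; the membership ramps approximate Figure 1, and other t-norms (product, Łukasiewicz) could be substituted for min:

```python
def mu_high(speed):
    # ramp from 220 (degree 0) to 240 (degree 1), approximating Figure 1
    return max(0.0, min(1.0, (speed - 220) / 20.0))

def mu_inexpensive(price):
    # descending ramp from 25138 (degree 1) to 32474 (degree 0), cf. Figure 1
    return max(0.0, min(1.0, (32474 - price) / 7336.0))

cars = [("B3", 226, 31562), ("P 607", 222, 31268),
        ("C 300M", 230, 32000), ("B Coupe", 250, 39000)]

# "inexpensive AND high speed": the min t-norm as the fuzzy conjunction
answer = [(n, round(min(mu_high(s), mu_inexpensive(p)), 2))
          for n, s, p in cars
          if min(mu_high(s), mu_inexpensive(p)) > 0]
print(answer)   # [('B3', 0.12), ('P 607', 0.1), ('C 300M', 0.06)]
```

The computed degrees reproduce the global µ column of Table 4; "B Coupe" drops out because its price gives µ inexpensive = 0.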
Relative Object Qualification

People use an enormous number of expressions in their common language for requesting information. This fact motivated us to search for the most accurate model for as many queries as possible, so that the computational treatment and the response may be as adequate as possible. There are in the literature several approaches to modeling user preferences, which solve different situations, such as accepting tolerance, accepting different weights of importance for the requirements in a selection criterion, accepting conditional requirements, and so forth (e.g., Dubois & Prade, 1996). Our study has found a new class of problems, which require partitioning a limited subset of an attribute domain instead of the whole domain, in situations where dynamic modeling of the linguistic values is necessary. This is the case when the selection criteria are not independent but combined in a way that expresses a user preference. Two gradual properties are combined in a complex selection criterion such that the second one is applied on a subset of database rows, already selected by the first one. We assume
that the second gradual property is expressed by a linguistic value of a database attribute, that is, a label from the attribute's linguistic domain. In this case, modeling the linguistic domain of the second attribute requires taking into account not the whole crisp attribute domain, but a limited subset, characteristic of the database rows selected by the first criterion.
The AMONG Operator: Relative Qualification to Another Gradual Property

Let us consider as an example the following query, based on a complex fuzzy selection criterion, addressed to the car table:

Retrieve the inexpensive cars among the high speed ones.

The query evaluation procedure observes the following steps:

Algorithm
1. The selection criterion high speed cars is evaluated, taking into account the definition in Figure 1; an intermediate result is obtained, containing the rows where the condition µ high(t) > 0 is satisfied (Table 3).
2. The interval containing the prices of the selected cars forms the Price subdomain [28450, 69154]; this is the one considered later, instead of [10466, 69154].
3. The linguistic value set {inexpensive, medium, expensive} is scaled to fit this subdomain (Figure 2; in order to mark the difference, the new definitions are labeled in capital letters).
4. The selection criterion inexpensive cars is evaluated, taking into account the definition in Figure 2; the fulfillment degree µ INEXPENSIVE is computed for each row of the intermediate result from step 1.
5. The global fulfillment degree (µ) results for each tuple; tuples are selected if µ(t) > 0 (the shaded rows in Table 5).

Figure 2. Linguistic values defined on a subdomain (INEXPENSIVE, MEDIUM, and EXPENSIVE are defined on the Price subdomain [28450, 69154], with breakpoints at 38626, 43714, 48802, and 58978, instead of the original domain [10466, 69154])
At this point, a new fuzzy aggregation operator can be defined in order to model the relative qualification in queries like "P AMONG S."

Definition 5. The fuzzy model of the relative conjunction, AMONG(P, S), of two gradual properties, P and S, associated with two attributes, A1 and A2, is defined by the mapping:
µ P AMONG S : D1 × D2 → [0,1] or µ P AMONG S : [a1,b1] × [a2,b2] → [0,1],
(v1, v2) ↦ min( µ P/S(v1), µ S(v2) )
The same fulfillment degree, defined on a database table R, is:
µ P AMONG S : R → [0,1],
t ↦ min( µ P/S(t), µ S(t) ) = min( µ P/S(t.A1), µ S(t.A2) )
where t is a tuple, µ S is the membership function defining the gradual property S, and µ P/S is the fulfillment degree of the first criterion (P) relative to the second one (S). In Table 5, µ INEXPENSIVE stands for µ P/S, that is, µ inexpensive/high, and µ stands for the global selection criterion, computed as µ AMONG. The membership function µ P/S is a transformation of the initial membership function µ P, obtained by translation and compression, as follows. After the first selection, based on the property S associated with the attribute A2, the initial domain [a,b] of the attribute A1 becomes more limited, that is, the interval [a',b'] (Figure 3).
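The translation-and-compression transformation just described can be sketched as a quick check of its endpoint properties; the interval values reuse the Price example, and the "inexpensive" ramp is an approximation for illustration:

```python
def make_f(a, b, a2, b2):
    """Affine rescaling f : [a', b'] -> [a, b] (equation (1) in the text):
    f(x) = a + (b - a) * (x - a') / (b' - a')."""
    return lambda x: a + (b - a) * (x - a2) / (b2 - a2)

# Example: original Price domain [10466, 69154], restricted to [28450, 69154]
f = make_f(10466, 69154, 28450, 69154)
print(f(28450), f(69154))   # 10466.0 69154.0  (interval limits map to a and b)

# mu_{P/S} is the original membership composed with f: mu_{P/S}(v) = mu_P(f(v))
mu_P = lambda v: max(0.0, min(1.0, (32474 - v) / 7336.0))   # assumed "inexpensive"
mu_P_over_S = lambda v: mu_P(f(v))
print(mu_P_over_S(28450))   # 1.0 -> the cheapest selected car is fully INEXPENSIVE
```

Because µ P/S is just µ P composed with the affine map f, any modeling method used to build µ P is preserved by the rescaling.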
Table 5. The "inexpensive cars among the high speed ones" table

Name     Max Speed  Price   µ high  µ INEXPENSIVE  µ
P 911 C  280        65000   1       0.00           0.00
B Coupe  250        39000   1       0.92           0.92
B7       243        57200   1       0.00           0.00
MBS      240        69154   1       0.00           0.00
AA       236        46000   0.80    0.00           0.00
C 300M   230        32000   0.50    1              0.50
B3       226        31562   0.30    1              0.30
P 607    222        31268   0.10    1              0.10
AA4      221        28450   0.05    1              0.05
Figure 3. Restriction of the attribute domain for a relative qualification (the membership function µ P, defined on [a, b], is translated and compressed into µ P/S, defined on the subdomain [a', b'])

Thus, if µ P : [a,b] → [0,1], then µ P/S : [a',b'] → [0,1], so that µ P/S = µ P ∘ f, where f is the transformation:
f : [a',b'] → [a,b], f(x) = a + ((b − a)/(b' − a')) · (x − a')   (1)
Therefore:
v ↦ µ P/S(v) = µ P( a + ((b − a)/(b' − a')) · (v − a') )   (2)
For the particular cases:
a. identical transformation: a' = a, b' = b ⇒ f(x) = x
b. interval limits: x = a' ⇒ f(a') = a; x = b' ⇒ f(b') = b

Note. The expression of the membership function µ P/S is defined based only on the original membership function µ P and does not depend on the algorithm for modeling it. Consequently, the method of defining the linguistic values is preserved by the transformation f.

Definition 6. The algebraic model of the AMONG operator is:
µ P AMONG S : R → [0,1]
µ P AMONG S(t) = min( µ P( a1 + ((b1 − a1)/(b1' − a1')) · (t.A1 − a1') ), µ S(t.A2) )   (3)
where [a1', b1'] ⊆ [a1, b1] is the sub-interval of the attribute A1 values corresponding to the table Q S(R) (obtained by the first selection, on the attribute A2, using property S).

The new operator stands for the model of a certain fuzzy conjunctive aggregation, but it cannot be considered a triangular norm. Regarding the properties of a triangular norm, one can remark:

i. commutativity is not satisfied by the AMONG operator:
AMONG(P, S) ≠ AMONG(S, P)
because µ P/S(t) ≠ µ S/P(t) and µ S(t) ≠ µ P(t) ⇒ min( µ P/S(t), µ S(t) ) ≠ min( µ S/P(t), µ P(t) ), ∀t, and because, semantically, such queries cannot be compared (remark ii below).

ii. associativity is satisfied by the AMONG operator (see Exhibit A), and because, semantically, such queries reflect the same idea. For example:
Retrieve the (inexpensive cars among the high speed ones) selected from the low fuel consumption ones.
Retrieve the inexpensive cars selected from (the high speed cars among the low fuel consumption ones).

iii. monotonicity is not satisfied by the AMONG operator:
¬( AMONG(P, S) ≤ AMONG(P', S') if P ≤ P' and S ≤ S' )
because, although
P ≤ P' (as fuzzy models) ⇒ µ P(t) ≤ µ P'(t), ∀t ⇒ µ P/S(t) ≤ µ P'/S(t), ∀t, ∀S
and S ≤ S' (as fuzzy models) ⇒ µ S(t) ≤ µ S'(t), ∀t ⇒ [a1', b1']S ⊆ [a1', b1']S',
the comparison µ P/S(t) ≤ µ P/S'(t), ∀t, is not always true, even if [a1', b1']S ⊆ [a1', b1']S'. We denote by [a1', b1']S and [a1', b1']S' the sub-intervals of the A1 attribute values from the tables Q S(R) and Q S'(R) (obtained by the first selections, on the attribute A2, using property S and property S', respectively).

iv. the unit element is satisfied by the AMONG operator:
AMONG(P, 1) = P
because min( µ P/1(t), 1 ) = min( µ P(t), 1 ) = µ P(t), ∀t

Exhibit A. AMONG(P, AMONG(S, T)) = AMONG(AMONG(P, S), T), because
min( µ P/(S/T)(t), min( µ S/T(t), µ T(t) ) ) = min( min( µ (P/S)/T(t), µ S/T(t) ), µ T(t) ) = min( µ P/S/T(t), µ S/T(t), µ T(t) ), ∀t

Exhibit B.
Q P/S : 2^(D1 × D2 × … × Dn) → 2^(D1 × D2 × … × Dn × (0,1])
R ↦ { (t, min( µ P/S(t), µ S(t) )) | t ∈ R ∧ µ P/S(t) > 0 ∧ µ S(t) > 0 }

Queries like the one above include two gradual properties in
a special relationship: the second one refers to objects selected by the first one. That means the objects are selected by a qualification relative to another gradual property.

Definition 7. The vague query based on a qualification relative to another gradual property is an application (see Exhibit B).

Some remarks are interesting and very important:

i. A quite different query expression:
Retrieve the most inexpensive cars among the high speed ones
can be assimilated with the previous one, so it can be submitted to the same evaluation procedure. The most inexpensive criterion is not equivalent to the relational aggregation MIN operation on the whole car table (Table 1); it corresponds to a fuzzy selection on a fuzzy table. Moreover, this query expression may be even more suggestive for the database user and semantically adequate to the response in Table 5.

ii. The new aggregation operator, AMONG, is not commutative: inverting the two criteria leads to a different query answer. Actually, considering the semantics of the operation, the AMONG operator models exactly the importance level of the criteria, according to the user's preference. For example, let us compare the queries:
Retrieve the inexpensive cars among the high speed ones. (µ = min(µ inexpensive/high, µ high))
and
Table 6. The "high speed cars among the inexpensive ones" table

Name     Max Speed  Price   µ inexpensive  µ HIGH  µ
IO       145        24000   1              0.00    0.00
MC       190        18200   1              0.09    0.09
M L200   145        19095   1              0.00    0.00
NV       132        15883   1              0.00    0.00
OA       186        16042   1              0.00    0.00
OV       208        20669   1              1       1
OZ       178        18364   1              0.00    0.00
P 206    170        10466   1              0.00    0.00
P 806    177        20633   1              0.00    0.00
RC       186        12138   1              0.00    0.00
OCS      120        26259   0.85           0.00    0.00
LRD      130        28000   0.61           0.00    0.00
AA4      221        28450   0.54           1       0.54
P 607    222        31268   0.16           1       0.16
B3       226        31562   0.13           1       0.13
C 300M   230        32000   0.07           1       0.07
Retrieve the high speed cars among the inexpensive ones. (µ = min(µ high/inexpensive, µ inexpensive))

The difference is evident when looking comparatively at Tables 5 and 6 (the finally selected rows are shaded).

iii. When looking comparatively at Tables 4 and 5, one can observe the difference between the conjunctive criterion "P AND S" (multiqualification) and the new kind of selection criterion "P AMONG S" (relative qualification). Semantically, the AND operator combines in one selection two independent criteria having the same importance (priority) for the user's preferences. On the contrary, the AMONG operator has to evaluate the second criterion prior to the first one. This is a supplementary argument for the noncommutativity of the AMONG operator.

iv. After a practical study of the use of relative qualification, we observed:
a. Generally, the query formed by the two combined properties searches for quite disjoint object categories. The answer of an AND conjunction is sometimes empty. On the contrary, the AMONG operator evaluates the second selection on a non-empty set of objects, by adapting the model of the gradual property to the already selected objects. Obviously, the answer will be non-empty, and it will be adequate to the user's expectations.
b. The AMONG operator does not give spectacular answers when the two properties refer to approximately the same objects. For example, compare the query:

Retrieve the expensive cars among the high speed ones. (the shaded rows in Table 7)

to the query:

Retrieve the expensive and high speed cars. (the shaded rows in Table 8)

Table 7. The "expensive cars among the high speed ones" table

Name     Max Speed  Price   µ high  µ expensive/high  µ
P 911 C  280        65000   1       1                 1
B Coupe  250        39000   1       0.00              0.00
B7       243        57200   1       0.65              0.65
MBS      240        69154   1       1                 1
AA       236        46000   0.80    0.00              0.00
C 300M   230        32000   0.50    0.00              0.00
B3       226        31562   0.30    0.00              0.00
P 607    222        31268   0.10    0.00              0.00
AA4      221        28450   0.05    0.00              0.00

Table 8. The "expensive and high speed cars" table

Name     Max Speed  Price   µ high  µ expensive  µ
P 911 C  280        65000   1       1            1
B Coupe  250        39000   1       0.00         0.00
B7       243        57200   1       1            1
MBS      240        69154   1       1            1
AA       236        46000   0.80    0.00         0.00
C 300M   230        32000   0.50    0.00         0.00
B3       226        31562   0.30    0.00         0.00
P 607    222        31268   0.10    0.00         0.00
AA4      221        28450   0.05    0.00         0.00

v. Therefore, the most typical situation in which this evaluation method is applicable is when the two criteria are in a special semantic relationship: the first selection brings a hard limitation of the class of objects, and dynamic modeling of the linguistic values for the second selection becomes useful. The above procedure is not difficult to implement if we consider it a sequence of several operations. An original and efficient method to evaluate this kind of query is proposed in the section titled The Unified Model of the Context, where the knowledge base (fuzzy definitions of the linguistic vague terms) is incorporated in the database.

vi. The fuzzy predicates used to evaluate the criterion at the first step of the procedure can be previously defined, but this is not mandatory. On the contrary, at step 3, an algorithm has to be used in order to automatically obtain the adapted definitions. Various algorithms for dynamically defining linguistic values of database attributes are proposed in the section titled Dynamic Modeling of the Linguistic Values.

vii. Similar procedures can be used to evaluate more complex queries including relative qualification, for example:

How many inexpensive cars are among the high speed ones?

We need to mention that the aggregate computation on groups (how many) is not the subject of the present chapter (see, e.g., Blanco, Delgado, Martin-Bautista, Sánchez, & Vila, 2002; Delgado, Sánchez, & Vila, 2000; Rundensteiner & Bic, 1991); we consider only the fuzzy aggregation implementing the relative qualification (the AMONG operator).

Relative Qualification to Another Crisp Attribute

At least one more situation is relatively frequent: when the linguistic values must be dynamically defined for an attribute subdomain obtained after a crisp selection. It is about a complex selection criterion that includes a gradual property referring to the database rows already selected by a crisp value. Let us imagine a table (Table 9) containing all the sales transactions of a national company. The following query must take into account the principle that, generally, the amount of sales differs from city to city (from the biggest to the smallest):

Retrieve the clients in Galati which get large quantities of our product

The selection criterion "large quantity" has a different meaning in different cities. The query evaluation procedure follows the same steps as in the previous section:
Table 9. The transactions in sales table

Client  ...  Quantity  City
AA           70        Galati
AA4          21        Tecuci
B3           67        Galati
B7           200       Bucharest
BC           30        Galati
CM           230       Bucharest
IO           145       Bucharest
LRD          130       Galati
MBS          24        Tecuci
MC           90        Tecuci
ML           145       Bucharest
NV           132       Galati
OA           86        Tecuci
OCS          120       Galati
OF           102       Galati
OV           8         Tecuci
OZ           17        Tecuci
P2           166       Galati
P6           222       Bucharest
P8           177       Bucharest
P9C          28        Tecuci
RC           186       Bucharest
One can remark that a large quantity at Galati (for example, 130) is less than the minimum at Bucharest (i.e., 145). This is the reason why the definitions of the linguistic values must be adapted to the context; that means that the qualification (large quantity) is relative to the other, crisp attribute (city). The presented situation is a special case of the previous one. The evaluation procedure is the same, using the same AMONG operator; the simplification consists in the character of the second property (S), which is a crisp and not a gradual property, and for which the classical selection operation is enough.
Table 10. The transactions in Galati city table

Client  ...  Quantity  City
AA      ...  70        Galati
B3      ...  67        Galati
BC      ...  30        Galati
LRD     ...  130       Galati
NV      ...  132       Galati
OCS     ...  120       Galati
OF      ...  102       Galati
P2      ...  166       Galati
The crisp selection criterion city='Galati' is classically evaluated and an intermediate result is obtained (Table 10). The interval covering the quantities of the selected sales forms the quantity subdomain [30, 166], instead of [8, 230]. The linguistic value set {small, medium, large} is then defined on this new subdomain (Figure 4). The fuzzy selection criterion large quantity is evaluated according to the new definitions, and a fulfillment degree results for each tuple (Table 11).
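The steps above can be sketched as follows (an illustrative Python sketch, not the chapter's implementation; the rows are those of Table 9, and the shape of 'large' uses the uniform covering α = (S − I)/8, β = (S − I)/4 introduced later in formula (5)):

```python
# Table 9: (client, quantity, city) -- all sales transactions.
SALES = [("AA", 70, "Galati"), ("AA4", 21, "Tecuci"), ("B3", 67, "Galati"),
         ("B7", 200, "Bucharest"), ("BC", 30, "Galati"), ("CM", 230, "Bucharest"),
         ("IO", 145, "Bucharest"), ("LRD", 130, "Galati"), ("MBS", 24, "Tecuci"),
         ("MC", 90, "Tecuci"), ("ML", 145, "Bucharest"), ("NV", 132, "Galati"),
         ("OA", 86, "Tecuci"), ("OCS", 120, "Galati"), ("OF", 102, "Galati"),
         ("OV", 8, "Tecuci"), ("OZ", 17, "Tecuci"), ("P2", 166, "Galati"),
         ("P6", 222, "Bucharest"), ("P8", 177, "Bucharest"), ("P9C", 28, "Tecuci"),
         ("RC", 186, "Bucharest")]

def large_membership(I, S):
    """'large' on [I, S], uniform covering: rises on [I+2b+a, I+2b+2a]."""
    a, b = (S - I) / 8, (S - I) / 4
    lo, hi = I + 2 * b + a, I + 2 * b + 2 * a
    def mu(v):
        if v <= lo: return 0.0
        if v >= hi: return 1.0
        return (v - lo) / a
    return mu

# Step 1: crisp selection city = 'Galati' (Table 10).
galati = [r for r in SALES if r[2] == "Galati"]
# Step 2: the quantity subdomain becomes [30, 166] instead of [8, 230].
I = min(q for _, q, _ in galati); S = max(q for _, q, _ in galati)
# Steps 3-4: define 'large' on the subdomain and grade each tuple (Table 11).
mu = large_membership(I, S)
degrees = {c: round(mu(q), 2) for c, q, _ in galati if mu(q) > 0}
```

With these definitions the sketch reproduces the degrees of Table 11 (P2 and NV at 1, LRD at 0.88, OCS at 0.29).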
µ_P AMONG S : R → [0,1],  t ↦ µ_P/S(t.A1)

µ_P AMONG S (t) = µ_P( a1 + ((b1 − a1)/(b1' − a1'))·(t.A1 − a1') )    (4)
where [a1', b1'] ⊆ [a1, b1] is the sub-interval of the attribute A1 values in the table Q_S(R) (obtained by the first selection, on the attribute A2, using property S). One can observe that, this time, the property S contributes to the criteria satisfaction degree only by limiting the domain of the property P: [a1, b1] becomes [a1', b1'].
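Since the uniform partitioning of a domain is affine in its limits, applying µ_P (defined on the full domain) to the value rescaled as in formula (4) gives the same degree as redefining the linguistic value directly on the subdomain. A quick numerical check of this equivalence (Python sketch; the quantity domains [8, 230] and [30, 166] come from Tables 9 and 10, the uniform covering α = (S − I)/8, β = (S − I)/4 from formula (5)):

```python
def large(I, S, v):
    """'large' on [I, S] with a = (S-I)/8, b = (S-I)/4 (uniform covering)."""
    a, b = (S - I) / 8, (S - I) / 4
    lo, hi = I + 2 * b + a, I + 2 * b + 2 * a
    return 0.0 if v <= lo else 1.0 if v >= hi else (v - lo) / a

a1, b1 = 8, 230      # full quantity domain (Table 9)
a1p, b1p = 30, 166   # subdomain after the crisp selection city='Galati'

def among_degree(v):
    # Formula (4): rescale v from [a1', b1'] to [a1, b1], then apply mu_P.
    return large(a1, b1, a1 + (b1 - a1) / (b1p - a1p) * (v - a1p))

# Rescaling into the full-domain definition equals the subdomain definition.
for v in (120, 130, 132, 166):
    assert abs(among_degree(v) - large(a1p, b1p, v)) < 1e-12
```

In particular, among_degree(130) ≈ 0.88 and among_degree(120) ≈ 0.29, matching Table 11.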
Figure 4. Linguistic values (small, medium, large) defined on a subdomain (quantity axis: 30, 64, 81, 108, 115, 132)
Table 11. Transactions of large quantities in Galati city

Client  ...  Quantity  City    µ
P2      ...  166       Galati  1
NV      ...  132       Galati  1
LRD     ...  130       Galati  0.88
OCS     ...  120       Galati  0.29

Table 12. Transactions of large quantities AND in Galati city

Client  ...  Quantity  City    µ
P2      ...  166       Galati  0.98

Note. If the above query is interpreted as a multiqualification and not as a relative qualification, the answer will be Table 12, absolutely different from Table 11. Therefore, a more suggestive formulation of the query would be:

Retrieve the clients which get large quantities of our product among the clients in Galati

Relative Qualification to Group on Other Attribute

Let us start with an example. The queries in the previous paragraph assume that all sales refer to the same product ("our product"); that is, the quantities can be compared. But let us consider now that the sales of different products are stored in the same database (Table 13). In this case, the query

Retrieve the clients which get large quantities of soap

expresses a qualification relative to a crisp attribute and can be evaluated as above. But if the query is

Retrieve the clients which get large quantities

then the values of the quantity attribute for the different products cannot be compared; the evaluation is impossible. This example suggests evaluating the large quantity criterion by taking into account one product at a time, that is:

Retrieve the clients which get large quantities of some product

For each product, the linguistic value is defined on the interval of the quantity values existing in the database for that product only. According to the definitions in Figure 5, the answer is given in Table 14. One can remark the "higher weight" of the 11 vacuum cleaners compared to the 162 envelopes.
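A sketch of this per-product evaluation (Python; the grouping logic follows the text, while the concrete membership shapes use the uniform covering of formula (5), so the computed degrees differ slightly from Figure 5 and Table 14):

```python
from collections import defaultdict

# Table 13: (client, quantity, product).
SALES = [("AA", 70, "soap"), ("AA4", 11, "vacuum cleaner"), ("B3", 6, "soap"),
         ("BC", 30, "soap"), ("CM", 230, "envelope"), ("IO", 145, "envelope"),
         ("MBS", 4, "vacuum cleaner"), ("ML", 162, "envelope"), ("NV", 10, "soap"),
         ("OA", 14, "vacuum cleaner"), ("OCS", 2, "soap"), ("OF", 102, "soap"),
         ("OV", 1, "vacuum cleaner"), ("OZ", 1, "vacuum cleaner"),
         ("P6", 200, "envelope"), ("P8", 70, "envelope"),
         ("P9C", 2, "vacuum cleaner"), ("RC", 18, "envelope")]

def large(I, S, v):
    """'large' on [I, S] with a = (S-I)/8, b = (S-I)/4 (uniform covering)."""
    a, b = (S - I) / 8, (S - I) / 4
    lo, hi = I + 2 * b + a, I + 2 * b + 2 * a
    return 0.0 if v <= lo else 1.0 if v >= hi else (v - lo) / a

# Group rows by product, define 'large' on each product's own quantity
# subdomain, then grade each row within its group.
by_product = defaultdict(list)
for client, qty, product in SALES:
    by_product[product].append((client, qty))

answer = {}
for product, rows in by_product.items():
    I = min(q for _, q in rows); S = max(q for _, q in rows)
    for client, qty in rows:
        d = large(I, S, qty)
        if d > 0:
            answer[client] = round(d, 2)
```

As in Table 14, the 11 and 14 vacuum cleaners come out fully "large" while the 162 envelopes are only partially so.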
Dynamic Modeling of the Linguistic Values

Our study has identified a new class of queries, where two fuzzy criteria are combined in a complex selection criterion such that the second fuzzy criterion is applied on a subset of database rows already selected by the first one. We assume that the secondly applied fuzzy criterion is expressed by a linguistic value of a database attribute, which is a gradual property; not an absolute property, but a relative one. In this case, modeling the linguistic domain of the second attribute requires taking into account not the whole crisp attribute domain, but a limited subset, characteristic of the rows selected by the first criterion. Actually, the main problem of relative qualification is how to dynamically define the linguistic values on the subdomains (step 3 of the above algorithm), depending on the current context.
Table 13. Transactions of various product sales

Client  ...  Quantity  Product         ...
AA      ...  70        soap
AA4     ...  11        vacuum cleaner
B3      ...  6         soap
BC      ...  30        soap
CM      ...  230       envelope
IO      ...  145       envelope
MBS     ...  4         vacuum cleaner
ML      ...  162       envelope
NV      ...  10        soap
OA      ...  14        vacuum cleaner
OCS     ...  2         soap
OF      ...  102       soap
OV      ...  1         vacuum cleaner
OZ      ...  1         vacuum cleaner
P6      ...  200       envelope
P8      ...  70        envelope
P9C     ...  2         vacuum cleaner
RC      ...  18        envelope
Some procedures for the automatic discovery of the linguistic value definitions can be implemented, with a great advantage: details regarding the effective attribute domain limits, or the distribution of the values, can easily be obtained by connecting directly to the database (more details in Tudorie, 2004; Tudorie & Dumitriu, 2004). Two example algorithms are presented in the following. We assume that there are three linguistic values, modeled as trapezoidal membership functions.

The first algorithm (uniform domain covering). In most applications, the defined set of linguistic values covers the referential domain almost uniformly (Figure 6).
• Obtaining the definitions of the three linguistic values l1, l2, and l3 on a database attribute starts from the predefined values α and β and from the attribute crisp domain limits I and S; the latter come from the database content. For example:

α = (1/8)·(S − I)  and  β = 2α = (1/4)·(S − I)    (5)

Table 14. Transactions of large quantities sales

Client  ...  Quantity  Product         ...  µ
CM      ...  230       envelope        ...  1
P6      ...  200       envelope        ...  1
OF      ...  102       soap            ...  1
OA      ...  14        vacuum cleaner  ...  1
AA4     ...  11        vacuum cleaner  ...  1
ML      ...  162       envelope        ...  0.53
AA      ...  70        soap            ...  0.46
Figure 5. Linguistic values (small, medium, large) defined on subdomains of the quantity attribute for each product (quantity axes: soap 2, 27, 39.5, 52, 64.5, 77, 102; vacuum cleaner 2, 5, 6.5, 8, 9.5, 11, 14; envelope 18, 70, 96, 122, 174, 226)

Figure 6. A set of linguistic values l1, l2, l3 uniformly distributed on an attribute domain [I, S]

The membership functions for l1, l2, and l3 are:

m_l1(v) = 1,                         I ≤ v ≤ I + β
m_l1(v) = 1 − (v − (I + β))/α,       I + β ≤ v ≤ I + β + α
m_l1(v) = 0,                         v ≥ I + β + α

m_l2(v) = 0,                         I ≤ v ≤ I + β
m_l2(v) = 1 − (I + β + α − v)/α,     I + β ≤ v ≤ I + β + α
m_l2(v) = 1,                         I + β + α ≤ v ≤ I + 2β + α
m_l2(v) = 1 − (v − (I + 2β + α))/α,  I + 2β + α ≤ v ≤ I + 2β + 2α
m_l2(v) = 0,                         v ≥ I + 2β + 2α

m_l3(v) = 0,                         I ≤ v ≤ I + 2β + α
m_l3(v) = 1 − (I + 2β + 2α − v)/α,   I + 2β + α ≤ v ≤ I + 2β + 2α
m_l3(v) = 1,                         v ≥ I + 2β + 2α
                                                          (6)
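The three functions of formula (6) can be written down directly; a quick check (an illustrative Python sketch) confirms that at every point of [I, S] the memberships of l1, l2, and l3 sum to 1, i.e., the uniform covering forms a fuzzy partition of the domain:

```python
def uniform_labels(I, S):
    """Build l1, l2, l3 on [I, S] per formula (6), with a=(S-I)/8, b=(S-I)/4."""
    a, b = (S - I) / 8, (S - I) / 4
    def ramp_down(v, start):
        # 1 before 'start', linear descent to 0 over [start, start + a].
        return min(1.0, max(0.0, 1.0 - (v - start) / a))
    def l1(v): return ramp_down(v, I + b)
    def l2(v): return min(1.0 - ramp_down(v, I + b), ramp_down(v, I + 2*b + a))
    def l3(v): return 1.0 - ramp_down(v, I + 2*b + a)
    return l1, l2, l3

l1, l2, l3 = uniform_labels(0, 80)   # here a = 10, b = 20
```

Usage: l1 is 1 on [0, 20] and falls to 0 at 30; l2 plateaus on [30, 50]; l3 rises on [50, 60] and is 1 beyond.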
where v = t.A is a value in the domain D = [I, S] of an attribute A of a table R. Based on this idea, a software interface, the FuzzyKAA system, presented in Tudorie (2006a, 2006b), is able to assist the user in defining linguistic values in a database context. Starting from a uniform partitioning of the attribute domain, the user can adjust the shapes either by changing the numerical coordinates of graphical points or by directly manipulating them.

• The second algorithm (statistical mean-based) takes into account the real distribution of the attribute values in the database. The idea is to center the middle trapezium on the statistical mean (M) of the attribute values; the other membership functions are distributed to the left and to the right over the rest of the interval (Figure 7). In this case, the basic data used to determine the fuzzy models are the attribute crisp domain limits (I and S), together with the statistical mean value M in the [I, S] interval:

M = ( Σ_{i=1..n} t_i.A ) / n

where n is the cardinality of the relation R and t_i ∈ R is a tuple. The values of α, β, and α' are based on I, S, and M; they can be:

α = (1/4)·min(M − I, S − M)
β = 2α = (1/2)·min(M − I, S − M)
α' = (S − I) − (7/4)·min(M − I, S − M)

If 0 < α < (1/8)·(S − I) and 0 < β < (1/4)·(S − I), then (1/8)·(S − I) < α' < (S − I).

Figure 7. Linguistic values l1, l2, and l3 defined on the basis of the statistical mean M of attribute A over the domain [I, S]
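The parameter computation of this mean-based algorithm can be sketched as follows (Python; an illustrative sketch with a made-up value sample, not the chapter's implementation):

```python
def mean_based_params(values):
    """Return (I, S, M, a, b, a_prime) for the mean-centered partitioning."""
    I, S = min(values), max(values)
    M = sum(values) / len(values)   # statistical mean inside [I, S]
    m = min(M - I, S - M)           # distance from M to the nearest limit
    a = m / 4
    b = m / 2                       # b = 2a
    a_prime = (S - I) - 7 * m / 4
    return I, S, M, a, b, a_prime

# Hypothetical attribute values, for illustration only.
I, S, M, a, b, ap = mean_based_params([10, 20, 30, 100])
```

For this sample, M = 40 sits closer to I, so the middle trapezium is centered there; the stated bounds on α, β, and α' can be verified directly.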
The formulae of the membership functions for the linguistic values l1, l2, and l3 depend on the position (asymmetry) of the statistical mean within the [I, S] interval. They are obtained in a similar way to the above algorithm. The fuzzy models obtained by this method seem to be closer to the meaning accepted in the user's mind. It is important to remark that the same online method of modeling the linguistic domain of a database attribute can be used at any time, instead of an off-line process of knowledge acquisition from a human expert.
The Unified Model of the Context

Generally, an intelligent interface for flexible database querying is an extra layer using its own data (a knowledge base) containing the fuzzy model of the linguistic terms included in vague queries. Such an interface must be conceived to be as general as possible, that is, able to connect to any database, assuming that the corresponding knowledge base is already available (Figure 8). FSQL (Galindo et al., 2006), for example, uses an FMB (Fuzzy Metaknowledge Base) with the definitions of labels, quantifiers, and more information about the fuzzy capabilities. The fuzzy query evaluation is made possible by building an equivalent crisp query. The knowledge (the fuzzy model of the linguistic terms) is used first for building the SQL query and afterwards for computing the fulfillment degree of each tuple. The context is defined in this case as the pair formed by the database and the knowledge base corresponding to it. The functionality and the utility of such an intelligent interface have been practically proved by the software systems presented in the next section. One of the most important points of an interface to databases is performance, more specifically, the response time of query evaluation. In order to obtain good performance, an efficient solution is to model the context in a uniform approach, as a single database, incorporating the fuzzy model of the linguistic terms, or their description, in the target database. So, a unified model of the context is proposed in the following. There are two possibilities to model a unified context:
• Static Context: Including in the database the static definitions of the linguistic terms (their fuzzy models), established a priori, before the querying process.
Figure 8. The flexible interface integrated in the querying system (a database server hosting Database 1, …, Database n, each with its associated fuzzy knowledge base KB 1, …, KB n, accessed through the flexible interface for database querying)
• Dynamic Context: Including in the database only the data necessary to dynamically define the linguistic terms, at the moment of (or during) the querying process.
In the first case, only absolute qualification or multiqualification queries can be processed. On the contrary, in order to evaluate relative qualification queries, the second model must be adopted; in other words, the fulfillment degree is dynamically computed by taking into account the subdomains of the attributes.

Static Model of the Context

The fuzzy model of the linguistic terms can be described by various methods. Some complex graphical interfaces have been developed in the Computer Science Department of the "Dunarea de Jos" University; they are presented in Tudorie (2006a, 2006b) and Tudorie, Neacsu, and Manolache (2005), as well as in the next section of this chapter. Usually, the shape of the membership function of a fuzzy set is trapezoidal. However, we chose a more general model, a polygonal shape (Figure 9). In this case, the knowledge base can be modeled as a set of tables, which can be incorporated in the database. One possible unified model of the context is presented in Figure 10. Table1, Table2, …, Tablen are the tables of the target database; Terms and Points contain the description of the linguistic value shapes.

Figure 9. A possible model of the linguistic domain of a database attribute A (a polygonal membership function described by the points pi11, pf11, pf12, pf13, pi21, pf14, pf21, pf22, pi31, pf23, pf31, pf24, pi41, pf32, pf41, pf42 on the A axis)

Figure 10. The unified context model, based on static definitions of the linguistic values

Dynamic Model of the Context

The section titled Relative Object Qualification has presented certain types of queries that require dynamically defining the linguistic values by partitioning an attribute subdomain already obtained by a previous selection. Moreover, if we accept that the dynamic definition is suitable to the user's perception, then even the initial acquisition of the knowledge can be realized dynamically, with minimal involvement of the user (knowledge engineer). In all these situations, the above model is no longer functional; this time, the complexity is transferred to expressions evaluated at querying time, which stand for the model of the linguistic values. At querying time, only the labels corresponding to the linguistic domain need to be known. Therefore, the Points table no longer exists in the context model; only the Terms table does. According to the proposed model of the context, the vague query evaluation consists in building a single crisp SQL query, which provides the searched database objects and, at the same time, the degree of criteria satisfaction for each of them. Various situations will be analyzed by observing each of the two proposed models of the context, thus:

• an absolute qualification criterion can be evaluated in a static context or in a dynamic context;
• a multiqualification criterion can also be evaluated in both kinds of context;
• on the contrary, a relative qualification criterion can be evaluated only according to the dynamic context model.

In the following, we accept three linguistic values, represented by trapezoidal membership functions on the database attribute domain. For the dynamic definitions, we adopt the first algorithm proposed in the section titled Dynamic Modeling of the Linguistic Values. Any generalization is possible.

Evaluation of an Absolute Qualification Criterion

Let us consider a pseudo-SQL query generated from the user's interface (whatever the interface style: natural language, graphical, command language, etc.). The general form is:

SELECT * FROM table WHERE attribute = # 'term'
where the symbol #, preceding a linguistic term, denotes a gradual property. The crisp SQL query corresponding to the user's request, observing the context model based on static definitions, can be as shown in Equation 7. The EXPRESSION has to be replaced with the algebraic model of the fulfillment degree, taking into account the context (data and knowledge) in Figure 10 (see Equation 8). The crisp SQL query corresponding to the user's request, observing the context model based on dynamic definitions, can be as shown in Equation 9. Here the EXPRESSION has to be replaced with the algebraic model of the fulfillment degree, taking into account the context (data and knowledge) and formulae (5)-(6). For example, denoting by s the unit step function (s(x) = 1 for x ≥ 0, and 0 otherwise), the expression for the first linguistic value is:

m_l1(v) = s(v − I) − ((v − I − β)/α)·s(v − I − β) − s(v − I − β − α)·(1 − (v − I − β)/α)    (10)
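Expression (10) packs the piecewise trapezoid into one closed formula built on the step function, which is what allows it to be emitted as a single EXPRESSION inside the query of Equation 9. A numerical check (an illustrative Python sketch) that (10) agrees with the piecewise definition of m_l1:

```python
def s(x):
    """Unit step: 1 for x >= 0, else 0."""
    return 1.0 if x >= 0 else 0.0

def m_l1_expr(v, I, a, b):
    """Closed form (10), suitable for inlining into one SQL expression."""
    return (s(v - I)
            - (v - I - b) / a * s(v - I - b)
            - s(v - I - b - a) * (1.0 - (v - I - b) / a))

def m_l1_piecewise(v, I, a, b):
    # The trapezoid of formula (6): plateau, linear descent, zero.
    if v <= I + b:     return 1.0
    if v <= I + b + a: return 1.0 - (v - I - b) / a
    return 0.0

I, a, b = 0.0, 10.0, 20.0
for v in [0, 5, 20, 22.5, 25, 30, 35, 80]:
    assert abs(m_l1_expr(v, I, a, b) - m_l1_piecewise(v, I, a, b)) < 1e-12
```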
Equation 7.
SELECT r.*, EXPRESSION AS "degree"
FROM table r, terms t, points p
WHERE t.table='table' AND t.attribute='attribute' AND t.label='term' AND
      t.ID_t = p.ID_t AND r.attribute >= p.pi AND r.attribute <= p.pf AND
      degree > 0
ORDER BY degree DESC;

Equation 8.
...
WHERE t.table='table' AND t.attribute='attribute' AND t.label='term' AND
      t.ID_t = p.ID_t AND r.attribute >= p.pi AND r.attribute <= p.pf AND
      degree > 0
ORDER BY degree DESC;
Equation 9.
SELECT r.*, EXPRESSION AS "degree"
FROM table r, terms t
WHERE t.table='table' AND t.attribute='attribute' AND t.label='term' AND
      degree > 0
ORDER BY degree DESC;
where v = r.attribute, and I and S are the limits of the attribute domain:

I = (SELECT MIN(r.attribute) FROM table r)
S = (SELECT MAX(r.attribute) FROM table r)

and α and β are:

α = (1/8)·(S − I), β = (1/4)·(S − I)
...
WHERE t1.table='table' AND t1.attribute='attribute1' AND t1.label='term1' AND
      t1.ID_t = p1.ID_t AND r.attribute1 >= p1.pi AND r.attribute1 <= p1.pf AND
      r.attribute2 >= p2.pi AND r.attribute2 <= p2.pf AND
      degree > 0
ORDER BY degree DESC;
Equation 16.
SELECT r.*, LEAST(EXPRESSION1, EXPRESSION2) AS "degree"
FROM table r, terms t1, terms t2
WHERE t1.table='table' AND t1.attribute='attribute1' AND t1.label='term1' AND
      t2.table='table' AND t2.attribute='attribute2' AND t2.label='term2' AND
      degree > 0
ORDER BY degree DESC;
The conjunction of the two criteria is expressed by

LEAST(EXPRESSION1, EXPRESSION2)

where EXPRESSION1 and EXPRESSION2 are the satisfaction degrees of the two criteria, and LEAST corresponds to the mathematical min function (in Oracle). Thus, for a query like Equation 15, the SQL query according to a dynamic context is as shown in Equation 16. These command lines can be considered the algorithmic model of the AND conjunction operator in a database fuzzy querying context.
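Row by row, this LEAST-based conjunction is simply the min of the two satisfaction degrees. A compact sketch (Python; the car rows and the two membership shapes here are hypothetical, only the min-combination mirrors the chapter's model):

```python
def trapezoid(a, b, c, d):
    """Membership rising on [a, b], 1 on [b, c], falling on [c, d]."""
    def mu(v):
        if v <= a or v >= d: return 0.0
        if v < b:  return (v - a) / (b - a)
        if v <= c: return 1.0
        return (d - v) / (d - c)
    return mu

# Hypothetical car rows: (name, price, top_speed).
CARS = [("c1", 9000, 210), ("c2", 15000, 230), ("c3", 8000, 150)]
inexpensive = trapezoid(0, 0, 8000, 14000)    # on the price attribute
high_speed  = trapezoid(160, 200, 300, 300)   # on the speed attribute

# AND = LEAST(EXPRESSION1, EXPRESSION2): keep rows with a positive min
# degree, sorted descending like ORDER BY degree DESC.
result = sorted(((name, min(inexpensive(p), high_speed(sp)))
                 for name, p, sp in CARS), key=lambda x: -x[1])
result = [(n, round(d, 2)) for n, d in result if d > 0]
```

Here only c1 satisfies both criteria to a positive degree; c2 is too expensive and c3 too slow.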
Evaluation of a Relative Qualification Criterion: The Model of the AMONG Operator

This time, only the dynamic context can be considered, because the second criterion is always evaluated by taking into account the selection already obtained after the first criterion, and the linguistic values are always dynamically modeled on subdomains of the attribute. Thus, for a query like

SELECT * FROM table WHERE attribute1 = # 'term1' AMONG attribute2 = # 'term2'
the SQL query according to a dynamic context is as shown in Equation 17. The expression EXPRESSION2 corresponds to the first criterion and observes the above model, represented in formulae (10)-(13). But EXPRESSION1, which corresponds to the secondly evaluated criterion, refers to the attribute subdomain obtained by the first selection. So, in EXPRESSION1, the table will be replaced by the answer table Q, and the new values of the parameters will be:
Equation 17.
SELECT r.*, LEAST(EXPRESSION1, Q.firstdegree) AS "degree"
FROM ( SELECT r.*, EXPRESSION2 AS "firstdegree"
       FROM table r, terms t
       WHERE t.table='table' AND t.attribute='attribute2' AND t.label='term2' AND
             firstdegree > 0 ) AS Q,
     table r, terms t1
WHERE t1.table='table' AND t1.attribute='attribute1' AND t1.label='term1' AND
      degree > 0
ORDER BY degree DESC;
I = (SELECT MIN(Q.attribute1) FROM Q)
S = (SELECT MAX(Q.attribute1) FROM Q)
α = 1/8 * (SELECT MAX(Q.attribute1) − MIN(Q.attribute1) FROM Q)
β = 1/4 * (SELECT MAX(Q.attribute1) − MIN(Q.attribute1) FROM Q)
                                                              (18)

Figure 11. FuzzyQE interface for linguistic domain defining and database complex querying
The command line (17) can be considered the algorithmic model of the AMONG operator for the relative qualification in a database fuzzy querying context.
Retrieve the inexpensive cars among the high speed ones
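For a query such as the one above, the nested evaluation of Equation 17 can be mirrored directly: first grade the rows on the second ('high speed') criterion, take the price subdomain of the retained rows, grade the first ('inexpensive') criterion on that subdomain, and combine by min. A sketch in Python (the car data and the exact shapes are hypothetical; only the pipeline follows the AMONG model):

```python
def label_on(I, S, which):
    """'small'/'large' labels on [I, S], uniform covering of formula (5)."""
    a, b = (S - I) / 8, (S - I) / 4
    if which == "small":   # 1 on [I, I+b], falls to 0 at I+b+a
        return lambda v: 1.0 if v <= I + b else max(0.0, 1.0 - (v - I - b) / a)
    lo = I + 2 * b + a     # 'large': rises to 1 on [lo, lo+a]
    return lambda v: 0.0 if v <= lo else min(1.0, (v - lo) / a)

# Hypothetical rows: (name, price, top_speed).
CARS = [("c1", 33000, 250), ("c2", 21000, 240), ("c3", 45000, 260),
        ("c4", 12000, 140), ("c5", 60000, 230)]

# Step 1 (inner query of Equation 17): grade 'high speed' on the full
# speed domain; keep rows with a positive first degree (the table Q).
speeds = [sp for _, _, sp in CARS]
high = label_on(min(speeds), max(speeds), "large")
Q = [(n, p, high(sp)) for n, p, sp in CARS if high(sp) > 0]

# Step 2: 'inexpensive' is defined on the price subdomain of Q only,
# as in the parameters (18).
I = min(p for _, p, _ in Q); S = max(p for _, p, _ in Q)
inexpensive = label_on(I, S, "small")

# Step 3 (outer query): combine by min and keep positive degrees.
answer = {n: round(min(inexpensive(p), d), 2) for n, p, d in Q
          if min(inexpensive(p), d) > 0}
```

On this sample, c4 is discarded by the first (high speed) selection, while c3 and c5, although fast, are not inexpensive among the fast cars.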
Laboratory Software Tools for Database Flexible Querying

Some software tools have been developed for scientific purposes, to be used in studying flexible queries (Tudorie, 2006a, 2006b; Tudorie et al., 2005). These multifunctional systems enable the analysis of many aspects; the most important is the opportunity to study the relative qualification phenomenon and to validate the proposed algorithms for fuzzy query evaluation. All these systems support the acquisition of fuzzy models of the linguistic terms through graphical interfaces. Some of them are also able to interpret natural language (Romanian) queries. It is important to remark that these systems are general enough to enable connection, at any time, to any database and its associated knowledge base, newly created or already in use. Here are three examples:
Interface for Fuzzy Knowledge Acquisition and Database Fuzzy Querying (FuzzyQE System)

Goal: This software tool connects the user to any database and assists the user in defining linguistic values and fuzzy queries in that database context. The system proposes a uniform partitioning of the attribute domain; the implicitly obtained definitions can then be adjusted either by changing the numerical coordinates of graphical points or by directly manipulating them. Simple queries (absolute object qualification) over existing definitions, but also complex queries (relative object qualification), are evaluated by implementing the AMONG operator (Figure 11).
Multi-User System for Linguistic Values Modeling and Database Fuzzy Querying (MultiDef System)

Goal: This software tool connects several users (e.g., knowledge engineers) to the same database; each of them can describe each linguistic value of the database attributes. A defining process starts with an initial implicit model, which the user may modify according to his own semantics for the current linguistic term. An administrator, with his own interface, monitors and manages all this activity; at any moment he has a total view of all the membership functions drawn by the users for the same linguistic terms on the same attribute domain (Figure 12). Many types of queries can be evaluated.
Flexible Interface for Linguistic Values Modeling and Database Fuzzy Querying (CALIF System)

Goal: This software tool enables the connection to any database via a graphical interface, and linguistic values modeling (as fuzzy sets) by various algorithms, by choice. Three main types of queries (simple, conjunction, and relative) can be evaluated. The interface is very flexible, providing many ways of adjusting parameters, definitions, and options (Figure 13).
Figure 12. MultiDef interface for linguistic domain defining and database complex querying

Figure 13. CALIF interface for linguistic domain defining and database complex querying

Conclusion

This chapter formulates a number of new problems, not very complicated, but referring to quite frequent situations, which had not been discussed so far. The main aim of the chapter, and its originality, consists in the introduction of the concept of relative qualification of objects in the context of relational database querying. On this basis, some important new problems strongly related to relative qualification were developed (dynamic modeling of the linguistic values and the unified context based on dynamic definitions). Moreover, in an extended framework, this chapter discussed the problems of linguistic qualification of objects and of context modeling in all their aspects. More precisely, the complex criteria we studied (relative qualification) include two vague conditions in a special relationship: the first gradual property, expressed by a linguistic qualifier, is interpreted and evaluated relative to the second one; accordingly, the fulfillment degree is computed in a particular way. The main idea of the evaluation procedure is to dynamically define sets of linguistic values on limited attribute domains, determined by previous fuzzy selections. This is the reason why it is not useful to create the knowledge base with the fuzzy definitions a priori, but rather to define the vague terms included in queries each time they are needed. One more reason is the great advantage of connecting directly to the database: details regarding effective attribute domain limits, or distributions of the values, can easily be obtained. With this idea, we developed the problem of dynamic modeling of linguistic values. Methods for automatic extraction of the linguistic value definitions from the actual database attribute values, and solutions for uniformly modeling the context (database and knowledge base), were proposed. The theoretical contribution consists in the new fuzzy aggregation operator AMONG, defined in this chapter; it stands as the model of the relative selection criterion in database fuzzy queries. A detailed discussion of its semantics, properties, and other remarks is present in the chapter. Some implementations that validate all these ideas were also briefly presented; they have been developed and are running in the laboratory of our department. Future work will explore the implications of the newly proposed kind of query in real fields like business intelligence, OLAP, or data mining, but also other application fields of the new connective AMONG, like fuzzy control, or fuzzy databases, where one works with fuzzy values.
Acknowledgment The author is thankful for all the remarks made by the reviewers, particularly the editor. Many thanks also to the “English reviewer.”
References

Blanco, I., Delgado, M., Martín-Bautista, M. J., Sánchez, D., & Vila, M. A. (2002). Quantifier guided aggregation of fuzzy criteria with associated importances. In T. Calvo, R. Mesiar, & G. Mayor (Eds.), Aggregation operators: New trends and applications (Studies in Fuzziness and Soft Computing 97, pp. 272-290). Physica-Verlag.

Bosc, P., & Pivert, O. (1992). Fuzzy querying in conventional databases. In L. A. Zadeh & J. Kacprzyk (Eds.), Fuzzy logic for the management of uncertainty (pp. 645-671). New York: Wiley.

Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17.

Bosc, P., & Prade, H. (1997). An introduction to fuzzy set and possibility theory-based approaches to the treatment of uncertainty and imprecision in database management systems. In A. Motro & P. Smets (Eds.), Uncertainty management in information systems: From needs to solutions (pp. 285-324). Kluwer Academic Publishers.

Bouchon-Meunier, B. (1995). La logique floue et ses applications. Paris: Addison-Wesley.

Buckles, B. P., & Petry, F. E. (1982). A fuzzy representation of data for relational databases. Fuzzy Sets and Systems, 7(3), 213-226.

Delgado, M., Sánchez, D., & Vila, M. A. (2000). Fuzzy cardinality based evaluation of quantified sentences. International Journal of Approximate Reasoning, 23, 23-66.

Dubois, D., Ostasiewicz, W., & Prade, H. (1999). Fuzzy sets: History and basic notions (Tech. Rep. No. IRIT/99-27 R). Toulouse, France: Institut de Recherche en Informatique.

Dubois, D., & Prade, H. (1996). Using fuzzy sets in flexible querying: Why and how? In H. Christiansen, H. L. Larsen, & T. Andreasen (Eds.), Workshop on flexible query-answering systems (pp. 89-103), Roskilde, Denmark.

Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing.

Gazzotti, D., Piancastelli, L., Sartori, C., & Beneventano, D. (1995). FuzzyBase: A fuzzy logic aid for relational database queries. Paper presented at the 6th International Conference on Database and Expert Systems Applications, DEXA'95 (pp. 385-394), London, UK.

Goncalves, M., & Tineo, L. (2001a). SQLf flexible querying language extension by means of the norm SQL2. Paper presented at the 10th IEEE International Conference on Fuzzy Systems, Fuzz-IEEE 2001 (Vol. 1), Melbourne, Australia.

Goncalves, M., & Tineo, L. (2001b). SQLf3: An extension of SQLf with SQL3 features. Paper presented at the 10th IEEE International Conference on Fuzzy Systems, Fuzz-IEEE 2001 (Vol. 1), Melbourne, Australia.

Grabisch, M., Orlovski, S. A., & Yager, R. R. (1998). Fuzzy aggregation of numerical preferences. In R. Slowinski (Ed.), Fuzzy sets in decision analysis, operations research and statistics (pp. 31-68). Boston: Kluwer Academic Publishers.

Kacprzyk, J., & Zadrozny, S. (1995). FQUERY for ACCESS: Fuzzy querying for a Windows-based DBMS. In P. Bosc & J. Kacprzyk (Eds.), Fuzziness in database management systems (pp. 415-433). Heidelberg: Physica-Verlag.

Kacprzyk, J., & Zadrozny, S. (2001). Computing with words in intelligent database querying: Standalone and Internet-based applications. Information Sciences, 134, 71-109.

Medina, J. M., Pons, O., & Vila, M. A. (1994). GEFRED: A generalized model for fuzzy relational databases. Information Sciences, 76, 87-109.

Prade, H., & Testemale, C. (1984). Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences, 34, 115-143.
Projet BADINS: Bases de données multimédia et interrogation souple. (1995). Rennes, France: Institut de recherche en informatique et systèmes aléatoires.

Projet BADINS: Bases de données multimédia et interrogation souple. (1997). Rennes, France: Institut de recherche en informatique et systèmes aléatoires.

Rundensteiner, E., & Bic, L. (1991). Evaluating aggregates in possibilistic relational databases. Data & Knowledge Engineering, 7, 239-267.

Tudorie, C. (2004). Linguistic values on attribute subdomains in vague database querying. Journal on Transactions on Systems, 3(2), 646-650.

Tudorie, C. (2006a). Contributions to interfaces for database flexible querying. Doctoral thesis, University "Dunărea de Jos," Galaţi, Romania.

Tudorie, C. (2006b). Laboratory software tools for database flexible querying. Paper presented at the 2006 International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems, IPMU'06 (pp. 112-115).

Tudorie, C., & Dumitriu, L. (2004). How are the attribute linguistic domains involved in database fuzzy queries evaluation. Scientific Bulletin of "Politehnica" University of Timisoara, 49(63), 61-64.

Tudorie, C., Neacsu, C., & Manolache, I. (2005). Fuzzy queries in Romanian language: An intelligent interface. Annals of "Dunarea de Jos," III, 45-53.

Yager, R. R. (1991). Connectives and quantifiers in fuzzy sets. Fuzzy Sets and Systems, 40(1), 39-75.

Zadeh, L. A. (1975). The concept of a linguistic variable and its application to approximate reasoning (parts I, II, and III). Information Sciences, 8, 199-251, 301-357; 9, 43-80.
Yager, R. R., & Zadeh, L. A. (Eds.). (1992). Introduction to fuzzy logic applications in intelligent systems. Kluwer Academic Publishers.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.

Key Terms

Absolute Object Qualification: Including one gradual property (simple qualification) or a conjunction of gradual properties (multiqualification) in a query selection criterion. The fulfillment degree is computed according to the fuzzy sets describing the meaning of the linguistic qualifiers. A conjunctive aggregation operator is used to model the connective AND (in the case of multiqualification).

AMONG Operator: Fuzzy aggregation operator used for the evaluation of a query selection criterion based on a relative qualification. The algebraic model of the AMONG operator is:

µ_P AMONG S : R → [0,1]
µ_P AMONG S (t) = min( µ_P( a1 + ((b1 − a1)/(b1' − a1'))·(t.A1 − a1') ), µ_S(t.A2) )

where R is a relation; A1 and A2 are two attributes of R, A1 being defined on the interval [a1, b1]; t is a tuple; P and S are two gradual properties corresponding to the attributes A1 and A2; µ_P and µ_S are the membership functions defining the gradual properties P and S; and [a1', b1'] ⊆ [a1, b1] is the sub-interval of A1 corresponding to the table Q_S(R) (obtained by the first selection, on the attribute A2, using property S).

Context for Fuzzy Querying Interface: The pair formed by the target database and the knowledge base (containing the fuzzy model of the linguistic terms) corresponding to it.

Dynamic Model of the Context: Including in the database only the data necessary to dynamically define the linguistic terms, at the moment of (or during) the querying process.

Dynamic Model of the Linguistic Value: Automatic discovery of the linguistic value definitions from the actual content of the database. Appropriate algorithms can be implemented, based on a great advantage: by connecting directly to the database, one can easily obtain details regarding effective attribute domain limits, or distributions of the values. This procedure is generally useful instead of an off-line process of knowledge acquisition from a human expert; it is, however, mandatory in the relative qualification case.

Multiqualification: Including several gradual properties in a query selection criterion. They are independent of each other and have the same significance for the user's preferences. The fulfillment degree is computed according to the fuzzy sets describing the meaning of the linguistic qualifiers and to the conjunctive aggregation operator that models the connective AND.

Relative Object Qualification: Two gradual properties, as fuzzy conditions, are combined in a complex selection criterion such that one of them is applied on a subset of database rows already selected by the other; a dynamic definition of the linguistic value corresponding to the secondly evaluated condition is needed. The fulfillment degree is computed according to the fuzzy sets describing the meaning of the linguistic qualifiers and to the AMONG aggregation operator.

Static Model of the Context: Including in the database the static definitions of the linguistic terms (their fuzzy models), established a priori, before the querying process.

Unified Model of the Context: Modeling the context in a uniform style, as a single database, i.e., incorporating the fuzzy model of the linguistic terms, or their description, in the target database.
Chapter X
Evaluation of Quantified Statements Using Gradual Numbers

Ludovic Liétard, IRISA/IUT & IRISA/ENSSAT, France
Daniel Rocacher, IRISA/IUT & IRISA/ENSSAT, France
Abstract

This chapter is devoted to the evaluation of quantified statements, which can be found in many applications such as decision making, expert systems, or flexible querying of relational databases using fuzzy set theory. Its contribution is to introduce the main techniques used to evaluate such statements and to propose a new theoretical background for the evaluation of quantified statements of type “Q X are A” and “Q B X are A.” In this context, quantified statements are interpreted using an arithmetic on gradual numbers from ℕf, ℤf, and ℚf. It is shown that the context of fuzzy numbers provides a framework to unify previous approaches and can serve as the basis for the definition of new approaches.
Introduction

Linguistic quantifiers are quantifiers defined by linguistic expressions like “around 5” or “most of,” and many types of linguistic quantifiers can be found in the literature (Diaz-Hermida, Bugarin, & Barro, 2003; Glockner, 1997, 2004a, 2004b; Losada, Díaz-Hermida, & Bugarín, 2006), such as semi-fuzzy quantifiers, which allow modeling expressions like “there are twice as many men as women.” We limit this presentation to the original linguistic quantifiers defined by Zadeh (1983) and
the two types of quantified statements he proposes. Such linguistic quantifiers allow an intermediate attitude between the conjunction (expressed by the universal quantifier ∀) and the disjunction (expressed by the existential quantifier ∃). Two types of quantified statements can be distinguished. A statement of the first type is denoted “Q X are A” where Q is a linguistic quantifier, X is a crisp set and A is a fuzzy predicate. Such a statement means that “Q elements belonging to X satisfy A.” An example is provided by “most of employees are well-paid” where Q is most of and X is a set
of employees, whereas A is the condition to be well-paid. In this first type of quantified statement, the referential (denoted by X) for the linguistic quantifier is a crisp set (a set of employees in the example). A second type of quantified statement can be defined where the linguistic quantifier applies to a fuzzy referential. This is the case of the statement “most of young employees are well-paid” since most of applies to the fuzzy referential made of young employees. This statement means that most of the elements from this fuzzy referential (most of the young employees) can be considered well-paid. Such a quantified statement is written “Q B X are A” where A and B are two fuzzy predicates (when referring to the previous example, Q is most of, X is a set of employees, B is to be young while A is to be well-paid). Linguistic quantifiers can be used in many fields, and we briefly recall their use in multicriteria decision making, expert systems, linguistic summaries of data, and flexible querying of relational databases; some minor applications of linguistic quantifiers, such as in machine learning (Kacprzyck & Iwanski, 1992) or neural networks (Yager, 1992), are not dealt with. Multicriteria decision making consists mainly in finding optimal solutions to a problem defined by objectives and constraints. A solution must fulfill all objectives and must satisfy all constraints. The use of linguistic quantifiers in decision making (Fan & Chen, 2005; Kacprzyck, 1991; Malczewski & Rinner, 2005; Yager, 1983a) aims at retrieving solutions fulfilling Q objectives with respect to Q’ constraints, where Q and Q’ are either a linguistic quantifier or the universal quantifier. A typical formulation is then “find the solution where almost all objectives are achieved and where all constraints are satisfied.” The use of linguistic quantifiers in expert systems concerns mainly the expression and handling of logical propositions.
An example is provided by logical statements accepting exceptions. A typical statement accepting exceptions is the proposition “all Swedes are tall,” which can be turned into “almost all Swedes are tall,” involving the linguistic quantifier “almost all.” Many inferences involving
quantified statements are possible (Dubois, Godo, De Mantaras, & Prade, 1993; Dubois & Prade, 1988a; Laurent, Marsala, & Bouchon-Meunier, 2003; Loureiro Ralha & Ghedini Ralha, 2004; Mizumoto, Fukami, & Tanaka, 1979; Sanchez, 1988). It is possible to consider the following one, set in the probabilistic framework: if I know that “Karl is a Swede” and that “almost all Swedes are tall,” it is then possible to infer that the event “Karl is tall” is probable. The challenge is then to compute the degree of probability (which may be imprecise) attached to the event “Karl is tall.” Data summarization (Kacprzyck, Yager, & Zadrozny, 2006; Sicilia, Díaz, Aedo, & García, 2002) is another field where linguistic quantifiers can be helpful. Yager (1982) defines summaries expressed by expressions involving linguistic quantifiers (the summary of a database could be “almost the half of the young employees are well-paid”). The SummarySQL language (Rasmussen & Yager, 1997) has been proposed to define and evaluate linguistic summaries of data defined by quantified statements. As an example, it is possible to use this language to determine the validity (represented by a degree), on a given database, of the linguistic summary “almost the half of the young employees are well-paid.” Flexible querying of relational databases aims at expressing preferences in queries instead of Boolean requirements, as is the case for regular (or crisp) querying. Consequently, a flexible query returns a set of discriminated answers to the user (from the best answers to the least preferred). Many approaches to define flexible queries have been proposed, and it has been shown that the fuzzy set based approach is the most general (Bosc & Pivert, 1992).
Extensions of the SQL language, namely SQLf (Bosc & Pivert, 1995) and FSQL (Galindo, 2005, 2007; Galindo, Medina, Pons, & Cubero, 1998; Galindo, Urrutia, & Piattini, 2006), have been proposed to define sophisticated flexible queries calling on fuzzy sets (in this book, the reader can find a chapter by Urrutia, Tineo, and Gonzalez including a comparison between FSQL and SQLf). In this context, predicates are defined by fuzzy sets and are called fuzzy predicates, and they can be combined using various operators
such as generalized conjunctions and generalized disjunctions (respectively expressed by norms and co-norms) or using more sophisticated operators such as averages. A fuzzy predicate can also be defined by a quantified statement, as in the query “retrieve the firms where most of employees are well-paid.” After query evaluation, each firm is associated with a degree in [0,1] expressing its satisfaction with respect to the quantified statement of the first type: “most of employees are well-paid.” The higher this degree, the better the firm is as an answer. To evaluate a quantified statement is to determine the extent to which it is true. This chapter proposes a new theoretical framework to evaluate quantified statements of type “Q X are A” and “Q B X are A.” Propositions are based on the handling of gradual integers (from ℕf and ℤf) (Rocacher & Bosc, 2003a, 2003b) and gradual rational numbers (from ℚf) as defined in Rocacher and Bosc (2003c, 2005). These specific numbers express perfectly known but gradual quantities and differ from usual fuzzy numbers, which define imprecise (ill-known) numbers. The section titled Linguistic Quantifiers and Quantified Statements introduces the definition of quantified statements, while the section titled Previous Proposals for the Interpretation of Quantified Statements is a brief overview of the propositions made for the evaluation of quantified statements. Gradual numbers are introduced in the section titled Gradual Numbers and Gradual Truth Value, and the section titled Interpretation of Quantified Statements Using Gradual Numbers proposes to evaluate quantified statements using gradual numbers. In the following, we denote A(X) the fuzzy set made of elements from a crisp set X which satisfy a fuzzy predicate A (A(X) being defined by X ∩ A).
Linguistic Quantifiers and Quantified Statements

First order logic involves two quantifiers, the universal quantifier (∀) and the existential one (∃), which are too limited to model all natural language quantified sentences. For this reason, fuzzy quantifiers (Zadeh, 1983) have been introduced to represent linguistic expressions (many of, at least 3, etc.) and to refer to gradual quantities. It is possible to distinguish between absolute quantifiers (which refer to an absolute number such as about 3, at least 2, etc.) and relative quantifiers (which refer to a proportion such as about the half, at least a quarter, etc.). An absolute (resp. relative) quantifier Q in the statement “Q X are A” means that the number (resp. proportion) of elements satisfying condition A is compatible with Q. A linguistic quantifier can be increasing (resp. decreasing) (Yager, 1988), which means that an increase in the satisfaction of condition A cannot decrease (resp. increase) the truth value of the statement “Q X are A.” At least 3 and almost all (resp. at most 2, at most the half) are examples of increasing (resp. decreasing) quantifiers. A quantifier is monotonic when it is either increasing or decreasing, and it is also possible to point out unimodal quantifiers which refer to a quantity such as about the half, about 4, and so forth. The representation of an absolute quantifier is a fuzzy subset of the real line, while a relative quantifier is defined by a fuzzy subset of the unit interval [0,1]. In both cases, the membership degree µQ(j) represents the truth value of the statement “Q X are A” when j elements in X completely satisfy A, whereas A is fully unsatisfied by the others (j being a number or a proportion). In other words, the definition of a linguistic quantifier provides the evaluation of “Q X are A” in the case of a Boolean predicate. Consequently, the representation of an increasing (resp. decreasing) linguistic quantifier is an increasing (resp. decreasing) function µQ such that µQ(0) = 0 (resp. µQ(0) = 1) and ∃k such that µQ(k) = 1 (resp. ∃k such that µQ(k) = 0).

Example. Figure 1 describes the increasing relative linguistic quantifier almost all.

It is worth mentioning that, in the case of an absolute quantifier, a quantified statement of type “Q B X are A” reverts to the quantified statement of the other
Figure 1. A representation for the quantifier almost all (x-axis: proportion p, with ticks at 0.7, 0.8, 0.9, and 1; y-axis: µalmost all(p) from 0 to 1)
type: “Q X are (A and B).” As an example, “at least 3 young employees are well-paid” is equivalent to “at least 3 employees are (young and well-paid).” As a consequence, when dealing with quantified statements of type “Q B X are A,” this chapter only deals with relative quantifiers.
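To make these definitions concrete, here is a small Python sketch. The piecewise-linear shape chosen for almost all and the helper names are illustrative assumptions, not the chapter's exact figures; a relative quantifier is simply a function on [0,1] and an absolute one a function on the integers.

```python
def almost_all(p):
    # Hypothetical increasing relative quantifier "almost all" on [0,1]:
    # 0 for p <= 0.7, linear in between, 1 for p >= 0.9.
    if p <= 0.7:
        return 0.0
    if p >= 0.9:
        return 1.0
    return (p - 0.7) / 0.2

def at_least(k):
    # Crisp absolute increasing quantifier "at least k" on the integers.
    return lambda j: 1.0 if j >= k else 0.0

# mu_Q(j) gives the truth of "Q X are A" when exactly j elements (or the
# proportion j) fully satisfy A while the others do not satisfy it at all.
```

With these shapes, almost_all is an increasing function with µQ(0) = 0 and µQ(p) = 1 for p ≥ 0.9, as required of an increasing quantifier.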
Previous Proposals for the Interpretation of Quantified Statements

In this section, the main propositions suggested to determine the truth value of quantified statements are briefly overviewed. An in-depth study of quantified statement interpretations can be found in Liu and Kerre (1998a, 1998b); Delgado, Sanchez, and Vila (2000); Barro, Bugarin, Cariñena, and Diaz-Hermida (2003); or Diaz-Hermida, Bugarin, Cariñena, and Barro (2004). The subsection titled Quantified Statements of Type “Q X are A” is devoted to the evaluation of quantified statements of type “Q X are A,” whereas the subsection titled Quantified Statements of Type “Q B X are A” is devoted to those of type “Q B X are A.” A short conclusion about these proposals is provided in the subsection titled About the Proposed Approaches to Evaluate Quantified Statements.
Quantified Statements of Type “Q X are A”

Relative quantifiers are assumed hereafter; the adaptation to absolute quantifiers requires only the change of the quantity µQ(i/n) into µQ(i), with n the cardinality of the set X involved in the quantified statement. In the particular case of a Boolean predicate A, the evaluation of “Q X are A” is given by µQ(c) where c is the number of elements satisfying A. Some approaches (interpretations based on a precise and an imprecise cardinality) extend this definition to a fuzzy predicate A, assuming that the cardinality of a fuzzy set can be computed. Other approaches (using an OWA operator or a Sugeno fuzzy integral) are based on a relaxation principle which implies neglecting some elements. As an example, the interpretation of “almost all employees are young” means that some of the oldest employees can be (more or less) neglected before assessing the extent to which the remaining employees are young.

Interpretation Based on a Precise Cardinality

Zadeh (1983) suggests computing the precise cardinality of fuzzy set A (called sigma-count and denoted ∑Count(A)). The sigma-count is defined as the sum of membership degrees, and the degree of truth of “Q X are A” is then µQ(∑Count(A)/n) with n the cardinality of set X. The definition of ∑Count(A) implies that a large number of small µA(x) values has the same effect on the result as a small number of large µA(x) values. As a consequence, many drawbacks can be found, such as the one shown by the next example.

Example. Set X = {x1, x2, ..., x10} is such that ∀i, µA(xi) = 0.1. In this case, the result for “∃ X are A” is expected to be 0.1 (or at least extremely low). The existential quantifier ∃ is defined by µ∃(0) = 0 and ∀i > 0, µ∃(i) = 1, and the absolute quantified statement is evaluated by µ∃(∑Count(A)). Computations give ∑Count(A) = 1, which implies that the expression “∃ X are A” is entirely true (µ∃(1) = 1). This result is very far from the expected one. ♦
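The sigma-count evaluation and its drawback can be reproduced in a few lines (a Python sketch; `evaluate_absolute` is a hypothetical helper name):

```python
def sigma_count(degrees):
    # Zadeh's precise cardinality of a fuzzy set: the sum of membership degrees.
    return sum(degrees)

def evaluate_absolute(quantifier, degrees):
    # "Q X are A" with an absolute quantifier: mu_Q(SigmaCount(A)).
    return quantifier(sigma_count(degrees))

# The drawback from the example: ten elements, each satisfying A only to
# degree 0.1, still make the statement entirely true, because the small
# degrees sum up to (about) 1.
exists = lambda j: 1.0 if j > 0 else 0.0  # mu_E(0) = 0, mu_E(i) = 1 for i > 0
result = evaluate_absolute(exists, [0.1] * 10)
```

Here `result` is 1.0, whereas intuitively “∃ X are A” should be about 0.1.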
Interpretation Based on an Imprecise Cardinality

The method proposed in Prade (1990) involves two steps. The first one computes the imprecise cardinality πc of the set made of elements from X which satisfy A (it is a fuzzy number represented by a possibility distribution of integers). Then, quantifier Q is considered a vague predicate serving as a basis for a matching with πc. The result is a couple of degrees: the possibility and the necessity of the fuzzy event “πc is compatible with Q.” The imprecise cardinality of the set F of elements from X which satisfy A is given by the following possibility distribution (Dubois & Prade, 1985; Prade, 1990): let k be the number of elements of F whose degree is 1:

πc(k) = 1 (k may equal 0),
∀i < k, πc(i) = 0,
∀j > k, πc(j) is the jth largest value µF(x).
In the particular case where F is a usual set, πc describes a precise value (πc(k) = 1 and πc(i) = 0 ∀i ≠ k), which is the usual cardinality of this set. The possibility Π(Q ; πc) and the necessity N(Q ; πc) of the fuzzy event “πc is compatible with Q” are (Dubois, Prade, & Testemale, 1988b):

Π(Q ; πc) = max_{1≤i≤n} min(µQ(i/n), πc(i))

and

N(Q ; πc) = min_{1≤i≤n} max(µQ(i/n), 1 − πc(i)).
Figure 2. A representation for the quantifier almost all
Example. Let Q be the increasing relative quantifier almost all defined in Figure 2 and X = {x1, x2, x3, x4, x5, x6 , x7, x8 , x9, x10} with µA(x1) = µA(x2) = ... = µA(x7) = 1, µA(x8) = 0.9, µA(x9) = 0.7, µA(x10) = 0. We have: πc(7) = 1, πc(8) = 0.9 and πc(9) = 0.7, with: µalmost all(1/10) = ... = µalmost all(7/10) = 0, µalmost all(8/10) = 0.25, µalmost all(9/10) = µalmost all(1) = 1. The interpretation of “almost all X are A” leads to Exhibit A.
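This computation can be sketched in Python. The tabulated quantifier below simply encodes the values of almost all listed above (an assumption for illustration, as are the helper names):

```python
def almost_all_tenths(p):
    # Values of "almost all" at multiples of 1/10, as listed in the example:
    # 0 up to 0.7, 0.25 at 0.8, 1 from 0.9 on.
    table = {8: 0.25, 9: 1.0, 10: 1.0}
    return table.get(round(p * 10), 0.0)

def possibility_necessity(quantifier, degrees):
    # Prade's imprecise-cardinality evaluation of "Q X are A":
    # pi_c(k) = 1 for the number k of fully satisfying elements,
    # pi_c(j) = j-th largest membership degree for j > k, and 0 below k.
    n = len(degrees)
    mu = sorted(degrees, reverse=True)
    k = sum(1 for d in degrees if d == 1.0)
    pi_c = {i: 0.0 for i in range(n + 1)}
    pi_c[k] = 1.0
    for j in range(k + 1, n + 1):
        pi_c[j] = mu[j - 1]  # the j-th largest membership degree
    poss = max(min(quantifier(i / n), pi_c[i]) for i in range(1, n + 1))
    nec = min(max(quantifier(i / n), 1 - pi_c[i]) for i in range(1, n + 1))
    return poss, nec

degrees = [1.0] * 7 + [0.9, 0.7, 0.0]
poss, nec = possibility_necessity(almost_all_tenths, degrees)
```

Run on the example data, this yields the pair (0.7, 0) computed in Exhibit A.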
Interpretation by the OWA Operator

We assume that X = {x1, ..., xn} and µA(x1) ≥ µA(x2) ≥ ... ≥ µA(xn). The interpretation of “Q X are A” (Q being increasing) by an ordered weighted average (OWA operator) is given by (Yager, 1988):

∑_{i=1}^{n} wi · µA(xi),

where wi = µQ(i/n) − µQ((i−1)/n). Each weight wi represents the increase of satisfaction when comparing a situation where (i−1) elements are entirely A with a situation where i elements are entirely A (and the others are not at all A). This operator conveys a semantics of relaxation since the smaller wi, the more neglected µA(xi). An extension of the use of the OWA operator to decreasing quantifiers has been proposed by Yager (1993) and Bosc and Liétard (1993). The extension is based on the equivalence:

“Q X are A” ⇔ “Q′ X are ¬A,”

where Q′ is the antonym of the decreasing quantifier Q (Q′ is then an increasing quantifier given
Exhibit A.

Π(Q ; πc) = max(min(µalmost all(7/10), πc(7)), min(µalmost all(8/10), πc(8)), min(µalmost all(9/10), πc(9)))
= max(min(0, 1), min(0.25, 0.9), min(1, 0.7)) = 0.7,

N(Q ; πc) = min(max(µalmost all(7/10), 1 − πc(7)), max(µalmost all(8/10), 1 − πc(8)), max(µalmost all(9/10), 1 − πc(9)))
= min(max(0, 0), max(0.25, 0.1), max(1, 0.3)) = 0 ♦
by ∀p ∈ [0,1], µQ′(p) = µQ(1 − p)). It is then possible to use the initial proposition to interpret “Q′ X are ¬A.” In addition, when Q is not monotonic, this approach leads to the GD method introduced in the section titled The Probabilistic Approach (GD Method).
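The OWA interpretation above can be sketched as follows (Python; `owa_evaluation` is a hypothetical helper name):

```python
def owa_evaluation(quantifier, degrees):
    # OWA interpretation of "Q X are A" for an increasing relative quantifier:
    # weights w_i = mu_Q(i/n) - mu_Q((i-1)/n) are applied to the membership
    # degrees sorted in decreasing order; a small weight means the
    # corresponding degree is (more or less) neglected.
    n = len(degrees)
    mu = sorted(degrees, reverse=True)
    weights = [quantifier(i / n) - quantifier((i - 1) / n) for i in range(1, n + 1)]
    return sum(w * m for w, m in zip(weights, mu))
```

With the identity quantifier µQ(p) = p, all weights equal 1/n and the evaluation reduces to the mean of the membership degrees, which illustrates the averaging nature of the operator.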
The Probabilistic Approach (GD Method)

This method (Delgado et al., 2000) is based on the following imprecise cardinality of the fuzzy set A(X):

∀k ∈ {0, 1, 2, ..., n}, p(k) = bk − bk+1,

where n is the cardinality of set X and bk is the kth largest degree of belongingness of an element to the fuzzy set A(X) (with b0 = 1 and bn+1 = 0). A value p(k) can be interpreted as the probability that set A(X) contains k elements. The evaluation of a “Q X are A” statement with an absolute quantifier is:

∑_{k=0}^{n} p(k) × µQ(k).

When Q is relative, the evaluation becomes:

∑_{k=0}^{n} p(k) × µQ(k/n).
This interpretation is clearly the average value of the different values taken by the linguistic quantifier.
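A minimal Python sketch of the GD evaluation (helper names are illustrative):

```python
def gd_evaluation(quantifier, degrees, relative=True):
    # GD method: p(k) = b_k - b_{k+1}, where b_k is the k-th largest
    # membership degree (b_0 = 1, b_{n+1} = 0); the result is the expected
    # value of the quantifier over this probability distribution.
    n = len(degrees)
    b = [1.0] + sorted(degrees, reverse=True) + [0.0]
    return sum((b[k] - b[k + 1]) * quantifier(k / n if relative else k)
               for k in range(n + 1))

# Revisiting the sigma-count drawback: ten elements at degree 0.1 with the
# absolute existential quantifier now yield the expected low value.
exists = lambda j: 1.0 if j > 0 else 0.0
result = gd_evaluation(exists, [0.1] * 10, relative=False)
```

Here `result` is 0.1: the probability mass p(10) = 0.1 is the only one reaching a nonzero quantifier value, which matches the intuitively expected evaluation of “∃ X are A.”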
The ZS Method

The ZS method proposed in Delgado et al. (2000) considers the following fuzzy cardinality π of the fuzzy set A(X) of elements which satisfy predicate A: π(k) = 0 if there is no level cut α such that |A(X)α| = k; otherwise π(k) = sup{α such that |A(X)α| = k}. This fuzzy cardinality can be interpreted as a possibility distribution. The interpretation δ of the quantified statement “Q X are A” is the compatibility of the fuzzy quantifier Q with that fuzzy cardinality:

δ = max_{1≤k≤n} min(µQ(k), π(k)),
where n is the cardinality of set X. This evaluation clearly provides the possibility of the event “the cardinality satisfies Q” (as in the approach based on an imprecise cardinality introduced earlier). In addition, it is a generalization (Delgado et al., 2000) of the Sugeno fuzzy integral approach since, when Q is increasing, the ZS and the Sugeno integral methods lead to the same result.
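A brute-force Python sketch of the ZS evaluation (with an absolute quantifier, matching the formula above; helper names are illustrative):

```python
def zs_evaluation(quantifier, degrees):
    # ZS method: pi(k) = sup{alpha : |A(X)_alpha| = k} (0 when no such alpha),
    # then delta = max_k min(mu_Q(k), pi(k)). Since |A(X)_alpha| only changes
    # at the membership degrees, sampling alpha at those degrees suffices.
    pi = {}
    for alpha in {d for d in degrees if d > 0}:
        k = sum(1 for d in degrees if d >= alpha)
        pi[k] = max(pi.get(k, 0.0), alpha)
    return max(min(quantifier(k), p) for k, p in pi.items())

# "at least 2" elements satisfy A, with degrees 1, 0.8, and 0.5:
delta = zs_evaluation(lambda k: 1.0 if k >= 2 else 0.0, [1.0, 0.8, 0.5])
```

For these degrees, π(1) = 1, π(2) = 0.8, π(3) = 0.5, so δ = 0.8: the best level cut containing at least two elements.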
Interpretation Based on a Sugeno Fuzzy Integral

The interpretation of “Q X are A” (Q being increasing) by a Sugeno fuzzy integral (Bosc & Liétard, 1994a, 1994b; Ying, 2006) is given by:
δ = max_{1≤i≤n} min(µQ(i/n), µA(xi)),

where µA(x1) ≥ µA(x2) ≥ ... ≥ µA(xn). Due to the properties of the Sugeno fuzzy integral, δ states the existence of a subset C of X such that:

• each element in C is A with some concrete degree,
• subset C is in agreement with the linguistic quantifier Q.
Since Q is increasing, the more these two aspects are met, the higher the truth value for “Q X are A.” As an example, “almost all employees are young” is evaluated by the existence of a subset of young employees which gathers almost all the employees. More precisely, δ can also be defined by:

δ = max_{C ∈ P(X)} min(p1(C), p2(C)),

where P(X) denotes the powerset of X, p1(C) is defined by min_{x∈C} µA(x), and p2(C) is given by µQ(|C|/n) with n as the cardinality of set X. In addition, it can be demonstrated (Dubois et al., 1988b) that this interpretation can also be given by a weighted conjunction:

δ = min_{1≤i≤n} max(1 − wi, µA(xi)),

where wi = 1 − µQ((i−1)/n) is the importance given to degree µA(xi). Here again, the smaller wi, the more neglected µA(xi). This Sugeno fuzzy integral based evaluation is a particular case of a proposition (Bosc & Liétard, 2005) made in a more general framework to evaluate the extent to which an aggregate (computed on a fuzzy set; the cardinality in the case of a quantified statement) is confronted with a fuzzy predicate (a linguistic quantifier). So, it can be easily extended to any kind of linguistic quantifier.
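The Sugeno-integral evaluation can be sketched as follows (Python; assuming an increasing relative quantifier, with an illustrative helper name):

```python
def sugeno_evaluation(quantifier, degrees):
    # Sugeno-integral interpretation of "Q X are A" for an increasing
    # relative quantifier: delta = max_i min(mu_Q(i/n), mu_A(x_i)),
    # with the membership degrees sorted in decreasing order.
    n = len(degrees)
    mu = sorted(degrees, reverse=True)
    return max(min(quantifier(i / n), mu[i - 1]) for i in range(1, n + 1))

# Identity quantifier: the degree is limited both by the proportion i/n and
# by the i-th best satisfaction degree.
delta = sugeno_evaluation(lambda p: p, [1.0, 1.0, 0.5])
```

For these degrees the best compromise is reached at i = 2 (two elements fully A out of three), giving δ = 2/3.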
Quantified Statements of Type “Q B X are A”

This section presents the previous propositions for the interpretation of fuzzy quantified statements of
type “Q B X are A.” Here again, a relative quantifier Q is considered.
Interpretation with an OWA Operator

Yager (1988) suggests interpreting the expression “Q B X are A” by an ordered weighted averaging. Let X = {x1, ..., xn} with µB(x1) ≤ µB(x2) ≤ ... ≤ µB(xn) and:

∑_{i=1}^{n} µB(xi) = d.

The weights of the average are defined by wi = µQ(Si) − µQ(Si−1) with Si = (∑_{j=1}^{i} µB(xj))/d and S0 = 0. This operator aggregates the values of the implication µB(x) →K-D µA(x), where →K-D denotes Kleene-Dienes implication (a →K-D b = max(1 − a, b)). If the implication values ci are sorted in decreasing order c1 ≥ c2 ≥ ... ≥ cn, the interpretation of “Q B X are A” is:

∑_{i=1}^{n} ci · wi.
This calculus uses an OWA operator to aggregate implication values. As an example, the truth value obtained for “most of young employees are well-paid” is that of “for most of the employees, to be young implies to be well-paid.” The obtained result is far from the original meaning of the quantified statement.
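Yager's OWA interpretation of “Q B X are A” can be sketched as follows (Python; helper names are illustrative, and the Kleene-Dienes implication is used as in the formulas above):

```python
def owa_qbxa(quantifier, b_degrees, a_degrees):
    # OWA interpretation of "Q B X are A": the weights are built from the
    # increasingly sorted B-degrees, and the aggregated values are the
    # decreasingly sorted Kleene-Dienes implications max(1 - mu_B, mu_A).
    n = len(b_degrees)
    d = sum(b_degrees)
    s = [0.0]
    for b in sorted(b_degrees):  # mu_B(x1) <= ... <= mu_B(xn)
        s.append(s[-1] + b / d)  # S_i = (sum of the i smallest B-degrees) / d
    weights = [quantifier(s[i]) - quantifier(s[i - 1]) for i in range(1, n + 1)]
    impl = sorted((max(1 - b, a) for b, a in zip(b_degrees, a_degrees)),
                  reverse=True)
    return sum(w * c for w, c in zip(weights, impl))
```

For instance, with two elements fully B, only one of which is A, and the identity quantifier, the evaluation is 0.5: the average of the two implication values.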
Interpretation by Decomposition The interpretation by decomposition described in Yager (1983, 1984) is limited to increasing quantifiers. The proposition “Q B X are A” is true if an ordinary subset C of X satisfies the conditions p1 and p2 given hereafter: p1: there are Q elements B in C,
p2: each element x of C satisfies the implication: (x is B) → (x is A).
Table 1. Satisfaction degrees with respect to B and A

       x1   x2   x3
  B     1    1    1
  A     1    0    0

The truth value of the proposition “Q B X are A” is then defined by:
sup_{C ∈ P(X)} min(p1(C), p2(C)),

where p1(C) (resp. p2(C)) denotes the degree of satisfaction of C with respect to the condition p1 (resp. p2). The value p1(C) is defined by µQ(h) where h is the proportion of elements B in set C. Yager suggests the following definition of h (using ∑Counts):

h = (∑_{x∈C} µB(x)) / (∑_{x∈X} µB(x)).

The value of p2(C) is ∧_{x∈C} µB(x) → µA(x), where ∧ is any triangular norm and → a fuzzy implication. This interpretation leads to evaluating the quantified statement by an aggregation of implication values µB(x) → µA(x). Similarly to the OWA-based interpretation of “Q B X are A,” this interpretation is far from the original meaning of “Q B X are A.”
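For small sets, the decomposition can be evaluated by brute force over all crisp subsets (a Python sketch; the min t-norm and the Kleene-Dienes implication are chosen here as concrete instances of the generic ∧ and →):

```python
from itertools import combinations

def decomposition_evaluation(quantifier, b_degrees, a_degrees):
    # Decomposition interpretation of "Q B X are A" (increasing Q):
    # sup over crisp nonempty subsets C of min(p1(C), p2(C)), where
    # p1(C) = mu_Q(sum_{x in C} mu_B(x) / sum_{x in X} mu_B(x)) and
    # p2(C) = min_{x in C} max(1 - mu_B(x), mu_A(x))   (Kleene-Dienes, min norm)
    n = len(b_degrees)
    total_b = sum(b_degrees)
    best = 0.0
    for r in range(1, n + 1):
        for c in combinations(range(n), r):
            p1 = quantifier(sum(b_degrees[i] for i in c) / total_b)
            p2 = min(max(1 - b_degrees[i], a_degrees[i]) for i in c)
            best = max(best, min(p1, p2))
    return best
```

On the data of Table 1 with the identity quantifier, the best subset is {x1} alone (any subset containing x2 or x3 makes p2 collapse to 0), so the evaluation is 1/3.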
Proposition of Vila, Cubero, Medina, and Pons

According to this proposition (Vila et al., 1997), the degree of truth for “Q B X are A” is defined by:

δ = α · max_{x∈X} min(µA(x), µB(x)) + (1 − α) · min_{x∈X} max(µA(x), 1 − µB(x)),

where α is a degree of orness (Yager & Kacprzyck, 1997) computed from the linguistic quantifier:

α = ∑_{i=1}^{n} ((n − i)/(n − 1)) · (µQ(i/n) − µQ((i − 1)/n)).
The interpretation of “Q B X are A” is a degree set between the truth value of “∃ B X are A” (given by max_{x∈X} min(µA(x), µB(x))) and that of “∀ B X are A” (given by min_{x∈X} max(µA(x), 1 − µB(x))). The closer α is to one, the more “Q B X are A” is interpreted as “∃ B X are A.”

Example. Let us consider X = {x1, x2, x3} where the satisfaction degrees with respect to predicates B and A are given by Table 1. The value of α is given by:

α = 1 · (µalmost all(1/3) − µalmost all(0)) + 1/2 · (µalmost all(2/3) − µalmost all(1/3)) + 0 · (µalmost all(1) − µalmost all(2/3)).

The linguistic quantifier almost all is such that µalmost all(0) = 0, µalmost all(1/3) = 0.2, µalmost all(2/3) = 0.8 and µalmost all(1) = 1, and we get:

α = 1 · (0.2 − 0) + 1/2 · (0.8 − 0.2) + 0 · (1 − 0.8) = 0.2 + 0.3 = 0.5.

The final result is then:

δ = α · max_{x∈X} min(µA(x), µB(x)) + (1 − α) · min_{x∈X} max(µA(x), 1 − µB(x)) = 0.5 · 1 + (1 − 0.5) · 0 = 0.5.
As a consequence, “almost all B X are A” is true at degree 0.5 which is far from the expected result (since the proportion of A elements among the B elements is 1/3 and µalmost all(1/3) = 0.2). ♦
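The proposition of Vila et al., including the example above, can be sketched as follows (Python; the tabulated quantifier encodes the example's values of almost all, and helper names are illustrative):

```python
def vila_evaluation(quantifier, b_degrees, a_degrees):
    # Vila et al.: an orness-weighted compromise between the existential and
    # universal readings of "Q B X are A".
    n = len(b_degrees)
    orness = sum(((n - i) / (n - 1)) * (quantifier(i / n) - quantifier((i - 1) / n))
                 for i in range(1, n + 1))
    exist = max(min(a, b) for a, b in zip(a_degrees, b_degrees))
    univ = min(max(a, 1 - b) for a, b in zip(a_degrees, b_degrees))
    return orness * exist + (1 - orness) * univ

# Table 1 data with the example's almost all, tabulated at thirds:
almost_all_thirds = lambda p: [0.0, 0.2, 0.8, 1.0][round(p * 3)]
delta = vila_evaluation(almost_all_thirds, [1.0, 1.0, 1.0], [1.0, 0.0, 0.0])  # ≈ 0.5
```

As in the example, the orness is 0.5 and the final degree is 0.5, even though the actual proportion of A elements among the B elements only supports a degree of 0.2.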
The GD Method for “Q B X are A” Statements

Delgado et al. (2000) propose a probabilistic view of the proportion of A elements among the B elements. Computations are related to the two fuzzy sets B(X) and A(X) ∩ B(X). In addition, when the fuzzy sets B(X) and A(X) ∩ B(X) are not normal, they should be normalized (using any technique). The set S = {α1, α2, …, αm} is the set made of the different satisfaction degrees of elements from X with respect to the fuzzy conditions B and A ∩ B (it is considered that 1 = α1 > α2 > … > αm > αm+1 = 0) and P the set of the different proportions provided by the α-cuts:

P = { |A(X)α ∩ B(X)α| / |B(X)α| where α is in S }.

If we denote P⁻¹(c) the set of levels from S having c as relative cardinality (c being in P):

P⁻¹(c) = { αi from S such that |A(X)αi ∩ B(X)αi| / |B(X)αi| = c },

the probability p(c) for a proportion c (in [0,1]) to represent |A(X) ∩ B(X)| / |B(X)| is defined by:

p(c) = ∑_{αi ∈ P⁻¹(c)} (αi − αi+1).

The evaluation of “Q B X are A” is then:

∑_{c ∈ P} p(c) × µQ(c).

About the Proposed Approaches to Evaluate Quantified Statements

Some properties to be verified by any technique to evaluate quantified statements of type “Q X are A” and “Q B X are A” have been proposed in the literature (Blanco, Delgado, Martín-Bautista, Sánchez, & Vila, 2002; Delgado et al., 2000), and it is possible to situate the different propositions with respect to these properties. At first, these properties are introduced, and then the evaluation of quantified statements is discussed. Concerning “Q X are A” statements, the following properties can be considered:
Property 1. If predicate A is crisp, the evaluation must deliver µQ(|A(X)|) in the case of an absolute quantifier and µQ(|A(X)|/n) in the case of a relative quantifier (where A(X) is the crisp set made of elements from X which satisfy A and n is the cardinality of the crisp set X).

Property 2. The evaluation is coherent with the universal and existential quantifiers. It means the evaluation of “Q X are A” is ∨_{x∈X} µA(x) when Q is ∃ and ∧_{x∈X} µA(x) when Q is ∀ (∨ and ∧ being respectively a co-norm and a norm).

Property 3. The evaluation is coherent with quantifier inclusion. Given two quantifiers Q and Q’ such that Q ⊆ Q’ (∀x, µQ(x) ≤ µQ’(x)), the evaluation of “Q X are A” cannot be larger than that of “Q’ X are A.”

Concerning the “Q B X are A” statements, it is possible to recall:

Property 4. If A and B are crisp and Q is relative, the evaluation must deliver µQ(|A(X) ∩ B(X)|/|B(X)|), where A(X) (resp. B(X)) is the set made of elements from X which satisfy A (resp. B).
Property 5. When B is a Boolean predicate, the evaluation of “Q B X are A” is similar to that of “Q B(X) are A” where B(X) is the (crisp) set made of elements from X which satisfy B.

Property 6. If the set of elements which are B is included in the set of A elements, Q is relative and B is normalized, then the evaluation of “Q B X are A” is µQ(1) (since 100% of the B elements are A due to the inclusion).

Property 7. If A(X) ∩ B(X) = ∅ (where A(X) (resp. B(X)) is the set made of elements from X which satisfy A (resp. B)), then the evaluation must return the value µQ(0).

When considering the evaluation of “Q X are A” statements, the approaches based on cardinalities deliver a result which can be difficult to interpret. In the case of a precise cardinality, the main drawback is that a large number of elements with small membership degrees may have the same effect on the result as a small number of elements with large membership degrees. As a consequence, property 2 cannot be satisfied (this behavior is demonstrated in Delgado et al., 2000). In addition, as shown in Delgado et al. (2000), properties 1 and 3 are satisfied. In the case of an imprecise cardinality, the result of the interpretation is imprecise since it takes the form of two indices: a degree of possibility and a degree of necessity. This imprecision attached to the result is difficult to justify: the computations take into account a precise quantifier and precise degrees of satisfaction, so why should they deliver an imprecise result? By contrast, the approaches to evaluate “Q X are A” using a relaxation mechanism provide a result with a clear meaning which is easy to interpret. These approaches (including the ZS technique) satisfy properties 1, 2, and 3 (Delgado et al., 2000). When considering the evaluation of “Q B X are A” statements, the approaches based on the OWA operator and on a decomposition technique modify the meaning of the quantified statement, since “Q B X are A” is interpreted as “for
Q elements in X, to satisfy B implies to satisfy A.” These two approaches satisfy properties 4 and 5, while properties 6 and 7 are not fulfilled (Delgado et al., 2000). The approach proposed by Vila et al. (1997) interprets the quantified statement by a compromise between “∃ B X are A” and “∀ B X are A.” As a consequence, it may lead to a result which does not fit the quantifier’s definition, and none of the properties introduced in this section can be satisfied (Delgado et al., 2000). The GD method satisfies all properties (properties 4, 5, 6, 7). The next sections show that the framework of gradual numbers offers powerful tools to evaluate quantified statements. This context allows unifying the previous propositions made to evaluate quantified statements of type “Q X are A” (and based on a relaxation mechanism). In addition, gradual numbers offer new techniques to evaluate “Q B X are A” statements.
Gradual Numbers and Gradual Truth Value

It has been shown (Rocacher, 2003) that dealing with both quantification and preferences defined by fuzzy sets leads to defining gradual natural integers (elements of ℕf) corresponding to fuzzy cardinalities. Then, ℕf has been extended to ℤf (the set of gradual relative integers) and ℚf (the set of gradual rationals) in order to deal with queries based on difference or division operations (Rocacher & Bosc, 2005). These new frameworks provide arithmetic foundations where differences or ratios between gradual quantities can be evaluated. As a consequence, gradual numbers are essential, in particular, for dealing with flexible queries using absolute or relative fuzzy quantifiers. This is the reason why this section briefly introduces ℕf, the set of gradual integers, and its extensions ℤf and ℚf. Then, it is shown that applying a fuzzy predicate to a gradual number provides a specific truth value which is also gradual.
Evaluation of Quantified Statements Using Gradual Numbers
Gradual Natural Integers

The fuzzy cardinality |F| of a fuzzy set F, as proposed by Zadeh (1983), is a fuzzy set on ℕ, called FGCount(F), defined by: ∀ n ∈ ℕ, µ|F|(n) = sup{α | |Fα| ≥ n}, where Fα denotes an α-cut of fuzzy set F. The degree α associated with a number n in the fuzzy cardinality |F| is interpreted as the extent to which F has at least n elements. It is a normalized fuzzy set of integers and the associated characteristic function is nonincreasing.

Example. The fuzzy cardinality of the fuzzy set F = {1/x1, 1/x2, 0.8/x3, 0.6/x4} is: |F| = {1/0, 1/1, 1/2, 0.8/3, 0.6/4}. The amount of data in F is completely and exactly described by {1/0, 1/1, 1/2, 0.8/3, 0.6/4}. Degree 0.8 is the extent to which F contains at least three elements. ♦

It is very important to notice that we do not interpret a fuzzy cardinality as a fuzzy number based on a possibility distribution (which has a disjunctive interpretation). In fact, the knowledge of the cardinalities of all the different α-cuts of a fuzzy set F provides an exact characterization of the number of elements belonging to F. Consequently, |F| must be viewed as a conjunctive fuzzy set of integers. As a matter of fact, the considered fuzzy set F represents a perfectly known collection of data (without uncertainty), so its cardinality |F| is also perfectly known. We think that it is more convenient to qualify such a cardinality as a “gradual” number rather than a “fuzzy” number. Other fuzzy cardinalities based on the definition of FGCounts, such as FLCounts or FECounts, have been defined by Zadeh (1983) or Wygralak (1999). Dubois and Prade (1985) and Delgado et al. (2000) have adopted a possibilistic point of view where a fuzzy cardinality is interpreted as a possibility distribution over α-cuts corresponding to a fuzzy number (Dubois & Prade, 1987). The rest of this chapter is based on such a fuzzy cardinality defined as an FGCount,
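The FGCount construction is mechanical enough to be sketched in code. The following Python fragment (the function name and data representation are ours, not from the chapter) computes the fuzzy cardinality of a fuzzy set given as a list of membership degrees:

```python
def fgcount(degrees):
    """Fuzzy cardinality (FGCount) of a fuzzy set given its membership degrees.

    Returns {n: alpha} where alpha = sup{a | |F_a| >= n}: after sorting the
    degrees in decreasing order, the n-th largest degree is the level at which
    the fuzzy set still has at least n elements.
    """
    card = {0: 1.0}  # any fuzzy set has at least 0 elements at degree 1
    for n, d in enumerate(sorted(degrees, reverse=True), start=1):
        card[n] = d
    return card

# Chapter example: F = {1/x1, 1/x2, 0.8/x3, 0.6/x4}
print(fgcount([1.0, 1.0, 0.8, 0.6]))
# {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.8, 4: 0.6}
```

Sorting the degrees in decreasing order directly yields the nonincreasing characteristic function described above.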
and the set of all fuzzy cardinalities is called ℕf (the set of gradual natural integers). The α-cut xα of a gradual natural integer x is an integer defined as the highest integer value appearing in the description of x with a degree at least equal to α. In other words, it is the largest integer appearing in the α-level cut of its representation: xα = max{c ∈ ℕ | µx(c) ≥ α}. When x describes the FGCount of a fuzzy set A, the following equality holds: xα = |Aα|.

This approach is along the line presented by Dubois and Prade (2005), where they introduce the concept of a fuzzy element e in a set S, defined as an assignment function ae from L−{0}, where L is a complete lattice, to S. Following this view, a gradual natural integer x belonging to ℕf can be defined by an assignment function ax from ]0, 1] to ℕ such that: ∀ α ∈ ]0, 1], ax(α) = xα. If x is identified with the fuzzy cardinality |F| of a fuzzy set F, then ax(α) is the cardinality of the α-level cut of F.

Example. |F| = {1/0, 1/1, 1/2, 0.8/3, 0.6/4} is a gradual natural integer defined by an assignment function a|F| graphically represented by Figure 3. As an example, a|F|(0.7) = |F0.7| = 3. ♦

Figure 3. The assignment function of a fuzzy cardinality

Any operation # between two natural integers can then be extended to gradual natural integers x and y (Rocacher & Bosc, 2005) by defining the corresponding assignment function ax#y as follows: ∀ α ∈ ]0, 1], ax#y(α) = ax(α) # ay(α) = xα # yα. Due to this specific characterization of gradual integers, it can easily be shown that ℕf has a semiring structure: (ℕf, +) is a commutative monoid (+ is closed and associative) with the neutral element {1/0}; (ℕf, ×) is a monoid with the neutral element {1/0, 1/1}; the product is distributive over the addition.

Gradual Relative Integers

In ℕf the difference between two gradual natural integers may not be defined. As a consequence, ℕf has to be extended to ℤf in order to build up a group structure. The set of gradual relative integers ℤf is defined by the quotient set (ℕf × ℕf) / ℛ of all equivalence classes on (ℕf × ℕf) with regard to the equivalence relation ℛ characterized by: ∀ (x+, x−) ∈ ℕf × ℕf, ∀ (y+, y−) ∈ ℕf × ℕf, (x+, x−) ℛ (y+, y−) iff x+ + y− = x− + y+. The α-cut of a fuzzy relative integer (x+, x−) is defined as the relative integer x+α − x−α. Any fuzzy relative integer x has a unique canonical representative xc which can be obtained by enumerating the values of its different α-cuts on ℤ:

xc = Σ αi / (x+αi − x−αi),

where the αi correspond to the different degrees appearing in the representations of x+ and x−. Each value xα can be computed from the canonical representation: xα is the value associated in xc with β, where β is the smallest degree of xc larger than or equal to α. The assignment function ax of x is a function from ]0, 1] to ℤ such that:

∀ α ∈ ]0, 1], ax(α) = x+α − x−α = xα.

Example. The compact denotation of the fuzzy relative integer (x, y) (with x = {1/0, 1/1, 0.8/2, 0.5/3, 0.2/4} and y = {1/0, 1/1, 0.9/2}) is: (x, y)c = {1/0, 0.9/−1, 0.8/0, 0.5/1, 0.2/2}c. As an example, for a level of 0.9 we get x0.9 = 1 while y0.9 = 2. As a consequence, the α-cut of (x, y) at level 0.9 is x0.9 − y0.9 = −1. The assignment function of (x, y) is represented by Figure 4. ♦

Figure 4. Assignment function of the gradual relative integer (x, y)

If x and y are two gradual relative integers, the addition x + y and the multiplication x × y are respectively defined by the classes (x+ + y+, x− + y−) and ((x+ × y+) + (x− × y−), (x+ × y−) + (x− × y+)). The addition is commutative, associative, and has a neutral element, denoted by 0ℤf, defined by the class {(x, x) / x ∈ ℕf}. Each fuzzy relative integer (x+, x−) has an opposite, denoted by −x = (x−, x+). This is remarkable because, in the framework of usual fuzzy numbers, this property is not always satisfied. It can easily be checked that the product in ℤf is commutative, associative, and distributive over the addition. The neutral element of the product is the fuzzy relative integer ({1/0, 1/1}, {1/0}). Therefore we conclude that (ℤf, +, ×) forms a ring.
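Since operations on gradual integers act level-wise on α-cuts, this arithmetic is easy to prototype. A minimal Python sketch (representation and helper names are ours) which reproduces the compact denotation of the example above when # is the difference:

```python
def alpha_cut(x, alpha):
    """x_alpha = max{c | mu_x(c) >= alpha}, x given as {value: degree} (an FGCount)."""
    return max(c for c, d in x.items() if d >= alpha)

def extend(op, x, y):
    """Extend a crisp operation # to gradual integers: (x # y)_alpha = x_alpha # y_alpha.

    Returns the list of (alpha, value) pairs at the levels appearing in x or y,
    i.e., the compact denotation of the result.
    """
    levels = sorted(set(x.values()) | set(y.values()), reverse=True)
    return [(a, op(alpha_cut(x, a), alpha_cut(y, a))) for a in levels]

# The chapter's example: the difference leaves the naturals, which motivates Zf.
x = {0: 1.0, 1: 1.0, 2: 0.8, 3: 0.5, 4: 0.2}
y = {0: 1.0, 1: 1.0, 2: 0.9}
print(extend(lambda a, b: a - b, x, y))
# [(1.0, 0), (0.9, -1), (0.8, 0), (0.5, 1), (0.2, 2)]
```

The output is exactly the compact denotation (x, y)c = {1/0, 0.9/−1, 0.8/0, 0.5/1, 0.2/2} computed in the example.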
Gradual Rational Numbers

The question is now to define an inverse for each gradual integer and to build up the set of gradual rational numbers. We define ℤf* as the set of gradual integers x such that ∀ α ∈ ]0, 1], xα ≠ 0, and ℛ′ as the equivalence relation such that: ∀ (x, y) and (x′, y′) ∈ ℤf × ℤf*, [x, y] ℛ′ [x′, y′] iff x × y′ = x′ × y. The set of fuzzy rational numbers ℚf is defined by the quotient set (ℤf × ℤf*) / ℛ′. A fuzzy rational number [x, y] can also be represented by a simpler compact representation (denoted by xc) obtained by enumerating the values associated with the different α-cuts, which are rationals. The assignment function ax of [x, y] is a function from ]0, 1] to ℚ defined by:

∀ α ∈ ]0, 1], ax(α) = reduce(xα / yα),

where the operator reduce means that the rational is reduced to its canonical form.

Gradual Truth Value

This section proposes a computation to determine the truth value obtained when applying a fuzzy predicate to a gradual number. Let x be an element of ℕf, ℤf or ℚf; its assignment function ax is defined by: ∀ α ∈ ]0, 1], ax(α) = xα. If T is a fuzzy predicate, the application of T to x produces a global satisfaction S (called a gradual truth value) characterized by the assignment function defined by:

∀ α ∈ ]0, 1], aS(α) = T(xα) = T(ax(α)).

For a given level α, aS(α) represents the satisfaction of the corresponding α-cut of the fuzzy number. In other words, for a given level α, the fuzzy number satisfies predicate T at degree aS(α).

Example. We consider the fuzzy predicate high defined by Figure 5. The number of young employees is the gradual integer x = {1/15, 0.7/20, 0.2/25} (which means that 15 employees are completely young, 5 employees have the same age and are young at the level 0.7, whereas 5 other people are rather not young since their level of youth is estimated at 0.2). The application of the predicate high to the gradual integer x produces a global satisfaction S whose assignment function is defined by ∀ α ∈ ]0, 1], aS(α) = µhigh(xα). We get the gradual truth value given by Figure 6. ♦

Figure 5. The fuzzy predicate high

Figure 6. Gradual truth value corresponding to a global satisfaction S

This gradual truth value shows the different results associated with the different α-cuts. Considering level 0.8 in this example, the fuzzy cardinality x states that the cardinality of this α-cut is 15 (x0.8 = 15). Since µhigh(15) = 0.25, this cardinality satisfies high at degree 0.25. It can be checked that aS(0.8) = 0.25.
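The composition aS = T ∘ ax is straightforward to sketch. In the fragment below, the membership function of high is a hypothetical linear shape chosen so that µhigh(15) = 0.25, consistent with the check above (the chapter's Figure 5 is not reproduced here):

```python
def mu_high(n, lo=10, hi=30):
    """Hypothetical membership function for 'high': 0 below lo, 1 above hi,
    linear in between. The bounds are assumptions chosen so that mu_high(15) = 0.25."""
    if n <= lo:
        return 0.0
    if n >= hi:
        return 1.0
    return (n - lo) / (hi - lo)

def gradual_truth(pred, x):
    """a_S(alpha) = pred(x_alpha): apply a fuzzy predicate level-wise to a
    gradual number given by its assignment function {alpha: x_alpha}."""
    return {alpha: pred(v) for alpha, v in x.items()}

x = {1.0: 15, 0.7: 20, 0.2: 25}  # assignment function of x = {1/15, 0.7/20, 0.2/25}
print(gradual_truth(mu_high, x))
# {1.0: 0.25, 0.7: 0.5, 0.2: 0.75}
```

In particular aS is constant (0.25) on ]0.7, 1], which matches aS(0.8) = 0.25 above.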
Interpretation of Quantified Statements Using Gradual Numbers

The section titled Quantified Statements of Type “Q X are A” considers the evaluation of a quantified statement of type “Q X are A,” while the section titled Quantified Statements of Type “Q B X are A” Where Q Is Relative is interested in the evaluation of statements of type “Q B X are A,” where Q is relative. Each of these computations provides a gradual truth value. As a consequence, the section titled A Scalar Truth Value for the Interpretation proposes a scalar interpretation computed from this gradual truth value.
Quantified Statements of Type “Q X are A”

The gradual cardinality of the fuzzy set A(X) made of elements from X which satisfy A is an FGCount, denoted c, which belongs to ℕf. When Q is absolute, the gradual truth value for “Q X are A” is given by the satisfaction of a fuzzy condition (a constraint represented by the quantifier) by that gradual number. As described in the section titled Gradual Truth Value, we get:

∀ α ∈ [0, 1], µS(α) = µQ(c(α)).

From the definition of the FGCount, we get: ∀ α ∈ [0, 1], µS(α) = µQ(|A(X)α|). In other words, the fuzzy truth value S expresses the satisfaction of each α-cut of A(X) with respect to the linguistic quantifier.

In case of a relative linguistic quantifier, the truth value of “Q X are A” is given by the satisfaction of the linguistic quantifier by the proportion of elements which are A. We get:

∀ α ∈ [0, 1], µS(α) = µQ(c(α)/n) = µQ(|A(X)α|/n),

where n is the cardinality of set X.

Example. We consider the statement “about 3 X are A” where X = {x1, x2, x3, x4} such that µA(x1) = µA(x2) = 1, µA(x3) = 0.8, µA(x4) = 0.6. The linguistic quantifier about 3 is given by Figure 7. The gradual truth value for “about 3 X are A” (defined by: ∀ α ∈ [0, 1], µS(α) = µQ(c(α))) is given by Figure 8. This gradual truth value provides the satisfaction obtained for the different α-cuts of A(X) (the set made of elements from X which satisfy the fuzzy condition A). As an example, µS(0.7) = µQ(|A(X)0.7|) = µQ(3) = 1. ♦

Figure 7. A representation for the quantifier about 3

Figure 8. A fuzzy truth value for “about 3 X are A”
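The computation of this gradual truth value can be sketched as follows, with about 3 modeled as a triangular function (our reading of Figure 7, an assumption):

```python
def mu_about_3(n):
    """Triangular quantifier 'about 3' (assumed shape): 1 at n = 3,
    linearly decreasing to 0 at n = 1 and n = 5."""
    return max(0.0, 1.0 - abs(n - 3) / 2.0)

def truth_q_x_are_a(mu_q, degrees):
    """Gradual truth value of 'Q X are A' for an absolute quantifier Q:
    mu_S(alpha) = mu_Q(|A(X)_alpha|) at each membership level alpha."""
    levels = sorted(set(degrees), reverse=True)
    return {a: mu_q(sum(1 for d in degrees if d >= a)) for a in levels}

# Chapter example: mu_A(x1) = mu_A(x2) = 1, mu_A(x3) = 0.8, mu_A(x4) = 0.6
print(truth_q_x_are_a(mu_about_3, [1.0, 1.0, 0.8, 0.6]))
# {1.0: 0.5, 0.8: 1.0, 0.6: 0.5}
```

The result agrees with the worked value µS(0.7) = µQ(3) = 1, since µS is constant on ]0.6, 0.8].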
Quantified Statements of Type “Q B X are A” Where Q Is Relative
The truth value of “Q B X are A” (Q being relative) is given by the satisfaction of the linguistic quantifier by the proportion of elements which are A among the elements which satisfy B. This proportion is a ratio between two gradual integers: p = c/d, where:

•	c is the cardinality (FGCount) of the fuzzy set (A∩B)(X) made of elements from X which satisfy both fuzzy conditions A and B (∀ x in X, µ(A∩B)(X)(x) = min(µA(x), µB(x))),
•	d is the cardinality (FGCount) of the fuzzy set B(X) made of elements from X which satisfy fuzzy condition B.

The gradual rational number c/d is defined by the couple (c, d). A canonical representation for c/d is: ∀ α ∈ [0, 1], p(α) = c(α)/d(α). This canonical definition is valid only when d(α) ≠ 0. Since c (resp. d) is the cardinality of the fuzzy set (A∩B)(X) (resp. B(X)), we get: ∀ α ∈ [0, 1], p(α) = |(A∩B)(X)α|/|B(X)α|, where |B(X)α| ≠ 0. It means that p(α) is not defined for α > max x∈X µB(x), and we can write: ∀ α ∈ [0, max x∈X µB(x)], p(α) = |(A∩B)(X)α|/|B(X)α|.

The gradual truth value for “Q B X are A” is given by the satisfaction of the constraint represented by the quantifier for that gradual proportion. According to the results introduced in the section titled Gradual Truth Value, a gradual truth value S is obtained:

∀ α ∈ [0, max x∈X µB(x)], µS(α) = µQ(p(α)) = µQ(|(A∩B)(X)α|/|B(X)α|).

The fuzzy truth value S expresses the satisfaction of each α-cut of A(X) and (A∩B)(X) with respect to the linguistic quantifier. The value α is viewed as a quality threshold for the satisfactions with respect to A and B. When the minimum is chosen as the norm to define (A∩B)(X), the value of µS(α) states that “among the elements which satisfy B at least at level α, the proportion of elements x with µA(x) ≥ α is in agreement with Q” (since we have (A∩B)(X)α = A(X)α ∩ B(X)α). In other words, µS(α) is the truth value of the quantified statement when considering the two interpretations A(X)α and B(X)α. In addition, the fuzzy truth value S is not defined when α > max x∈X µB(x). A first attitude is to normalize B and A∩B, or to employ the degree of orness defined by Yager and Kacprzyk (1997) so that µS(α) = orness(Q) in that case. A second attitude, which will be considered in this chapter, is to assume that µS(α) = 0 in that case.
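The gradual proportion p = c/d and the resulting truth value can be sketched in the same style (helper names are ours; about half is modeled as a triangular function chosen so that µQ(2/3) = 1/3, consistent with the worked example below):

```python
def mu_about_half(p):
    """Triangular quantifier 'about half' (assumed shape): 1 at p = 0.5,
    0 outside ]0.25, 0.75[; chosen so that mu(2/3) = 1/3 as in the chapter."""
    return max(0.0, 1.0 - abs(p - 0.5) / 0.25)

def truth_qb(mu_q, mu_a, mu_b):
    """Gradual truth value of 'Q B X are A' (Q relative), minimum as norm:
    mu_S(alpha) = mu_Q(|(A inter B)(X)_alpha| / |B(X)_alpha|).
    Levels above max mu_B(x) get the value 0 (the chapter's second attitude)."""
    mu_ab = [min(a, b) for a, b in zip(mu_a, mu_b)]
    out = {}
    for alpha in sorted(set(mu_ab) | set(mu_b), reverse=True):
        d = sum(1 for b in mu_b if b >= alpha)   # |B(X)_alpha|
        c = sum(1 for m in mu_ab if m >= alpha)  # |(A inter B)(X)_alpha|
        out[alpha] = mu_q(c / d) if d else 0.0
    return out

# Data of Table 2: mu_B = 1, 0.9, 0.7, 0.3 and mu_A = 0.8, 0.3, 1, 1
r = truth_qb(mu_about_half, [0.8, 0.3, 1.0, 1.0], [1.0, 0.9, 0.7, 0.3])
print(r)  # mu_S(0.7) = mu_Q(2/3) = 1/3 and mu_S(0.8) = mu_Q(1/2) = 1
```

The exact shape of the quantifier is our assumption; only the level-wise computation of the proportion reflects the definitions above.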
Example. We consider the statement “about half B X are A” where X = {x1, x2, x3, x4}. The satisfaction degrees are given by Table 2. The linguistic quantifier about half is given by Figure 9. The gradual truth value for “about half B X are A” is given by Figure 10. As an example, we get µS(0.6) = 1/3 because |(A∩B)(X)0.6|/|B(X)0.6| = 2/3 and µQ(2/3) = 1/3. The truth value of the statement “about half elements in {x such that µB(x) ≥ 0.6} are in {x such that µA(x) ≥ 0.6}” is 1/3. ♦

Table 2. Satisfaction with respect to B and A

             x1     x2     x3     x4
µB(xi)       1      0.9    0.7    0.3
µA(xi)       0.8    0.3    1      1
µA∩B(xi)     0.8    0.3    0.7    0.3

Figure 9. A representation for the quantifier about half

Figure 10. A fuzzy truth value for “about half B X are A”

A Scalar Truth Value for the Interpretation

The fuzzy truth value S computed in the previous section gathers the satisfactions of the different α-cuts with respect to the linguistic quantifier. This fuzzy truth value can be defuzzified in order to obtain a scalar evaluation (in [0, 1]). Various interpretations can be associated with this defuzzification and we consider the following one (since it is the most natural): “the more α-cuts highly satisfy the constraint defined by the linguistic quantifier, the higher the scalar interpretation.” Obviously, when the scalar interpretation is 1, each α-cut fully satisfies the constraint. When dealing with a quantified statement of type “Q X are A,” a scalar evaluation of 1 means that whatever the chosen interpretation for A(X) (the set made of elements from X which satisfy A), its cardinality is in agreement with the linguistic quantifier (i.e., ∀ α, µQ(|A(X)α|) = 1 or µQ(|A(X)α|/n) = 1). Otherwise, the higher the scalar evaluation, the more interpretations of A(X) exist with a high satisfaction with respect to the linguistic quantifier.

When dealing with a quantified statement of type “Q B X are A,” the scalar evaluation is also interpreted in terms of α-cuts, that is, in terms of interpretations of fuzzy sets. For a given level α, the degree µS(α) provided by the gradual truth value represents the satisfaction of the quantifier with respect to the proportion |(A∩B)(X)α|/|B(X)α| (µS(α) is the truth value of the quantified statement when considering the two interpretations (A∩B)(X)α and B(X)α). The scalar value aggregates the different satisfactions provided by the different levels, and a scalar evaluation of 1 means that whatever the chosen quality threshold α, the proportion is in complete agreement with Q. Otherwise, the higher the scalar evaluation, the more quality thresholds exist such that the proportion highly satisfies Q.

In the section titled A Quantitative Approach, we consider a quantitative defuzzification (based on an additive measure, a surface), while in the section titled A Qualitative Approach, we consider a qualitative defuzzification (based on a non-additive process). The section titled Satisfaction of Properties situates the results provided by these two defuzzifications with respect to the properties introduced in the section titled About the Proposed Approaches to Evaluate Quantified Statements.
A Quantitative Approach

In this approach, the surface of the fuzzy truth value is delivered to the user. The scalar interpretation is then (Liétard & Rocacher, 2005):

δ = ( ∫₀¹ µS(α) · p · α^(p−1) dα )^(1/p).
When p = 1, the value δ is the area delimited by the function µS. Since this function is a stepwise function, we get:

δ = (α1 – 0) * µS(α1) + (α2 – α1) * µS(α2) + ... + (1 – αn) * µS(1),

where the discontinuity points are (α1, µS(α1)), ..., (αn, µS(αn)) with α1 < α2 < ... < αn.

Example. We consider the statement “about half B X are A” and the fuzzy truth value given by Figure 10. We compute: δ = (0.7 – 0.3) * 1/3 + (0.8 – 0.7) * 1 = 0.233. The scalar result is rather low. When referring to Table 2, it appears that the proportion of elements which are A among the B elements is close to 2/3. A low result for “about half B X are A” is coherent since the proportion 2/3 poorly satisfies the constraint about half. ♦

It has been shown (Liétard & Rocacher, 2005) that, when dealing with quantified statements of type “Q X are A,” this approach is a generalization of the OWA based interpretation (introduced in the section titled Interpretation by the OWA Operator). In addition, the next proof shows that, when considering “Q B X are A” statements and when B is normalized, this defuzzification leads to the GD method introduced in the section titled The GD Method for “Q B X are A” Statements (when B is not normalized, the two methods differ since the GD method imposes the normalization of B, while the gradual truth value associates the value 0 when the α-cut of B(X) is not defined).

Proof. In case of a “Q B X are A” statement, the discontinuity points (αi, µS(αi)) of the gradual truth value are associated with the αi values where the quantities µS(αi) vary. In other words:

•	the αi values come from the set D = {µA∩B(x) where x is in X} ∪ {µB(x) where x is in X},
•	µS(αi) = µQ(|(A∩B)(X)αi|/|B(X)αi|).

The defuzzification gives:

δ = (α1 – 0) * µQ(|(A∩B)(X)α1|/|B(X)α1|) + (α2 – α1) * µQ(|(A∩B)(X)α2|/|B(X)α2|) + ... + (αn – αn-1) * µS(1),

where the values from D are denoted α1 < α2 < ... < αn. This expression is clearly that of an interpretation using the GD method (cf. the section titled The GD Method for “Q B X are A” Statements).
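For p = 1, the quantitative defuzzification reduces to the area under the stepwise function µS, which is a few lines of code (representation: a {αi: µS(αi)} dict, µS being constant on each interval ]αi−1, αi]; the helper name is ours):

```python
def quantitative_delta(mu_s):
    """Area under a stepwise gradual truth value given as {alpha_i: mu_S(alpha_i)},
    mu_S being constant on each interval ]alpha_{i-1}, alpha_i]."""
    area, prev = 0.0, 0.0
    for alpha in sorted(mu_s):
        area += (alpha - prev) * mu_s[alpha]
        prev = alpha
    return area

# "about half B X are A": mu_S = 1/3 on ]0.3, 0.7], 1 on ]0.7, 0.8], 0 elsewhere
mu_s = {0.3: 0.0, 0.7: 1/3, 0.8: 1.0, 1.0: 0.0}
print(round(quantitative_delta(mu_s), 3))  # 0.233
```

This reproduces the worked value δ = (0.7 − 0.3) · 1/3 + (0.8 − 0.7) · 1 ≈ 0.233.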
A Qualitative Approach

According to this defuzzification, the scalar interpretation takes two aspects into consideration:

•	a guaranteed (minimal) satisfaction value β associated with the α-cuts (β must be as high as possible),
•	the repartition of β among the α-cuts (β should be attained by as many α-cuts as possible).

Obviously, these two aspects are in opposition since, in general, the higher β, the smaller the repartition. The scalar interpretation δ reflects a compromise between these two aspects and we get:

δ = max β in [0,1] min(β, each(β)),

where each(β) means “for each level α, µS(α) ≥ β.” A definition of each(β) delivering a degree is more convenient (Bosc & Liétard, 2005) and we propose to sum the lengths of the intervals (of levels) where the threshold β is reached:

each(β) = Σ (αj − αi), the sum being taken over the intervals ]αi, αj] such that ∀ α ∈ ]αi, αj], µS(α) ≥ β.
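The definitions of each(β) and δ can be prototyped directly on the stepwise representation used earlier (a {αi: µS(αi)} dict with µS constant on ]αi−1, αi]; helper names are ours):

```python
def each(mu_s, beta):
    """Total length of the level intervals on which mu_S(alpha) >= beta
    (mu_s: {alpha_i: mu_S(alpha_i)}, mu_S constant on ]alpha_{i-1}, alpha_i])."""
    total, prev = 0.0, 0.0
    for alpha in sorted(mu_s):
        if mu_s[alpha] >= beta:
            total += alpha - prev
        prev = alpha
    return total

def qualitative_delta(mu_s):
    """delta = max over the effective values beta of min(beta, each(beta))."""
    return max((min(b, each(mu_s, b)) for b in set(mu_s.values()) if b > 0),
               default=0.0)

# "about half B X are A": mu_S = 1/3 on ]0.3, 0.7], 1 on ]0.7, 0.8], 0 elsewhere
mu_s = {0.3: 0.0, 0.7: 1/3, 0.8: 1.0, 1.0: 0.0}
print(qualitative_delta(mu_s))  # 1/3, i.e., max(min(1/3, 0.5), min(1, 0.1))
```

Restricting the maximum to the effective values of µS(α) follows the observation (Bosc & Liétard, 2005) that other β values cannot improve the compromise.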
The higher each(β), the more numerous the levels α for which µS(α) ≥ β. In particular, each(β) = 1 means that for each level α, µS(α) is larger than (or equal to) β. In addition, from a computational point of view, the definition of δ requires handling an infinity of values β. However, it is possible (Bosc & Liétard, 2005) to restrict the computations to the β values belonging to the set of “effective” µS(α) values:

δ = max {β | ∃ α such that β = µS(α)} min(β, each(β)).

Example. We consider the statement “about half B X are A” and the fuzzy truth value given by Figure 10. The values β to be considered are 1/3 and 1. Furthermore: each(1/3) = 0.5 and each(1) = 0.1. We get δ = max(min(1/3, 0.5), min(1, 0.1)) = 1/3. As in the previous example, a low result for “about half B X are A” is coherent. ♦

It has been shown (Bosc & Liétard, 2005) that, when dealing with quantified statements of type “Q X are A” (Q increasing), this defuzzification leads to the Sugeno fuzzy integral based approach introduced in the section titled The Probabilistic Approach (GD Method).

Satisfaction of Properties

This section situates the results provided by the two defuzzifications with respect to the properties introduced in the section titled About the Proposed Approaches to Evaluate Quantified Statements. All the properties are satisfied, except property 7 which holds only when the set made of elements from X which satisfy B is normalized. We recall that the quantitative approach delivers:

δ = (α1 – 0) * µS(α1) + (α2 – α1) * µS(α2) + ... + (1 – αn) * µS(1),

while the qualitative approach delivers

δ = max {β | ∃ α such that β = µS(α)} min(β, each(β)),

where each(β) is defined by

each(β) = Σ (αj − αi), the sum being taken over the intervals ]αi, αj] such that ∀ α ∈ ]αi, αj], µS(α) ≥ β,

and the discontinuity points of the gradual truth value are (α1, µS(α1)), ..., (αn, µS(αn)) with α1 < α2 < ... < αn. We first demonstrate the validity of the properties related to “Q X are A” statements, then that of the properties related to “Q B X are A” statements.

Properties Related to “Q X are A” Statements

In case of a “Q X are A” statement, the αi values come from the set D = {µA(x) where x is in X} and µS(αi) = µQ(|A(X)αi|) (or µQ(|A(X)αi|/n) in case of a relative quantifier, with n the cardinality of set X). The different properties to be satisfied are:

Property 1. If predicate A is crisp, the evaluation must deliver µQ(|A(X)|) in case of an absolute quantifier and µQ(|A(X)|/n) in case of a relative quantifier (where A(X) is the crisp set made of elements from X which satisfy A and n is the cardinality of the crisp set X).

Proof. When A is crisp, D is a singleton ({1}) and the only discontinuity point of the gradual truth value (cf. Figure 11) is (1, µQ(|A(X)|)) (or (1, µQ(|A(X)|/n))).

Figure 11. The gradual truth value associated to property 1
The quantitative approach delivers (1 – 0) * µQ(|A(X)|) (or (1 – 0) * µQ(|A(X)|/n) when Q is relative) and property 1 holds. Concerning the qualitative approach, we demonstrate the validity of property 1 only in the case of an absolute quantifier; in case of a relative quantifier, it is necessary to change each expression µQ(|A(X)|) into µQ(|A(X)|/n) and the demonstration remains valid. When dealing with the qualitative approach, there is only one value β to be considered. This value equals µQ(|A(X)|) and each(β) = 1 (since for every level α in [0,1], µS(α) = µQ(|A(X)α|) = µQ(|A(X)|) = β). As a consequence, the result of the qualitative approach is min(β, each(β)) = min(µQ(|A(X)|), 1) and the property is valid.

Property 2. The evaluation is coherent with the universal and existential quantifiers. It means that the evaluation of “Q X are A” is ∨x∈X µA(x) when Q is ∃ and ∧x∈X µA(x) when Q is ∀ (∨ and ∧ being respectively a co-norm and a norm).

Proof. The universal quantifier is relative and defined by µ∀(1) = 1 and, for any k in [0,1[, µ∀(k) = 0. The gradual truth value is defined by:

•	µS(α) = µ∀(|A(X)α|/n) = 1 when |A(X)α|/n = 1, which means when α is smaller than the minimum of the membership degrees (denoted α1). This value α1 can be equal to 0 (when there exists at least one element x with µA(x) = 0),
•	µS(α) = 0 otherwise.

As a consequence, we obtain the gradual truth value given by Figure 12.

Figure 12. The gradual truth value associated to the universal quantifier

The fuzzy truth value has a unique discontinuity point (α1, 1) and the quantitative approach delivers δ = (α1 – 0) * 1 = α1, which is the minimum of the membership degrees. Property 2 is then satisfied using the minimum as a norm. When dealing with the qualitative approach, there is only one value β to be considered. This value equals β = 1 with each(β) = α1. As a consequence, the result is min(β, each(β)) = α1 and the property is valid.

The existential quantifier is absolute and defined by µ∃(0) = 0 and, for any k ≠ 0, µ∃(k) = 1. The discontinuity points of the gradual truth value are (α1, 1), ..., (αn, 1) (see Figure 13), where αn is the highest degree among the µA(x)’s.

Figure 13. The gradual truth value corresponding to the existential quantifier

The quantitative approach delivers δ = (α1 – 0) * 1 + (α2 – α1) * 1 + ... + (αn – αn-1) * 1 + (1 – αn) * 0 = αn, which is the maximum of the membership degrees. Property 2 is then satisfied using the maximum as a co-norm. When dealing with the qualitative approach, there is only one value β to be considered. This value equals β = 1 with each(β) = αn. As a consequence, the result is min(β, each(β)) = αn and the property is valid.

Property 3. The evaluation is coherent with quantifier inclusion. Given two quantifiers Q and Q’ such that Q ⊆ Q’ (∀x, µQ(x) ≤ µQ’(x)), the
evaluation of “Q X are A” cannot be larger than that of “Q’ X are A.”

Proof. The gradual truth value for “Q X are A” is denoted S, while that associated to “Q’ X are A” is denoted S’. Since we have ∀x, µQ(x) ≤ µQ’(x), it implies ∀ α in [0, 1], µS(α) ≤ µS’(α) (since the two quantified statements deal with the same set X and the same fuzzy predicate A). We denote δ and δ’ the respective evaluations of “Q X are A” and “Q’ X are A.” If the quantitative approach is chosen, we have

δ = ∫₀¹ µS(α) dα and δ’ = ∫₀¹ µS’(α) dα.

As a consequence, δ ≤ δ’ and property 3 is valid. If the qualitative approach is chosen:

δ = max β in [0,1] min(β, each(β)), with each(β) = Σ (αj − αi), the sum being taken over the intervals ]αi, αj] such that ∀ α ∈ ]αi, αj], µS(α) ≥ β,

δ’ = max β in [0,1] min(β, each’(β)), with each’(β) = Σ (αj − αi), the sum being taken over the intervals ]αi, αj] such that ∀ α ∈ ]αi, αj], µS’(α) ≥ β.

Since ∀ α in [0, 1], µS(α) ≤ µS’(α), we have each(β) ≤ each’(β), which gives δ ≤ δ’ and property 3 is demonstrated.

Properties Related to “Q B X are A” Statements

In case of a “Q B X are A” statement, the discontinuity points (αi, µS(αi)) of the gradual truth value are associated with the αi values where the quantities µS(αi) vary. In other words:

•	the αi values come from the set D = {µA∩B(x) where x is in X} ∪ {µB(x) where x is in X},
•	µS(αi) = µQ(|(A∩B)(X)αi|/|B(X)αi|).

The different properties to be satisfied are:

Figure 14. The gradual truth value associated to property 4
Property 4. If A and B are crisp and Q is relative, the evaluation must deliver µQ(|A(X) ∩ B(X)|/|B(X)|), where A(X) (resp. B(X)) is the set made of elements from X which satisfy A (resp. B).

Proof. When A and B are crisp, D is a singleton ({1}) and the only discontinuity point of the gradual truth value is (1, µQ(|(A∩B)(X)|/|B(X)|)), where (A∩B)(X) and B(X) are crisp sets (see Figure 14). The quantitative approach delivers (1 – 0) * µQ(|(A∩B)(X)|/|B(X)|) and property 4 holds. When dealing with the qualitative approach, there is only one value β to be considered. This value β equals µQ(|(A∩B)(X)|/|B(X)|) and each(β) = 1. As a consequence, the result of the qualitative approach is min(β, each(β)) = min(µQ(|(A∩B)(X)|/|B(X)|), 1) and the property is valid.

Property 5. When B is a Boolean predicate, the evaluation of “Q B X are A” is similar to that of “Q B(X) are A,” where B(X) is the (crisp) set made of elements from X which satisfy B.

Proof. This proof shows that the gradual truth value S associated to “Q B X are A” and the gradual truth value S’ associated to “Q B(X) are A” are exactly the same: ∀ α in [0,1], µS(α) = µS’(α). When B is a Boolean predicate, B(X)α is the crisp set B(X) for any level α. As a consequence, µS(α) = µQ(|(A∩B)(X)α|/|B(X)|). Since (A∩B)(X)α can be rewritten A(X)α ∩ B(X), we have:

µS(α) = µQ(|A(X)α ∩ B(X)|/|B(X)|).

It means that µS(α) is restricted to elements belonging to B(X), as is the case for the “Q B(X) are A” statement, and we obviously get: ∀ α in [0,1], µS(α) = µS’(α).

Property 6. If the set of elements which are B is included in the set of A elements, Q is relative and B is normalized, then the evaluation of “Q B X are A” is µQ(1) (since 100% of B elements are A due to the inclusion).

Proof. When the set of elements which are B is included in the set of A elements, we have µB(x) ≤ µA(x) for any element x from X. As a consequence, B(X)α ⊆ A(X)α for any level α, and thus ∀ α in [0, max x∈X µB(x)], µS(α) = µQ(1). Since B(X) is normalized, ∀ α in [0,1], µS(α) = µQ(1) and it is obvious to show that the two defuzzifications give µQ(1) as final result.

Property 7. If A(X) ∩ B(X) = ∅ (where A(X) (resp. B(X)) is the set made of elements from X which satisfy A (resp. B)), then the evaluation must return the value µQ(0). We show this property holds only when B(X) is normalized.

Proof. When A(X) ∩ B(X) = ∅, we have (A∩B)(X)α = ∅ for any level α in [0,1]. As a consequence, ∀ α in [0, max x∈X µB(x)], µS(α) = µQ(0). When B(X) is a normalized fuzzy set, we get ∀ α in [0,1], µS(α) = µQ(0) and it is obvious to show that the two defuzzifications give µQ(0) as final result.

Conclusion

This chapter is at the crossroads of quantified statement evaluation and the fuzzy arithmetic introduced in Rocacher and Bosc (2003a, 2003b, 2003c, 2005). It shows that fuzzy arithmetic allows the evaluation of quantified statements of type “Q X are A” and “Q B X are A.” The evaluation can be either a fuzzy truth value or a scalar value obtained by the defuzzification of the fuzzy value. Two types of scalar values can be distinguished: the first one corresponds to a quantitative view of the fuzzy value, the second one to a qualitative view. When dealing with quantified statements of type “Q X are A,” the two scalar values are respectively generalizations of the OWA based interpretation and of the Sugeno integral based interpretation.

When dealing with “Q B X are A” statements, our approach presents the advantage of providing a theoretical framework for computation. It is the first attempt to set this evaluation in the framework of an extended arithmetic and algebra. This aspect is very important since the properties provided by the algebraic framework hold, and we expect to obtain more interesting properties for the qualitative and quantitative approaches (in addition to the ones already stated in this chapter). As a consequence, further studies may concern the comparison of the qualitative and quantitative approaches in terms of properties. In addition, since they are both summaries of the same evaluation (in the form of a gradual number), they should not differ significantly.

References
Barro, S., Bugarin, A., Cariñena, P., & Diaz-Hermida, F. (2003). A framework for fuzzy quantification models analysis. IEEE Transactions on Fuzzy Systems, 11, 89-99. Blanco, I., Delgado, M., Martín-Bautista, M. J., Sánchez, D., & Vila, M. P. (2002). Quantifier guided aggregation of fuzzy criteria with associated importances. In T. Calvo, R. Mesiar, & G. Mayor (Eds.),
Evaluation of Quantified Statements Using Gradual Numbers
Aggregation operators: New trends and applications (Studies on Fuzziness and Soft Computing Series, pp. 272-290). Physica-Verlag. Bosc, P., & Liétard L. (1993). On the extension of the OWA operator to evaluate some quantifications. In Proceedings of the 1st European Congress on Fuzzy and Intelligent Technologies (EUFIT’93) (pp. 332-338), Aachen, Germany. Bosc, P., & Liétard, L. (1994a). Monotonous ������������������ quantifications and Sugeno fuzzy integrals. In Proceedings of the 5th IPMU Conference (pp. 1281-1286), Paris, France. Bosc, P., & Liétard L. (1994b). Monotonic quantified statements and fuzzy integrals. In NAFIPS/IFIS/ NASA’94 Joint Conference (pp. 8-12), San Antonio, Texas. Bosc, P., & Liétard, L. (2005). A general technique to measure gradual properties of fuzzy sets. In Proceedings of the 10th International Fuzzy Systems Association (IFSA) Congress, Beijing, China. Bosc, P., & Pivert, O. (1992). Some approaches for relational databases flexible querying. Journal of Intelligent Information Systems, 1, 323-354. Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17. Delgado, M., Sanchez, D., & Amparo M. V. (2002). A probabilistic definition of a nonconvex fuzzy cardinality. Fuzzy Sets and Systems, 126, 177-190. Delgado, M., Sanchez, D., & Vila, M. P. (2000). Fuzzy cardinality based evaluation of quantified sentences. International Journal of Approximate Reasoning, 23, 23-66. Diaz-Hermida, F., Bugarin, A., & Barro, S. (2003). Definition and classification of semi-fuzzy quantifiers for the evaluation fuzzy quantified sentences. International Journal of Approximate Reasoning, 34, 49-88. Diaz-Hermida, F., Bugarin, A., Cariñena, P., & Barro, S. (2004). Voting-model based evaluation of fuzzy
quantified sentences: A general framework. Fuzzy Sets and Systems, 146(1), 97-120.
Dubois, D., Godo, L., De Mantaras, R. L., & Prade, H. (1993). Qualitative reasoning with imprecise probabilities. Journal of Intelligent Information Systems, 2, 319-363.
Dubois, D., & Prade, H. (1985). Fuzzy cardinality and the modeling of imprecise quantification. Fuzzy Sets and Systems, 16, 199-230.
Dubois, D., & Prade, H. (1987). Fuzzy numbers: An overview. Analysis of Fuzzy Information, Mathematics and Logics, I, 3-39.
Dubois, D., & Prade, H. (1988a). On fuzzy syllogisms. Computational Intelligence, 4, 171-179.
Dubois, D., & Prade, H. (2005). Fuzzy elements in a fuzzy set. In Proceedings of the 10th International Fuzzy Systems Association (IFSA) Congress, Beijing, China.
Dubois, D., Prade, H., & Testemale, C. (1988b). Weighted fuzzy pattern matching. Fuzzy Sets and Systems, 28, 315-331.
Fan, Z. P., & Chen, X. (2005). Consensus measures and adjusting inconsistency of linguistic preference relations in group decision making. In Fuzzy systems and knowledge discovery (pp. 130-139). Berlin/Heidelberg: Springer.
Galindo, J. (2005). New characteristics in FSQL, a fuzzy SQL for fuzzy databases. WSEAS Transactions on Information Science and Applications, 2(2), 161-169.
Galindo, J. (2007). FSQL (fuzzy SQL): A fuzzy query language. Retrieved February 6, 2008, from http://www.lcc.uma.es/~ppgg/FSQL
Galindo, J., Medina, J. M., Pons, O., & Cubero, J. C. (1998). A server for fuzzy SQL queries. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems (pp. 164-174). Springer.
Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: IGI Publishing.
Glockner, I. (1997). DFS: An axiomatic approach to fuzzy quantification (Tech. Rep. No. TR97-06). University of Bielefeld.
Glockner, I. (2004a). Fuzzy quantifiers in natural language: Semantics and computational models. Der Andere Verlag (Germany).
Glockner, I. (2004b). Evaluation of quantified propositions in generalized models of fuzzy quantification. International Journal of Approximate Reasoning, 37, 93-126.
Kacprzyk, J. (1991). Fuzzy linguistic quantifiers in decision making and control. In Proceedings of the International Fuzzy Engineering Symposium (IFES’91) (pp. 800-811), Yokohama, Japan.
Kacprzyk, J., & Iwanski, C. (1992). Fuzzy logic with linguistic quantifiers in inductive learning. In L. A. Zadeh & J. Kacprzyk (Eds.), Fuzzy logic for the management of uncertainty (pp. 465-478). John Wiley and Sons.
Kacprzyk, J., Yager, R. R., & Zadrozny, S. (2006). Fuzzy linguistic summaries of databases for an efficient business data analysis and decision support. In Knowledge discovery for business information systems (pp. 129-159). The Netherlands: Springer.
Laurent, A., Marsala, C., & Bouchon-Meunier, B. (2003). Improvement of the interpretability of fuzzy rule based systems: Quantifiers, similarities and aggregators. In Modelling with words (pp. 102-123). Berlin/Heidelberg: Springer.
Liétard, L., & Rocacher, D. (2005). A generalization of the OWA operator to evaluate non monotonic quantifiers. In Proceedings of the 2005 Rencontres Francophones sur la Logique Floue et ses Applications (LFA’05).
Liu, Y., & Kerre, E. (1998a). An overview of fuzzy quantifiers (I): Interpretations. Fuzzy Sets and Systems, 95, 1-21.
Liu, Y., & Kerre, E. (1998b). An overview of fuzzy quantifiers (II): Reasoning and applications. Fuzzy Sets and Systems, 95, 135-146.
Losada, D. E., Díaz-Hermida, F., & Bugarín, A. (2006). Semi-fuzzy quantifiers for information retrieval. In Soft computing in Web information retrieval (pp. 195-220). Berlin/Heidelberg: Springer.
Loureiro Ralha, J. C., & Ghedini Ralha, C. (2004). Towards a natural way of reasoning. In Advances in artificial intelligence–SBIA 2004 (pp. 114-123). Berlin/Heidelberg: Springer.
Malczewski, J., & Rinner, C. (2005). Exploring multicriteria decision strategies in GIS with linguistic quantifiers: A case study of residential quality evaluation. Journal of Geographical Systems, 7(2), 249-268.
Mizumoto, M., Fukami, S., & Tanaka, K. (1979). Fuzzy conditional inferences and fuzzy inferences with fuzzy quantifiers. In Proceedings of the 6th International Joint Conference on Artificial Intelligence (pp. 589-591), Tokyo, Japan.
Prade, H. (1990). A two-layer fuzzy pattern matching procedure for the evaluation of conditions involving vague quantifiers. Journal of Intelligent and Robotic Systems, 3, 93-101.
Rasmussen, D., & Yager, R. R. (1997). A fuzzy SQL summary language for data discovery. In D. Dubois, H. Prade, & R. R. Yager (Eds.), Fuzzy information engineering: A guided tour of applications (pp. 253-264). New York: Wiley.
Rocacher, D. (2003). On fuzzy bags and their application to flexible querying. Fuzzy Sets and Systems, 140(1), 93-110.
Rocacher, D., & Bosc, P. (2003a). About Zf, the set of fuzzy relative integers, and the definition of fuzzy bags on Zf. Lecture Notes in Computer Science, 2715, 95-102. Springer-Verlag.
Rocacher, D., & Bosc, P. (2003b). Entiers relatifs flous et multi-ensembles flous [Fuzzy relative integers and fuzzy bags]. In Rencontres Francophones sur la Logique Floue et ses Applications (LFA’03) (pp. 253-260).
Rocacher, D., & Bosc, P. (2003c). Sur la définition des nombres rationnels flous [On the definition of fuzzy rational numbers]. In Rencontres Francophones sur la Logique Floue et ses Applications (LFA’03) (pp. 261-268).
Rocacher, D., & Bosc, P. (2005). The set of fuzzy rational numbers and flexible querying. Fuzzy Sets and Systems, 155(3), 317-339.
Sanchez, E. (1988). Fuzzy quantifiers in syllogisms, direct versus inverse computation. Fuzzy Sets and Systems, 28, 305-312.
Sicilia, M. A., Díaz, P., Aedo, I., & García, E. (2002). Fuzzy linguistic summaries in rule-based adaptive hypermedia systems. In Adaptive Hypermedia and Adaptive Web-Based Systems, Second International Conference, Málaga, Spain.
Vila, M. A., Cubero, J. C., Medina, J. M., & Pons, O. (1997). Using OWA operator in flexible query processing. In The ordered weighted averaging operators: Theory, methodology and applications (pp. 258-274).
Wygralak, M. (1999). Questions of cardinality of finite fuzzy sets. Fuzzy Sets and Systems, 102, 185-210.
Yager, R. R. (1982). A new approach to the summarization of data. Information Sciences, 28, 69-86.
Yager, R. R. (1983a). Quantifiers in the formulation of multiple objective decision functions. Information Sciences, 31, 107-139.
Yager, R. R. (1983b). Quantified propositions in a linguistic logic. International Journal of Man-Machine Studies, 19, 195-227.
Yager, R. R. (1984). General multiple-objective decision functions and linguistically quantified statements. International Journal of Man-Machine Studies, 21, 389-400.
Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multicriteria decision making. IEEE Transactions on Systems, Man, and Cybernetics, 18, 183-190.
Yager, R. R. (1992). On a semantics for neural networks based on fuzzy quantifiers. International Journal of Intelligent Systems, 7, 765-786.
Yager, R. R. (1993). Families of OWA operators. Fuzzy Sets and Systems, 59, 125-148.
Yager, R. R., & Kacprzyk, J. (1997). The ordered weighted averaging operators: Theory and applications. Boston: Kluwer.
Ying, M. (2006). Linguistic quantifiers modeled by Sugeno integrals. Artificial Intelligence, 170, 581-606.
Zadeh, L. A. (1983). A computational approach to fuzzy quantifiers in natural languages. Computers & Mathematics with Applications, 9, 149-183.
Key Terms

Fuzzy Predicate: A predicate defined by a fuzzy set. A fuzzy predicate delivers a degree of satisfaction.

Gradual Integer: An integer which takes the form of a fuzzy subset of the set of naturals (interpreted as a conjunction). Such integers differ from fuzzy numbers, which are interpreted as disjunctions of candidates.

Gradual Rational Number: A gradual number interpreted as a conjunction and defined as the ratio of two gradual relative integers.

Gradual Relative Integer: A gradual number represented by a fuzzy subset of the set of relative integers (interpreted as a conjunction). It is defined as the subtraction of two gradual integers.

Linguistic Quantifiers: Quantifiers defined by linguistic expressions like “around 5” or “most of.” Such quantifiers allow an intermediate attitude between conjunction (expressed by the universal quantifier ∀) and disjunction (expressed by the existential quantifier ∃).

OWA Operator: Ordered Weighted Average operator. The inputs are assumed to be sorted, and the weights of this average are associated with the input data according to their rank (weight w1 is associated with the largest input, weight w2 with the second largest, and so forth).

Sugeno Fuzzy Integral: An aggregation operator which can be viewed as a compromise between two aspects: (1) a certain quantity (a fuzzy measure) and (2) a quality of information (a fuzzy set).
Chapter XI
FSQL and SQLf:
Towards a Standard in Fuzzy Databases Angélica Urrutia Universidad Católica del Maule, Chile Leonid Tineo Universidad Simón Bolivar, Venezuela Claudia Gonzalez Universidad Simón Bolivar, Venezuela
Abstract

Currently, FSQL and SQLf are the main fuzzy-logic-based extensions proposed for SQL. It would be very interesting to integrate them into a standard for fuzzy databases; the issue is what to take from each proposal. In this chapter, we analyze FSQL and SQLf, comparing them along several dimensions: approach direction, fuzzy components, system architecture, satisfaction degrees, evaluation mechanisms, and experimental performance. We observe that there are powerful and interesting features in both proposals that could be combined in a unified language for fuzzy relational databases.
Introduction

In order to give greater flexibility to relational database management systems (RDBMSs), different languages and models have been conceived that incorporate fuzzy logic concepts into information treatment. Two outstanding proposals applying fuzzy logic to databases are FSQL (Galindo, 1999, 2007) and SQLf (Bosc & Pivert, 1995a). This chapter compares these two proposals from different points of view.
FSQL was created to allow the treatment of uncertainty in fuzzy RDBMSs. It allows the representation and manipulation of both precise and vague data, distinguishing three data categories: crisp, referential ordered, and referential not ordered. It uses possibility distributions and similarity relations to represent vague data, following the GEFRED model (Medina, 1994; Medina, Pons, & Vila, 1994). To manipulate these data, FSQL extends some components of SQL with elements of fuzzy logic, including the use of possibility and necessity measures. Around FSQL, a catalogue named FMB has been conceived to represent vague data and linguistic terms in a relational database. Additionally, FuzzyEER, an extension of the EER (Extended Entity-Relationship) model, has been conceived to allow the conceptual design of databases that incorporate vague data (Urrutia, 2003; Urrutia, Galindo, Jiménez, & Piattini, 2006; Urrutia, Galindo, & Piattini, 2002). A mechanism for translating a FuzzyEER conceptual scheme into the FMB has been established (Galindo, Urrutia, & Piattini, 2006). Two implementations of FSQL are known at present, one on Oracle (Galindo, 1999, 2007) and the other on PostgreSQL (Maraboli & Abarzua, 2006).

SQLf was conceived to represent vague requirements in queries to relational databases. It includes fuzzy-logic-based extensions of all the elements of the SQL standards up to SQL3. In this language, query conditions may involve diverse user-defined linguistic terms that are specified through an extension of the DDL. SQLf allows fuzzy queries over precise data, producing discriminated answers; that is to say, each row in the answer has an associated degree of satisfaction of the vague requirement represented by the query. To evaluate SQLf queries, it has been proposed to take advantage of the existing connections between fuzzy and classical sets: from a fuzzy query, the derivation principle allows one to obtain a derived precise query, and the processing of the fuzzy query is made on the result of this derived query. There are two known SQLf implementations, both on Oracle.

The comparison made in this work concerns the following aspects: variety in the use of fuzzy logic elements; the semantics of the satisfaction degrees in the answer set; evaluation mechanisms for query processing; proposed architectures for the implementation; and an experimental performance analysis of the current prototypes.
With the research work presented in this chapter, we open the way for the integration of FSQL and SQLf towards a new standard for the treatment of fuzziness in databases. This chapter is organized as follows:
The next section gives a basic background on fuzzy sets; you can read more about this in the first chapter of this handbook. In the following section, we present the approach directions of FSQL and SQLf before pointing out the fuzzy components of both languages. Then, we give a general view of the architecture of the SQLf and FSQL implementations. We also devote a section to the use of satisfaction (fulfillment) degrees in these languages. Evaluation mechanisms for fuzzy queries, according to the two proposals, are then discussed, along with an experimental performance analysis of the existing prototypes. Finally, we address some conclusions and future trends of this work.
Fuzzy Sets Background

Fuzzy sets were introduced in Zadeh (1965) to model fuzzy classes in control systems, and their use has been expanded to different domains: mathematics, classification, pattern matching, artificial intelligence, and so forth. In the first chapter of this volume, Galindo introduces fuzzy logic and fuzzy databases. See also the overview chapter by Kacprzyk, Zadrozny, De Tré, and De Caluwe in this book about fuzzy approaches to flexible database querying.
Fuzzy Sets

Fuzzy set theory stems from classic set theory, adding a membership function to the elements of the set, defined so that each element is assigned a real number between 0 and 1 (Zadeh, 1965, 1978). A fuzzy set A over the universe of discourse U is defined by means of a membership function µA: U → [0,1]. This function indicates the degree to which an element u is included in the concept represented by the fuzzy set: the degree 0 means that the element is completely excluded from the set, while the degree 1 means that it is completely included. It is also possible to represent a fuzzy set as a set of pairs:
A = {µA(u)/u : u ∈ U, µA(u) ∈ [0,1]}  (1)
Linguistic Label

A linguistic label is a natural language word that expresses or identifies a fuzzy set. With this definition, we can assert that in our everyday life we use several linguistic labels to express abstract concepts such as “young,” “old,” “cold,” “hot,” “cheap,” “expensive,” and so forth. This intuitive definition not only varies from one person to another and depends on the moment, but also varies with the context in which it is applied. For example, the linguistic label “high” does not measure the same in the phrases “a high person” and “a high building.”

Example 1: Suppose we express the qualitative concept “young” by means of a fuzzy set, where the X axis represents the universe of discourse “age” (in natural numbers) and the Y axis represents the membership degrees in the interval [0,1]. The fuzzy set representing that concept could be expressed in the following way (considering a discrete universe):

Young = 1/0 + ... + 1/25 + 0.9/26 + 0.8/27 + 0.7/28 + 0.6/29 + 0.5/30 + ... + 0.1/34
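As an illustration, the discrete fuzzy set of Example 1 can be encoded directly as a membership function; this is a minimal Python sketch, where the function name and the zero cutoff at age 35 are our reading of the example rather than notation from the text:

```python
# Membership function for the linguistic label "young" (Example 1):
# full membership up to age 25, then membership drops by 0.1 per year,
# reaching 0 at age 35.
def mu_young(age: int) -> float:
    if age <= 25:
        return 1.0
    if age >= 35:
        return 0.0
    return round(1.0 - 0.1 * (age - 25), 1)

# The set-of-pairs notation {mu/u} keeps only elements with nonzero membership.
young = {age: mu_young(age) for age in range(0, 36) if mu_young(age) > 0}
print(young[25], young[27], young[30])  # 1.0 0.8 0.5
```

Reading off the dictionary reproduces the degrees listed in Example 1, for example 0.8 for age 27 and 0.5 for age 30.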
Fuzzy Number

The concept of a fuzzy number was introduced in Zadeh (1978) with the purpose of analyzing and manipulating approximate numeric values, for example, “near 0” and “almost 5.” The concept has since been refined (Dubois & Prade, 1985, 1998), and several definitions exist.

Definition 1: Let A be a fuzzy set in X and mA: X → [0,1] its membership function. A is a fuzzy number if mA is convex and upper semicontinuous and the support of A is bounded. These requirements can be relaxed. Some authors add the constraint of being normalized
in the definition, that is, sup(mA(x)) = 1. The general form of the membership function of a fuzzy number A can be seen in Figure 1; it can be defined as:

mA(x) = rA(x)  if x ∈ [α, β)
        h      if x ∈ [β, γ]
        sA(x)  if x ∈ (γ, δ]
        0      otherwise
where rA: X → [0,1] and sA: X → [0,1], rA is increasing, sA is decreasing, and rA(β) = h = sA(γ) with h ∈ (0,1] and α, β, γ, δ ∈ X. The number h is called the height of the fuzzy number, and the interval [β, γ] is its kernel. A particular case of fuzzy numbers is obtained when the functions rA and sA are linear. This type of function is often used, and we call such fuzzy numbers triangular or trapezoidal. We will usually work with normalized fuzzy numbers, for which h = 1; in this case, a normalized trapezoidal fuzzy number A can be characterized by the four really necessary numbers: A ≡ (α, β, γ, δ).
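The trapezoidal case can be sketched as a small function parameterized by the four characteristic numbers (α, β, γ, δ) and the height h; the function name `trapezoid` and the particular shape chosen for “almost 5” are illustrative assumptions:

```python
def trapezoid(alpha, beta, gamma, delta, h=1.0):
    """Membership function of a trapezoidal fuzzy number A = (alpha, beta, gamma, delta).

    Linearly increasing on [alpha, beta), constant h on the kernel [beta, gamma],
    linearly decreasing on (gamma, delta], and 0 outside the support.
    """
    def mu(x):
        if x < alpha or x > delta:
            return 0.0
        if x < beta:                                 # increasing branch r_A
            return h * (x - alpha) / (beta - alpha)
        if x <= gamma:                               # kernel
            return h
        return h * (delta - x) / (delta - gamma)     # decreasing branch s_A
    return mu

# "almost 5" as a normalized triangular number (beta == gamma == 5)
almost5 = trapezoid(3, 5, 5, 7)
print(almost5(4))  # 0.5
```

A triangular number is just the degenerate trapezoid with an empty-width kernel, which is why a single constructor covers both shapes.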
Fuzzy Logic

Fuzzy set theory is the basis of fuzzy logic. In this logic, the truth value of a sentence (or satisfaction degree) lies in the real interval [0,1]: the value 0 represents completely false, and 1 completely true. The truth value of a sentence s will be denoted µ(s). This logic allows giving an interpretation to linguistic terms:
Figure 1. General fuzzy number
• Predicates (synonyms of linguistic labels) are the atomic components of this logic, each defined by the membership function of a fuzzy set. For example, linguistic terms such as “young,” “tall,” “heavy,” and “low” are predicates.
• Modifiers, linguistic terms that allow defining modified fuzzy predicates, are interpreted by means of transformations of the membership function. In this category are the natural language adverbs, for example, “very,” “relatively,” and “extremely.”
• Comparators, kinds of fuzzy predicates defined on pairs of elements, establish fuzzy comparisons; for example, “much greater than,” “approximately equal to,” and “close to” are fuzzy comparators.
• Connectors are operators defined for combining fuzzy sentences. Fuzzy negation, conjunction, and disjunction are extensions of the classical ones; they preserve the existing correspondence with the set operations complement (negation), intersection, and union, respectively. Connectors may be classified by the number of their operands as unary (such as negation), binary (such as implication), or multi-ary (such as average).
• Quantifiers are terms describing quantities such as “most of,” “about a half,” and “around 20.” They are an extension of the classical existential and universal quantifiers. Two types of fuzzy quantifiers are distinguished: absolute and proportional (relative). Absolute quantifiers represent amounts that are absolute in nature, such as “about 5” or “more than 20.” An absolute quantifier can be represented by a fuzzy subset Q such that, for any non-negative real p ∈ R+, the membership grade of p in Q (denoted by µQ(p)) indicates the degree to which the amount p is compatible with the quantifier represented by Q. Proportional or relative quantifiers, such as “at least half” or “most,” can be represented by fuzzy subsets defined in the unit interval [0,1]. For any proportion p ∈ [0,1], µQ(p) indicates the degree to which the proportion p is compatible with the meaning of the quantifier.
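The absolute/relative distinction is easy to make concrete. In the sketch below, the particular membership shapes (“most” rising linearly between the proportions 0.3 and 0.8, “about 5” triangular with spread 3) are illustrative assumptions, not definitions taken from the chapter:

```python
def most(p: float) -> float:
    """Relative quantifier: mu_Q over a proportion p in [0,1]."""
    return min(max((p - 0.3) / 0.5, 0.0), 1.0)

def about_5(n: float) -> float:
    """Absolute quantifier: mu_Q over a non-negative amount n."""
    return max(1.0 - abs(n - 5) / 3.0, 0.0)

# Degree to which "most" is compatible with 7 satisfied criteria out of 10,
# and degrees to which the amounts 5 and 8 are compatible with "about 5":
print(most(7 / 10))            # ~0.8
print(about_5(5), about_5(8))  # 1.0 0.0
```

The relative quantifier is always evaluated on a proportion, the absolute one directly on a count; this is exactly the distinction the last bullet draws.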
Definition 2: Functionally, linguistic quantifiers are usually of one of three types. Increasing quantifiers (such as “at least n,” “all,” “most”) are characterized by Q(a) ≤ Q(b) for all a ≤ b. OR is captured by K(x) = 1 for x > 0, FOR-ALL by K(x) = 0 for x < 1, and SOME by K(x) = x, while one possibility (of many) to introduce MOST is by a power of SOME, for example, K(x) = x³. Thus, we assume the general query expression:

Q = < q1, …, qn : M : K >,  (29)
where q1 , …, qn are the query descriptors, M specifies their importance weighting, and K specifies a linguistic quantifier, thereby indicating an order weighting. So with qi(D) as the degrees to which D satisfies the descriptor qi, the corresponding generalized valuation function is (compare with Formula (27)):
ValQ(D) = FM,w(K)(q1(D), …, qn(D)),  (30)
where w is a function that takes a quantifier K and maps it to an n-vector w(K) ∈ [0,1]^n of order weights (for instance, w(ALL) = (0, …, 0, 1)). A hierarchical approach to aggregation, generalizing OWA, is introduced in Yager (2000). Basically, hierarchical aggregation extends OWA to capture nested expressions. Query attributes may be grouped for individual aggregation, and the language is orthogonal in the sense that aggregated values may appear as arguments to aggregations. Thus, queries may be viewed as hierarchies. To illustrate, consider the following nested query expression:

< q1(D), < q2(D), q3(D), < q4(D), q5(D), q6(D) : M3 : K3 > : M2 : K2 > : M1 : K1 >  (31)
Again, qi(D) ∈ [0,1] measures the degree to which descriptor qi conforms to the text object with description D, while Mj and Kj are the importance and quantifier applied in the j’th aggregate. In the expression above, M1 : K1 parameterizes aggregation at the outermost level of the two components q1(D) and the middle sub-expression. M2 : K2 parameterizes aggregation of the three components q2(D), q3(D), and the innermost sub-expression, while M3 : K3 parameterizes aggregation of the three components q4(D), q5(D), and q6(D).
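The mapping w from a quantifier K to order weights is not spelled out here; a common choice (Yager's construction) is w_i = K(i/n) − K((i−1)/n), which indeed sends FOR-ALL to (0, …, 0, 1) and SOME to equal weights. A sketch under that assumption:

```python
def order_weights(K, n):
    """Order weights from an increasing quantifier K with K(0)=0, K(1)=1
    (an assumed construction: w_i = K(i/n) - K((i-1)/n))."""
    return [K(i / n) - K((i - 1) / n) for i in range(1, n + 1)]

def owa(weights, degrees):
    """OWA: weights attach to ranks, w[0] multiplying the largest degree."""
    return sum(w * d for w, d in zip(weights, sorted(degrees, reverse=True)))

SOME = lambda x: x                        # yields the arithmetic mean
ALL = lambda x: 0.0 if x < 1 else 1.0     # yields the minimum
MOST = lambda x: x ** 3                   # one choice mentioned in the text

degrees = [0.9, 0.6, 0.3]
print(owa(order_weights(ALL, 3), degrees))   # 0.3 (minimum)
print(owa(order_weights(SOME, 3), degrees))  # ~0.6 (average)
```

MOST, as a power of SOME, produces weights skewed toward the smaller degrees, so it behaves between the average and the minimum.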
Query Evaluation Approaches

On top of the OWA aggregation principle and the extended hierarchical version of OWA, we can distinguish two major cases of description structures: simple un-nested sets and nested sets, the former perfectly handled by OWA aggregation and the latter by hierarchical aggregation.
Query Expansion by Taxonomy
Aggregation on Un-Nested Descriptions

The simple set-of-descriptors structure for descriptions in Formula (25) admits a straightforward valuation approach for a similarity query:

Qsim = < q1, …, qn : (1, 1, …) : SOME >  (32)

The aggregation here is simple in that importance is not distinguished and SOME, corresponding to the simple average, is used as quantifier. An example of a valuation is:

ValQsim(D) = F(1,1,…),w(SOME)(q1(D), …, qn(D)),  (33)

with individual query-descriptor valuation functions such as:

qi(D) = maximumj{µj : µj/dj ∈ similar(qi), dj ∈ D}  (34)

To illustrate, assume a weighted shared node similarity and consider again Figure 4. In continuation of the WSN example from the previous section, assume ρ = 0.8 and consider the query Q = < dog [CHR:black], noise >. With a threshold for similar of 0.4, we have what is shown in Box 2.
With the example valuation function (33), thus giving all query terms equal importance and taking the simple arithmetic average as aggregation, the following are examples of query valuations for the query Q = < dog [CHR:black], noise >:

ValQsim({noise [CBY:dog]}) = 0.90
ValQsim({noise [CBY:dog [CHR:black]]}) = 0.87
ValQsim({dog, noise}) = 0.84
ValQsim({black, dog, noise}) = 0.72

That ValQsim({noise [CBY:dog]}) = 0.90 is derived according to Formula (34) as max(0.42, 0.90) = 0.9, while ValQsim({dog, noise}) = 0.84 is the average (according to (33)) of the degrees to which dog [CHR:black] and noise, respectively, correspond to the document represented by {dog, noise}:

ValQsim({dog, noise}) = (max(0.68, 0.42) + max(0.47, 1)) / 2 = 0.84
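The computation of Formulas (33)-(34) can be traced in a few lines. The similarity degrees below are only the ones quoted in the worked {dog, noise} example; the full table from Box 2 is not reproduced here:

```python
def val(query, D, similar):
    """Formulas (33)-(34): each query descriptor takes the best matching
    document descriptor; descriptor scores are averaged (SOME, equal weights)."""
    def q(qi):
        return max((mu for d, mu in similar[qi].items() if d in D), default=0.0)
    return sum(q(qi) for qi in query) / len(query)

# Assumed excerpt of the similarity table, read off the worked example:
similar = {
    "dog [CHR:black]": {"dog": 0.68, "noise": 0.42},
    "noise": {"dog": 0.47, "noise": 1.0},
}
print(val(["dog [CHR:black]", "noise"], {"dog", "noise"}, similar))  # 0.84
```

The inner `max` is the per-descriptor step of Formula (34); the outer average is the SOME aggregation of Formula (33).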
Nested Aggregation on Un-Nested Descriptions

An alternative is to expand the query Q to a nested expression:
where for each qi we set < μi1/qi1, …, μiki/qiki > = similar(qi) and use as individual valuation:

qij(D) = μij  when qij ∈ {d1, …, dm}
         0    otherwise  (36)
In the event that we use equal importance and the following combination of quantifiers:

ValQsim(D) = << q11(D), …, q1k1(D) : (1, 1, …) : EXIST >, < q21(D), …, q2k2(D) : (1, 1, …) : EXIST >, …, < qn1(D), …, qnkn(D) : (1, 1, …) : EXIST > : (1, 1, …) : SOME >,  (37)

we get a valuation identical to that of Formula (33). Nested expressions, however, facilitate importance adjustment in connection with query expansion according to the kinds of relations contributing to the expansion. Assigning 1.0 importance to IS-A and 0.5 importance to CHR would, for the query Q = < dog [CHR:black], noise >, lead to the following expansion (compare with Figure 4):

ValQsim(D) = < …, < qnoise(D), … : (1, 1, …) : EXIST > : (1, 1, …) : SOME >  (38)

Nested expressions are thus a way of distinguishing the influence of different kinds of relations on similarity.
Aggregation on Nested Descriptions

In some cases, when text is processed by partial analysis as indicated earlier, an intrinsic structure appears as the most obvious choice for the description. The parser used in the project reported
on here is a two-phase parser, grouping words in the sentence into groups corresponding to noun phrases in the first phase, and deriving compound descriptors from the words in each noun phrase individually in the second phase. Thus, we have as an intrinsic structure from the first phase a set of sets (or lists) of words. If we could always extract a unique compound concept as descriptor from an inner set, the resulting intrinsic structure from the second phase would be the single set as assumed above. However, in many cases this is not possible, and we would therefore lose information by flattening to a single set. This suggests that descriptions should be sets of sets of descriptors, such that the query structure becomes:

Q = < Q1, …, Qn > = << q11, …, q1k1 >, …, < qn1, …, qnkn >>,  (39)

where the Qi are sets of descriptors qij, j = 1, …, ki, and a text index is:

D = {D1, …, Dm} = {{d11, …, d1l1}, …, {dm1, …, dmlm}},  (40)
where the Di are sets of descriptors dij, j = 1, …, li. This, however, demands a modified valuation, and since in this case the initial query expression is nested, a valuation over a nested aggregation also becomes the obvious choice. Note first that the grouping of descriptors in descriptions has the obvious interpretation of a closer binding of descriptors within a group compared to across different groups. So we cannot individually evaluate each qij(D), but have to compare at the level of the groups, for instance, by a restrictive quantification over qi1(Dj), …, qiki(Dj) and an EXIST quantification over j to get the best matching Dj for a given Qi. A valuation can thus be:

ValQsim(D) = <<< q11(D1), …, q1k1(D1) : M11 : MOST >, …, < q11(Dm), …, q1k1(Dm) : M1m : MOST > : M1 : EXIST >, …, << qn1(D1), …, qnkn(D1) : Mn1 : MOST >, …, < qn1(Dm), …, qnkn(Dm) : Mnm : MOST > : Mn : EXIST > : M0 : SOME >.  (41)

The individual query-descriptor valuation functions can be set to:

qij(Dk) = maximuml{µl : µl/dkl ∈ similar(qij)}  (42)

As opposed to the single set description example above, the qij in this instance are the original descriptors from the query. While the choices of inner quantifiers are significant for a correct interpretation, the choice of SOME at the outer level for the component description is just one of many possible choices for reflecting the user’s preference of overall aggregation.
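The group-level matching just described (a restrictive quantifier inside each group, EXIST across description groups, SOME across query groups) can be sketched as follows; the function names, the use of min as a stand-in for a restrictive MOST-like quantifier, and the toy similarity table are all illustrative assumptions:

```python
def group_match(Qi, Dj, similar):
    """Degree to which one query group Qi matches one description group Dj:
    each query descriptor takes its best counterpart in Dj (formula (42)),
    and the per-descriptor degrees are combined restrictively (here: min)."""
    return min(
        max((mu for d, mu in similar.get(q, {}).items() if d in Dj), default=0.0)
        for q in Qi
    )

def val_nested(Q, D, similar):
    """EXIST over description groups (max), SOME over query groups (mean)."""
    return sum(max(group_match(Qi, Dj, similar) for Dj in D) for Qi in Q) / len(Q)

similar = {"dog": {"dog": 1.0, "hound": 0.8}, "black": {"black": 1.0}}
Q = [{"black", "dog"}]                      # one query group (noun phrase)
D = [{"hound"}, {"black", "hound", "cat"}]  # two description groups
print(val_nested(Q, D, similar))  # 0.8
```

Because matching is per group, the descriptor "black" only helps when it co-occurs with a dog-like descriptor in the same group, which is exactly the closer within-group binding the text motivates.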
Conclusion

The emphasis in this chapter has been on a specific application of knowledge structures like taxonomies and ontologies, namely the expansion of queries. Ontologies, as a generalization of taxonomies, were briefly surveyed; similarity was introduced as the key to avoiding reasoning while still reflecting ontological knowledge; and approaches to query expansion and comparison at the level of descriptions were discussed. The general idea is to provide retrieval guided by the domain-specific knowledge comprised by the ontology. As far as ontologies are concerned, we have, in addition to the survey, presented a specific lattice-algebraic representation formalism. We consider this formalism appropriate for the purpose, partly because it easily generalizes to generative ontologies and also captures derived concepts (generativity is in fact inherent), and partly because
so-called instantiated ontologies can be derived by simple means using this formalism. When it comes to similarity measures, there are many alternatives, which is also the case for the properties proposed to characterize these measures. In conclusion, it is our view that taxonomic structure should play a key role as a source for similarity. The fact that Resnik ignores the path length below the least upper bound, for instance, appears to be too coarse-grained. Corpus statistics, on the other hand, should be taken into account whenever available. Regarding the simple generic approach presented in this chapter, taking instantiated ontologies as a source for similarity would most probably give better results in many cases, regardless of the taxonomic measure applied. The connectivity in the taxonomic structure becomes especially interesting in connection with document retrieval when this structure reflects the actual content of the document base. However, it appears that more sophisticated statistics, like Resnik’s original idea of applying information theory, have great potential, especially in combination with more thorough taxonomic excerpts. Moreover, it is probably also worth considering alternatives that include regular distributional similarity, as discussed in Mohammad and Hirst (2006) and Weeds and Weir (2005), as well as considering possibilities for combining these approaches with ontology-based approaches. Query expansion is first of all a matter of comparison at the level of descriptions. The query is represented by a single description, which is then to be compared with the descriptions in the information base referring to documents. So the most obvious way to realize this is by means of expansion of the query embedding similar concepts, provided, of course, that the evaluation principle can aggregate the degree of match appropriately.
It appears, however, that a more detailed interpretation of the query expression, leading to a description reflecting structure (formal-language queries) and/or semantics (NL queries), has interesting potential and should be investigated further. The flexible hierarchical aggregation can be applied to embed the quantification, logical connectives, and importance specifications of a formal query language, as well as syntax and semantics for the NL query, such as noun-phrase structure and part-of-speech information, opening up a more refined interpretation.
References

Andreasen, T., Bulskov, H., & Knappe, R. (2003). Similarity for conceptual querying. Paper presented at the 18th International Symposium on Computer and Information Sciences, Antalya, Turkey (pp. 268-275).

Andreasen, T., Bulskov, H., & Knappe, R. (2005). On automatic modeling and use of domain-specific ontologies. Paper presented at the 15th International Symposium on Methodologies for Intelligent Systems, Saratoga Springs, New York (pp. 74-82).

Andreasen, T., Jensen, P. A., Nilsson, J. F., Paggio, P., Pedersen, B. S., & Thomsen, H. E. (2002). Ontological extraction of content for text querying. Paper presented at the 6th International Conference on Applications of Natural Language to Information Systems (Revised Papers), Stockholm, Sweden (pp. 123-136).

Andreasen, T., Knappe, R., & Bulskov, H. (2005). Domain-specific similarity and retrieval. Paper presented at the 11th International Fuzzy Systems Association World Congress, Beijing, China (pp. 496-502).

Baeza-Yates, R. A., & Ribeiro-Neto, B. A. (1999). Modern information retrieval. ACM Press/Addison-Wesley.

Bandos, J. A., & Resnick, M. L. (2002). Understanding query formation in the use of Internet search engines. Paper presented at the Human Factors and Ergonomics Society 46th Annual Meeting (pp. 1291-1296).

Baziz, M., Boughanem, M., Loiseau, Y., & Prade, H. (2007). Fuzzy logic and ontology-based information retrieval. In P. P. Wang, D. Ruan, & E. E. Kerre (Eds.), Fuzzy logic: A spectrum of theoretical and practical issues (pp. 193-218). Springer.

Berners-Lee, T. (1998). Semantic Web roadmap. Retrieved February 8, 2008, from http://www.w3.org/DesignIssues/Semantic.html

Borst, W. N. (1997). Construction of engineering ontologies for knowledge sharing and reuse. Enschede, The Netherlands: Centre for Telematics and Information Technology.

Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 1-35.

Bulskov, H., Knappe, R., & Andreasen, T. (2002). On measuring similarity for conceptual querying. Paper presented at the 5th International Conference on Flexible Query Answering Systems, Copenhagen, Denmark (pp. 100-111).

Chaudhri, V. K., Farquhar, A., Fikes, R., Karp, P. D., & Rice, J. P. (1998). OKBC: A programmatic foundation for knowledge base interoperability. Paper presented at the 15th National Conference on Artificial Intelligence, Madison, Wisconsin (pp. 600-607).

Cross, V. (2004). Fuzzy semantic distance measures between ontological concepts. Paper presented at the International Conference of the North American Fuzzy Information Processing Society.

Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. The MIT Press.

Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199-220.

Guarino, N., & Giaretta, P. (1995, April). Ontologies and knowledge bases: Towards a terminological clarification. Paper presented at Towards Very Large Knowledge Bases, Amsterdam, The Netherlands (pp. 25-32).

Hirst, G., & St-Onge, D. (1998). Lexical chains as representations of context for the detection and correction of malapropisms. In C. Fellbaum (Ed.), WordNet: An electronic lexical database. MIT Press.

Jiang, J., & Conrath, D. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Taiwan (pp. 19-33).

Knappe, R., Bulskov, H., & Andreasen, T. (2006). Perspectives on ontology-based querying. International Journal of Intelligent Systems.

Lassila, O., & McGuinness, D. (2001). The role of frame-based representation on the Semantic Web (Tech. Rep. No. KSL-01-02). Stanford, CA: Knowledge Systems Laboratory, Stanford University.

Leacock, C., & Chodorow, M. (1998). Combining local context and WordNet similarity for word sense identification. In C. Fellbaum (Ed.), WordNet: An electronic lexical database. MIT Press.

Lenat, D. B. (1995). CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11), 33-39.

Lenat, D., & Guha, R. V. (1990). Building large knowledge-based systems: Representation and inference in the Cyc Project. Addison-Wesley.

Lin, D. (1997). Using syntactic dependency as local context to resolve word sense ambiguity. Paper presented at the Annual Meeting of the Association for Computational Linguistics (pp. 64-71).

Lin, D. (1998). An information-theoretic definition of similarity. Paper presented at the International Conference on Machine Learning (pp. 296-304).

Loiseau, Y., Boughanem, M., & Prade, H. (2005). Evaluation of term-based queries using possibilistic ontologies. In E. Herrera-Viedma, G. Pasi, & F. Crestani (Eds.), Soft computing for information retrieval on the Web. Springer-Verlag.

Lucarella, D. (1990). Uncertainty in information retrieval: An approach based on fuzzy sets. In Proceedings of the International Conference on Computers and Communications (pp. 809-814).

Miller, G. A. (1990). WordNet: An online lexical database. International Journal of Lexicography, 3(4).

Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39-41.

Mohammad, S., & Hirst, G. (2006). Distributional measures of concept-distance: A task-oriented evaluation. Paper presented at the Conference on Empirical Methods in Natural Language Processing (pp. 35-43).

Nilsson, J. F. (2001). A logico-algebraic framework for ontologies: ONTOLOG. In Proceedings of the First International OntoQuery Workshop, Department of Business Communication and Information Science, Kolding, Denmark (pp. 11-38).

Penev, A., & Wong, R. (2006). Shallow NLP techniques for Internet search. Paper presented at the 29th Australasian Computer Science Conference, Hobart, Tasmania, Australia (pp. 167-176).

Rada, R., Mili, H., Bicknell, E., & Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1), 17-30.

Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. Paper presented at the International Joint Conference on Artificial Intelligence (pp. 448-453).

Resnik, P. (1999). Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11, 95-130.

Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana, IL: University of Illinois Press.

Sussna, M. (1993, November). Word sense disambiguation for free-text indexing using a massive semantic network. In Proceedings of the 2nd International Conference on Information and Knowledge Management, New York, NY (pp. 67-74).

Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327-352.

Weeds, J., & Weir, D. (2005). Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics, 31(4), 439-476.

Wu, Z., & Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Morristown, NJ (pp. 133-138).

Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multicriteria decision making. IEEE Transactions on Systems, Man, and Cybernetics, 18(1), 183-190.

Yager, R. R. (2000). A hierarchical document retrieval language. Information Retrieval, 3(4), 357-377.
Key Terms

Description: A description for a text unit (document, paragraph, sentence) is the set of index terms related to it.

Ontology: An ontology specifies a conceptualization, that is, a structure of related concepts for a given domain.

Ontology-Based Querying: Evaluation of queries against a database utilizing an ontology describing the domain of the database.

Precision: The proportion of retrieved and relevant documents to all the documents retrieved.

Query Expansion: Given a similarity relation over query terms, expansion of a query refers to the addition of similar terms to the query, leading to a relaxed query and an extended answer.

Recall: The proportion of relevant documents that are retrieved out of all relevant documents available.

Similarity: Similarity refers to the nearness or proximity of concepts.

Taxonomy: A taxonomy is a hierarchical structure displaying parent-child relationships (a classification). A taxonomy extends a vocabulary and is a special case of the more general notion of an ontology.
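The Precision and Recall terms above can be computed directly from the retrieved and relevant document sets; a minimal sketch (the document identifiers are hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving {d1, d2, d3, d4} when {d2, d4, d5} are relevant gives precision 0.5 and recall 2/3, so query expansion typically trades some precision for higher recall.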
Section III
Implementation, Data Models, Fuzzy Attributes, and Applications
Chapter XIV
How to Achieve Fuzzy Relational Databases Managing Fuzzy Data and Metadata

Mohamed Ali Ben Hassine
Tunis El Manar University, Tunisia

Amel Grissa Touzi
Tunis El Manar University, Tunisia

José Galindo
University of Málaga, Spain

Habib Ounelli
Tunis El Manar University, Tunisia
Abstract

Fuzzy relational databases have been introduced to deal with uncertain or incomplete information, demonstrating the efficiency of processing fuzzy queries. For these reasons, many organizations aim to integrate flexible querying to handle imprecise data or to use fuzzy data mining tools, minimizing the transformation costs. The best solution is to offer a smooth migration towards this technology. This chapter presents a migration approach from relational databases towards fuzzy relational databases. This migration is divided into three strategies. The first one, named "partial migration," is useful basically to include fuzzy queries in classic databases without changing existing data. It needs some definitions (fuzzy metaknowledge) in order to treat fuzzy queries written in the FSQL language (Fuzzy SQL). The second one, named "total migration," offers, in addition to flexible querying, a real fuzzy database, with the possibility of storing imprecise data. This strategy requires a modification of schemas, data, and eventually programs. The third strategy is a mixture of the previous strategies, generally used as a temporary step, easier and faster than the total migration.
Introduction

New enterprise information systems are required to be flexible and efficient in order to cope with rapidly changing business environments and advancement of services. An information system that develops its structure and functionality in a continuous, self-organized, adaptive, and interactive way can use many sources of incoming information and can perform intelligent tasks such as language
learning, reasoning with uncertainty, decision making, and more. According to Bellman and Zadeh (1970), "much of the decision making in the real world takes place in an environment in which the goals, the constraints, and the consequences of possible actions are not known precisely." Management often makes decisions based on incomplete, vague, or uncertain information. In our context, the data which are processed by the application system and accumulated over the lifetime of the system may be inconsistent and may not express reality. In fact, one of the features of human reasoning is that it may use imprecise or incomplete information, and in the real world there exists a lot of this kind of fuzzy information. Hence, we can assert that in our everyday life we use several linguistic labels to express abstract concepts such as young, old, cold, hot, cheap, and so forth. Therefore, human-computer interfaces should be able to understand fuzzy information, which is very usual in many human applications. However, the majority of existing information systems deal with crisp data through crisp database systems (Elmasri & Navathe, 2006; Silberschatz, Korth, & Sudarshan, 2006). In this scenario, fuzzy techniques have proven to be successful tools for modeling such imprecise data and also for effective data retrieval. Accordingly, fuzzy databases (FDBs) have been introduced to deal with uncertain or incomplete information in many applications, demonstrating the efficiency of processing fuzzy queries even in classical or regular databases. Besides, FDBs allow storing fuzzy values, and of course, they should allow fuzzy queries using fuzzy or nonfuzzy data (Bosc, 1999; De Caluwe & De Tré, 2007; Galindo, Urrutia, & Piattini, 2006; Petry, 1996). Facing this situation, many organizations aim to integrate flexible querying to handle imprecise data or to use fuzzy data mining tools, minimizing the transformation costs.
A solution for existing (legacy) systems is migration, that is, moving the applications and the database to a new platform and new technologies. Migration of old, or legacy, systems may be an expensive and complex process. It allows legacy systems to be moved to
new environments meeting the new business requirements, while retaining the functionality and data of the original legacy systems. In this context, the migration towards FDBs, which constitutes a step towards introducing imprecise data in an information system, does not only constitute the adoption of a new technology, but also, and especially, the adoption of a new paradigm. Consequently, it constitutes a new culture of information systems development, and this book is evidence of the current interest and the promising future of this paradigm and its multiple fields. However, with significant amounts invested in the development of relational systems, in the recruitment and training of "traditional" programmers, and so forth, enterprises appear reluctant to invest large sums in the mastery of a new fuzzy paradigm. The best solution is to offer a smooth migration toward this technology, allowing them to keep the existing data, schemas, and applications, while integrating the different fuzzy concepts to benefit from fuzzy information processing. It will lower the cost of the transformations and will encourage enterprises to adopt the concept of fuzzy relational databases (FRDBs). Moreover, although the migration of information systems constitutes a very important research domain, there are only a limited number of migration methods between two specific systems. We mention some examples (e.g., Behm, Geppert, & Dittrich, 1997; Henrard, Hick, Thiran, & Hainaut, 2002; Menhoudj & OuHalima, 1996). To our knowledge, the migration of relational databases (RDB) towards FRDB has not yet been studied. FDBs allow storing fuzzy values and, besides, they allow making fuzzy queries using fuzzy or nonfuzzy data. It should be noted that classic querying is qualified as "Boolean querying," although some systems use a trivalued logic with the three values true, false, and null, where null indicates that the condition result is unknown because some data is unknown.
The user usually formulates a query with a condition, for example in SQL, which returns the list of rows for which the condition is true. This querying system constitutes a hindrance for
several applications because we cannot know if one row satisfies the query better than another row. Besides, traditional querying does not make it possible for the end user to use vague linguistic terms in the query condition or to use fuzzy quantifiers such as "almost all" or "approximately half." Many works have been proposed in the literature to introduce flexibility into database querying, both in crisp and fuzzy databases (Bosc, Liétard, & Pivert, 1998; Bosc & Pivert, 1995, 1997, 2000; Dubois & Prade, 1997; Galindo, Medina, & Aranda, 1999; Galindo, Medina, Pons, & Cubero, 1998; Galindo et al., 2006; Kacprzyk & Zadrożny, 1995, 2001; Tahani, 1977; Umano & Fukami, 1994). The essential idea in these works consists in adding an additional layer to the classic DBMS (database management system) to evaluate fuzzy predicates. In this book, the reader can find a chapter by Zadrożny, de Tré, de Caluwe, and Kacprzyk with an interesting review about fuzzy querying proposals. Also, this book includes other chapters with new applications and new advances in the field of fuzzy queries. Some examples are the chapter by Takači and Škrbić about priorities in queries, the chapter by Dubois and Prade about bipolar queries, and the chapter by Barranco, Campaña, and Medina using a fuzzy object-relational database model. Among the various published propositions for different fuzzy database models, we mention the one by Medina, Pons, and Vila (1995), who introduced the GEFRED model, an eclectic synthesis of other previous models. In 1995, Bosc and Pivert introduced the first version of a language handling flexible queries, named SQLf.
In turn, Medina, Pons, and Vila (1994b) proposed the FSQL language, which was later extended (Galindo, 1999, 2005; Galindo et al., 1998, 2006). Although the basic target of FSQL is similar to that of the SQLf language, FSQL allows fuzzy queries both in crisp and fuzzy databases, and it presents new definitions such as many fuzzy comparators, fuzzy attributes (including fuzzy time), and fuzzy constants. It allows the creation of new fuzzy objects such as labels, quantifiers, and so forth. There is another
chapter by Urrutia, Tineo, and González studying both proposals. This chapter presents a new approach for the migration from RDB towards FRDB with FSQL. The aim of this migration is to permit an easy mapping of the existing data, schemas, and programs, while integrating the different fuzzy concepts. Therefore, all valid SQL queries remain useful in the fuzzy query language FSQL (fuzzy SQL). This approach studies the RDB transformations essentially at the level of the schemas (physical and conceptual), the data, and, less specifically, the applications. First, we present a very brief overview of fuzzy sets, and then we present basic concepts about FRDB. Afterwards, we present our three migration strategies. The first one, named "partial migration," is useful only to include fuzzy queries in classic databases without changing existing data. The second one, named "total migration," offers, in addition to flexible querying, the possibility of storing imprecise data. The third strategy is a mixture of the previous strategies. Finally, we outline some conclusions and suggest some future research lines.
Introduction to Fuzzy Sets

Fuzzy set theory stems from the classic theory of sets, adding to a set a membership function, which is defined in such a way that each element is assigned a real number between 0 and 1. In 1965, Professor L. A. Zadeh defined the concept of fuzzy sets, and since then many works and applications have appeared (Pedrycz & Gomide, 1998). We give here the most basic notions; for a better introduction, read the first chapter of this handbook. A fuzzy set (or fuzzy subset) A is defined by means of a membership function µA(u), which indicates the degree to which the element u is included in the concept represented by A. The fuzzy set A over a universe of discourse U can also be represented with a set of pairs given by:

A = {µA(u)/u : u ∈ U, µA(u) ∈ [0,1]}
(1)
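As a small illustration of definition (1), a fuzzy set over a discrete universe can be stored as membership pairs, with elements outside the set implicitly having degree 0; the label Young and its degrees below are assumptions for illustration:

```python
# Fuzzy set "Young" over a discrete universe of ages, as {element: degree} pairs.
young = {20: 1.0, 25: 1.0, 30: 0.8, 35: 0.4, 40: 0.1}

def membership(fuzzy_set, u):
    """Degree mu_A(u); elements not listed belong with degree 0."""
    return fuzzy_set.get(u, 0.0)

def f_union(a, b):
    """Standard fuzzy union: mu(u) = max(mu_A(u), mu_B(u))."""
    return {u: max(a.get(u, 0.0), b.get(u, 0.0)) for u in set(a) | set(b)}
```

The max-based union shown here is the standard choice; other t-conorms are possible.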
where µ is the membership function and µA(u) is the membership degree of the element u in the fuzzy set A. If µA(u) = 0, it indicates that u in no way belongs to the fuzzy set A. If µA(u) = 1, then u belongs totally to the fuzzy set A. For example, if we consider the linguistic variable height_of_a_person, then three fuzzy subsets could be defined, identified by three labels, Short, Medium-height, and Tall, with membership functions µShort(u), µMedium-height(u), and µTall(u), respectively, where u takes values in the referential of this attribute (or underlying domain), which would be the positive real numbers (expressing the height in centimetres). On the other hand, for domains with a non-ordered referential, a similarity function can be defined, which can be used to measure the similarity or resemblance between every two elements of the domain. Usually, the similarity values are normalized in the interval [0,1], where 0 means "totally different" and 1 means "totally alike" or equal. Thus, a similarity relationship is a fuzzy relation that can be seen as a function sr, so that:

sr : D × D → [0,1]
sr(di, dj) ∈ [0,1], with di, dj ∈ D

(2)
where D is the domain of the defined labels. We can assume that sr is a symmetrical function, that is, sr(di, dj) = sr(dj, di), as this is the most usual case, although it does not necessarily have to be this way. We can also construct possibility distributions (or fuzzy sets) on the labels of D, extending the possibilities for expressing imprecise values (Zadeh, 1978), in such a way that each value di ∈ D has a degree of truth or possibility pi associated to it, obtaining expressions for specific values that can be expressed generically as:

{pi/di : pi ∈ [0,1], di ∈ D}
(3)
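A similarity relation sr as in (2), together with a possibility distribution over labels as in (3), can be sketched as follows; the hair-colour labels and the similarity values are assumptions for illustration:

```python
# Assumed similarity relation over hair-colour labels (symmetric, values in [0,1]).
SR = {("blond", "red"): 0.6, ("blond", "brown"): 0.4, ("red", "brown"): 0.5}

def sr(d1, d2):
    """Similarity between two labels: 1.0 on the diagonal, symmetric lookup otherwise."""
    if d1 == d2:
        return 1.0
    return SR.get((d1, d2), SR.get((d2, d1), 0.0))

# A possibility distribution over the labels, as in expression (3): {p_i/d_i}.
hair = {"brown": 1.0, "red": 0.4}
```

Storing only one ordering of each pair and looking it up symmetrically is a simple way to honour the assumption sr(di, dj) = sr(dj, di).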
The domains with ordered and non-ordered referentials can adequately represent concepts of
“imprecision” using fuzzy set theory. It should be noted that many of these natural concepts depend, to a greater or lesser degree, on the context and on the person who expresses them. From this simple concept, a complete mathematical and computing theory has been developed which facilitates the solution of certain problems. Fuzzy logic has been applied to a multitude of fields such as control systems, modeling, simulation, pattern recognition, information or knowledge systems (databases, expert systems, etc.), computer vision, artificial intelligence, artificial life, and so forth.
Introduction to Fuzzy Relational Databases

The first chapter of this handbook includes a brief introduction to this topic, explaining some basic models. We give here a brief overview in order to facilitate the reading of this chapter.

"The term imprecision encompasses various meanings, which might be interesting to highlight. It alludes to the facts that the information available can be incomplete (vague), that we don't know whether the information is true (uncertainty), that we are totally unaware of the information (unknown), or that such information is not applicable to a given entity (undefined). Usually, the total ignorance is represented with a NULL value. Sometimes these meanings are not disjunctive and can be combined in certain types of information." (Galindo et al., 2006, p. 45)

This imprecision was studied in order to elaborate systems, databases, and, consequently, applications which support this kind of information. Most works studying imprecision in information have used possibility, similarity, and fuzzy techniques. The research on FDBs has been developed for about 20 years and has concentrated mainly on the following areas: flexible querying in classical databases, extending classical data models in order to achieve fuzzy databases (including, of course, fuzzy queries on these fuzzy databases and fuzzy conceptual modeling tools), fuzzy data mining
techniques, and applications of these advances in real databases. All these different issues have been studied in different chapters of this volume and also in many other publications (De Caluwe & De Tré, 2007; Bosc, 1999; Bosc et al., 1998; Galindo et al., 2006; Petry, 1996). The querying of an FRDB, contrary to classical querying, allows users to use fuzzy linguistic labels (also named linguistic terms) and express their preferences to better qualify the data that they wish to get. An example of a flexible query, also named in this context a fuzzy query, would be "list of the young employees, well paid and working in a department with a big budget." This query contains the fuzzy linguistic labels "young," "well paid," and "big budget." These labels are words, in natural language, that express or identify a fuzzy set that may or may not be formally defined. In fact, the flexibility of a query reflects the preferences of the end user. This is manifested by using a fuzzy set representation to express a flexible selection criterion. The extent to which an object in the database satisfies a request then becomes a matter of degree. The end user provides a set of attribute values (fuzzy labels), which are fully acceptable for the user, and a list of minimum thresholds for each of these attributes. With these elements, a fuzzy condition is built for the fuzzy query. Then, the fuzzy querying system ranks the answer items according to their fulfillment degree or level of acceptability. Some approaches, the so-called bipolar queries, need both the fuzzy condition (or fuzzy constraint) and the positive preferences or wishes, which are less compulsory. (A very interesting chapter about bipolar queries, by Dubois and Prade, may be found in this volume.) Hence, the interests of fuzzy queries for a user are twofold:

1. A better representation of the user's preferences, while allowing the use of imprecise predicates.
2. Obtaining the necessary information in order to rank the answers contained in the database according to the degree to which they satisfy the query. This contributes to avoiding empty sets of answers when queries are too restrictive, as well as too large sets of answers without any ordering when queries are too permissive.

This preface leads us to the definition of FRDB as an extension of RDB. This extension introduces fuzzy predicates or fuzzy conditions in the shape of linguistic expressions that, in flexible querying, permit a range of answers (each one with its membership degree) in order to offer the user all intermediate variations between the completely satisfactory answers and the completely unsatisfactory ones (Bosc et al., 1998). Yoshikane Takahashi (1993, p. 122) defined an FRDB as "an enhanced RDB that allows fuzzy attribute values and fuzzy truth values; both of these are expressed as fuzzy sets." Thus, a fuzzy database is a database which is able to deal with uncertain or incomplete information using fuzzy logic. There are many forms of adding flexibility in fuzzy databases. The simplest technique is to add a fuzzy membership degree to each record, that is, an attribute in the range [0,1]. However, there are other kinds of databases allowing fuzzy values to be stored in a fuzzy attribute, using fuzzy sets, possibility distributions, or fuzzy degrees associated to some attributes and with different meanings (membership degree, importance degree, fulfillment degree, etc.). The main models are those of Prade-Testemale (1987), Umano-Fukami (Umano, 1982; Umano & Fukami, 1994), Buckles-Petry (1982), Zemankova-Kandel (1985), and GEFRED by Medina-Pons-Vila (1994a). This chapter deals mainly with the GEFRED model (GEneralised model for Fuzzy RElational Databases) and some later extensions (Galindo et al., 2006). This model constitutes an eclectic synthesis of the various models published so far, with the aim of dealing with the problem of representation and treatment of fuzzy information by using RDB.
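The ranking behaviour described above can be sketched as follows; the label definitions (Young, Well-paid), the threshold, and the sample rows are assumptions, and the conjunction is interpreted as the minimum, one common choice:

```python
def mu_young(age):
    """Assumed label: fully Young up to 25, not Young from 40, linear in between."""
    if age <= 25:
        return 1.0
    if age >= 40:
        return 0.0
    return (40 - age) / 15.0

def mu_well_paid(salary):
    """Assumed label: fully Well-paid from 60000, not at all below 30000."""
    if salary >= 60000:
        return 1.0
    if salary <= 30000:
        return 0.0
    return (salary - 30000) / 30000.0

def fuzzy_query(employees, threshold=0.5):
    """Rank rows by fulfillment degree of 'young AND well paid' (AND as min)."""
    results = []
    for name, age, salary in employees:
        degree = min(mu_young(age), mu_well_paid(salary))
        if degree >= threshold:
            results.append((name, degree))
    return sorted(results, key=lambda r: -r[1])
```

Each returned row carries its fulfillment degree, so the answer is ranked rather than the flat true/false set a Boolean condition would produce.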
One of the major advantages of this model is that it consists of a general abstraction that allows for the use of various approaches,
Table 1. Data types in the GEFRED model

1. A single scalar (e.g., Behavior = Good, represented by the possibility distribution 1/Good).
2. A single number (e.g., Age = 28, represented by the possibility distribution 1/28).
3. A set of mutually exclusive possible scalar assignations (e.g., Behavior = {Bad, Good}, represented by {1/Bad, 1/Good}).
4. A set of mutually exclusive possible numeric assignations (e.g., Age = {20, 21}, represented by {1/20, 1/21}).
5. A possibility distribution in a scalar domain (e.g., Behavior = {0.6/Bad, 1.0/Regular}).
6. A possibility distribution in a numeric domain (e.g., Age = {0.4/23, 1.0/24, 0.8/25}; fuzzy numbers or linguistic labels).
7. A real number belonging to [0, 1], referring to a degree of matching (e.g., Quality = 0.9).
8. An UNKNOWN value with possibility distribution {1/u : u ∈ U}, where U is the considered domain.
9. An UNDEFINED value with possibility distribution {0/u : u ∈ U}, where U is the considered domain.
10. A NULL value, given by NULL = {1/Unknown, 1/Undefined}.
regardless of how different they might look. In fact, it is based on the generalized fuzzy domain and the generalized fuzzy relation, which respectively include classic domains and classic relations. The original data types supported by this model are shown in Table 1.
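The ignorance-related types at the end of Table 1 can be written directly as possibility distributions; the following is a sketch over finite universes (actual GEFRED implementations represent these values more compactly):

```python
def unknown(universe):
    """Type 8: UNKNOWN = {1/u : u in U} -- every value is fully possible."""
    return {u: 1.0 for u in universe}

def undefined(universe):
    """Type 9: UNDEFINED = {0/u : u in U} -- no value is possible (not applicable)."""
    return {u: 0.0 for u in universe}

# Type 10: NULL = {1/Unknown, 1/Undefined} -- total ignorance.
NULL = {"Unknown": 1.0, "Undefined": 1.0}

# Type 5 example from Table 1: a possibility distribution in a scalar domain.
behavior = {"Bad": 0.6, "Regular": 1.0}
```

The contrast between the two helpers makes the semantics explicit: UNKNOWN says "some value applies but we do not know which," while UNDEFINED says "no value applies at all."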
Preliminary Concepts

In order to implement a system which represents and manipulates "imprecise" information, Medina et al. (1995) developed the FIRST (fuzzy interface for relational systems) architecture, which has been enhanced with FIRST-2 (Galindo, Urrutia, & Piattini, 2004b, 2006). It has been built on DBMS client-server architectures, such as Oracle and PostgreSQL (Galindo, 2007; Maraboli & Abarzua, 2006). It extends the existing structure and adds new components to handle fuzzy information. This architecture adds a server, named the FSQL server, which translates flexible queries written in FSQL into a language comprehensible to the host DBMS (SQL). FSQL is an extension of the popular SQL language that expresses fuzzy characteristics, especially in fuzzy queries, with many fuzzy concepts (fuzzy conditions, fuzzy comparators, fulfillment degrees, fuzzy constants, fuzzy quantifiers, fuzzy attributes, etc.). The first
versions of FSQL were developed during the last decade of the 20th century (Galindo et al., 1998; Medina et al., 1994b), and the most recent version is defined by Galindo et al. (2006). In the following subsections, we present this language and the supported fuzzy attribute types. The RDBMS (relational DBMS) dictionary or catalog, which represents the part of the system allowing the storage of information about the data collected in the database and other information (such as users, data structures, data control, etc.), is extended in order to collect the necessary information related to the imprecise nature of the new data to be processed (fuzzy attributes, their types, their objects such as labels, quantifiers, etc.). This extension, named the fuzzy metaknowledge base (FMB), is organized following the prevailing philosophy of the host RDBMS catalog. In this chapter, we designate by fuzzy RDBMS (FRDBMS) the addition of the FSQL server and the FIRST-2 methodology to the RDBMS.
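To give a flavour of the kind of rewriting such a server performs, the following sketch compiles a trapezoidal fuzzy condition on a crisp column into an ordinary SQL CASE expression that computes the fulfillment degree. The function, its parameters, and the emitted SQL are illustrative assumptions; the real FSQL server's translation scheme is considerably more elaborate:

```python
def trapezoid_degree_sql(column, a, b, c, d):
    """Emit a SQL CASE expression for the degree of `column` under the
    trapezoidal label defined by points a <= b <= c <= d (a hypothetical
    simplification of what a fuzzy-query translator might generate)."""
    return (
        f"CASE WHEN {column} <= {a} OR {column} >= {d} THEN 0 "
        f"WHEN {column} BETWEEN {b} AND {c} THEN 1 "
        f"WHEN {column} < {b} THEN ({column} - {a}) / ({b} - {a}) "
        f"ELSE ({d} - {column}) / ({d} - {c}) END"
    )
```

The generated expression can then be used in a SELECT list or an ORDER BY clause, so that a standard RDBMS evaluates and ranks by the degree without any fuzzy extension of its own.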
Fuzzy Attributes

In order to model fuzzy attributes, we distinguish between two classes: fuzzy attributes whose values are fuzzy sets (or possibility distributions) and fuzzy attributes whose values are fuzzy degrees. Each class includes some different fuzzy data types (Galindo et al., 2006; Urrutia, Galindo, & Piattini, 2002).
Fuzzy Sets as Fuzzy Values
These fuzzy attributes may be classified into four data types, according to the type of referential or underlying domain. In all of them, the values Unknown, Undefined, and Null are included:

• Fuzzy Attributes Type 1 (FTYPE1): These are attributes with "precise data," classic or crisp (traditional, with no imprecision). However, we can define linguistic labels over them, and we can use them in fuzzy queries. This type of attribute is represented in the same way as precise data, but it can be transformed or manipulated using fuzzy conditions. This type is useful for extending a traditional database, allowing fuzzy queries to be made over classic data, for example, queries of the kind "Give me employees that earn a lot more than the minimum salary."
• Fuzzy Attributes Type 2 (FTYPE2): These are attributes that gather "imprecise data over an ordered referential." As Table 2 shows, these attributes admit both crisp and fuzzy data, in the form of possibility distributions over an underlying ordered domain (fuzzy sets). This type extends Type 1 by allowing the storage of imprecise information, such as "he is approximately 2 metres tall." For the sake of simplicity, the most complex of these fuzzy sets are assumed to be trapezoidal functions (Figure 1).
• Fuzzy Attributes Type 3 (FTYPE3): These are attributes over "data of a discrete non-ordered domain with analogy." For these attributes, some labels are defined (e.g., "blond," "red," "brown," etc.): scalars with a similarity (or proximity) relationship defined over them, indicating to what extent each pair of labels resemble each other. These attributes also allow possibility distributions (or fuzzy sets) over this domain, for example, the value (1/dark, 0.4/brown), which expresses that a certain person is more likely to be dark than brown-haired. Note that the underlying domain of these fuzzy sets is the set of labels, and this set is non-ordered.
• Fuzzy Attributes Type 4 (FTYPE4): These attributes are defined in the same way as Type 3 attributes, without it being necessary for a similarity relationship to exist between the labels.

Figure 1. Trapezoidal, linear, and normalized distribution function (over the universe U, membership rises from 0 at a to 1 at b, stays at 1 until c, and falls back to 0 at d)
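For illustration (our own sketch, not FSQL code), the trapezoidal function of Figure 1 and an "approximately 2 metres" Type 2 value can be computed as:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal, linear, normalized membership function (Figure 1):
    0 outside [a, d], rises linearly on [a, b], is 1 on [b, c],
    and falls linearly on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# "Approximately 2 metres tall" as a triangular (degenerate trapezoidal)
# value with margin 0.1 (margin value chosen here for illustration):
print(trapezoid(2.0, 1.9, 2.0, 2.0, 2.1))   # 1.0
print(trapezoid(1.95, 1.9, 2.0, 2.0, 2.1))  # 0.5
```

A triangular value is simply the special case b = c, which is how approximate values (Table 2, numbers 6 and 8) reduce to the trapezoidal representation.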
Fuzzy Degrees as Fuzzy Values

The domain of these degrees is found in the interval [0,1], although other values are also permitted, such as a possibility distribution (usually over this unit interval). The meaning of these degrees varies and depends on their use, and the processing of the data differs accordingly. The most important meanings of the degrees used by some authors are the fulfillment degree, uncertainty degree, possibility degree, and importance degree. The most typical kind of degree is one associated to each tuple in a relation (Type 7), with the meaning of the membership degree of that tuple to the relation. Another typical degree is the fulfillment degree associated to each tuple in the resulting relation after a fuzzy query. In this volume, there are some chapters about these kinds of relations (see, for example, the ranked tables in the chapter by Belohlavek and Vychodil or the fulfillment degrees in the chapter by Voglozin, Raschia, Ughetto and Mouaddib).
Sometimes it is useful to associate a fuzzy degree to only one attribute (Type 5) or to a concrete set of attributes (Type 6), for example, in order to measure the truth, the importance, or the vagueness. Finally, in some applications, a fuzzy degree with its own fuzzy meaning (Type 8) is useful in order to measure a fuzzy characteristic of each item in the relation, such as the danger of a medicine or the brightness of a concrete material.
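As a hypothetical illustration of a Type 7 degree (names and data are ours, not from FSQL), a relation can carry a per-tuple membership degree that is combined with a query's fulfillment degree:

```python
# A relation with a per-tuple membership degree (Type 7) and a fuzzy
# selection that attaches a fulfillment degree to each answer.
employees = [
    # (name, salary, membership degree of the tuple in the relation)
    ("Ann", 900, 1.0),
    ("Bob", 2400, 0.8),
    ("Eve", 4000, 0.6),
]

def high_salary(s):
    """Degree to which a salary counts as 'high' (ramp between 2000 and 3000)."""
    if s <= 2000:
        return 0.0
    if s >= 3000:
        return 1.0
    return (s - 2000) / 1000.0

# Combine (min t-norm) the query degree with the tuple's own degree,
# keeping only tuples with a positive overall degree:
result = [(n, min(mu, high_salary(s))) for n, s, mu in employees
          if min(mu, high_salary(s)) > 0]
print(result)  # [('Bob', 0.4), ('Eve', 0.6)]
```

The choice of min as the combining t-norm mirrors the default connective mentioned later for FSQL's CDEG function.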
Representation of Fuzzy Attributes

The representation differs according to the fuzzy attribute type. Fuzzy attributes of Type 1 are represented as usual attributes because they do not allow fuzzy values. Fuzzy attributes of Type 2 need five (or more) classic attributes: one stores the kind of value (Table 2), and the other four store the crisp values representing the fuzzy value.

Table 2. Kind of values of fuzzy attributes Type 2

Number | Kind of value
0, 1, 2 | UNKNOWN, UNDEFINED, NULL
3 | Crisp: d
4 | Label: label_identifier
5 | Interval: [n,m]
6 | Approximate value: d
7 | Trapezoidal value: [a,b,c,d]
8 | Approx. value with explicit margin: d±m
9, 10, 11, 12 | Possibility distributions (different formats)

Table 3. Kind of values of fuzzy attributes Types 3 and 4

Number | Kind of value
0, 1, 2 | UNKNOWN, UNDEFINED, NULL
3 | Simple value (one degree/label pair)
4 | Possibility distribution (n degree/label pairs)

Note
in Table 2 that trapezoidal fuzzy values (Figure 1) need the other four values. An approximate value (approximately d, i.e., d±margin) is represented with a triangular function centered in d (degree 1) and with degree 0 in d–margin and d+margin, where the value of margin depends on the context, as we will see later. Other approximate values (number 8) use their own explicit margin m. Finally, we can also represent possibility distributions in Type 2 attributes. Some of them (numbers 9 and 10) use only the four attributes defined previously, but we define here two new and more flexible possibilities:

• Number 11: Discontinuous possibility distribution, given as a list of points with the format p1/v1, …, pn/vn, where the pi are the possibility degrees and the vi are the values with such degrees. Note that we need 2n attributes (instead of four) for storing a possibility distribution with n terms. The rest of the values have a degree of zero.
• Number 12: Continuous possibility distribution, given as a list of points with the same format p1/v1, …, pn/vn, again requiring 2n attributes for n terms. Now the stored possibility distribution represents a continuous piecewise linear function: between vi and vi+1 there is a straight line joining each two consecutive points.
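The two storage formats can be contrasted with a small sketch (ours, not FSQL code); taking the degree to be zero outside the listed range for format 12 is our assumption:

```python
# Format 11 (discontinuous): degrees only at the listed values.
# Format 12 (continuous): linear interpolation between consecutive points.
points = [(0.3, 10.0), (1.0, 20.0), (0.5, 30.0)]  # (p_i, v_i), v_i ascending

def poss_discontinuous(x, pts):
    for p, v in pts:
        if v == x:
            return p
    return 0.0  # all other values have degree zero

def poss_continuous(x, pts):
    """Straight line between consecutive (p_i, v_i) points; the degree
    outside [v_1, v_n] is taken as 0 here (an assumption)."""
    for (p0, v0), (p1, v1) in zip(pts, pts[1:]):
        if v0 <= x <= v1:
            return p0 + (p1 - p0) * (x - v0) / (v1 - v0)
    return 0.0

print(poss_discontinuous(20.0, points))          # 1.0
print(poss_discontinuous(15.0, points))          # 0.0 (not a listed value)
print(round(poss_continuous(15.0, points), 2))   # 0.65 (interpolated)
```

Either way, the n pairs occupy the 2n classic attributes described in the text.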
Fuzzy attributes of Type 3 need 2n+1 attributes: one stores the kind of value (Table 3), and the others (2n) may store a possibility distribution, where n is the maximum number of (degree/label) elements. Note in Table 3 that number 3 needs only two values, but number 4 needs 2n values. The value n must be defined for each fuzzy attribute of Type 3, and it is stored in the FMB (see the following section). Fuzzy attributes of Type 4 are represented just like Type 3; the difference between them is shown in the next section. Fuzzy degrees (Types 5, 6, 7, and 8) are represented using a classic numeric attribute because their domain is the interval [0,1].
The FSQL Language

The FSQL language (Galindo, 2005; Galindo et al., 2006; Galindo, Aranda, Caro, Guevara, & Aguayo, 2002) is a true extension of SQL that allows fuzzy data manipulation, such as fuzzy queries. This means that all valid statements in SQL are also valid in FSQL. In addition, FSQL incorporates some novelties to permit the inexact processing of information. This chapter provides only a summary of the main extensions added to the language:

• Linguistic labels: If an attribute is capable of fuzzy treatment, then linguistic labels can be defined on it. These labels are preceded with the symbol $ to distinguish them easily. There are two types of labels, used with different fuzzy attribute types:
1. Labels for attributes with an ordered underlying domain (Fuzzy Attributes Types 1 and 2): every label of this type has an associated trapezoidal possibility distribution in the FMB. This possibility distribution is generally trapezoidal, linear, and normalized, as shown in Figure 1.
2. Labels for attributes with a non-ordered fuzzy domain (Fuzzy Attributes Types 3 and 4): here, a similarity relation may be defined between each pair of labels in the domain, and it is stored in the FMB.
• Fuzzy comparators: Besides the typical comparators (=, >, etc.), FSQL includes all the fuzzy comparators shown in Table 4. As in SQL, fuzzy comparators compare one column with one constant or two columns of the same (or compatible) type. As possibility comparators are more general (less restrictive) than necessity comparators, necessity comparators retrieve fewer tuples, and these tuples necessarily comply with the condition (whereas with possibility comparators, the tuples only possibly comply with the condition, without any absolute certainty). It is necessary to note that fuzzy attributes of Type 2 can be compared with crisp values, but always using the FSQL language.
• Function CDEG: The function CDEG (compatibility degree) may be used with an attribute as its argument. It computes the fulfillment degree of the condition of the query for the specific attribute in the argument. We can use CDEG(*) to obtain the fulfillment degree of each tuple (with all of its attributes, not just one of them) in the condition. If logic operators (NOT, AND, OR) appear in the condition, the calculation of this compatibility degree is carried out, by default, using the traditional negation, the minimum t-norm,
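As an illustration of the default connectives only (our own sketch, not FSQL internals), CDEG for a compound condition combines per-attribute degrees with min for AND, max for OR, and 1−x for NOT:

```python
# Default connectives for combining per-attribute compatibility degrees:
# minimum t-norm for AND, maximum t-conorm for OR, 1 - x for NOT.
def AND(*ds):
    return min(ds)

def OR(*ds):
    return max(ds)

def NOT(d):
    return 1.0 - d

# Suppose the fuzzy comparisons yielded these degrees (invented values):
cdeg_salary = 0.7   # e.g. the degree of Salary FEQ $High for one tuple
cdeg_age    = 0.4   # e.g. the degree of Age FGT $Young for the same tuple

# A CDEG(*)-style degree for "Salary FEQ $High AND NOT (Age FGT $Young)":
print(AND(cdeg_salary, NOT(cdeg_age)))  # 0.6
```

The per-attribute degrees themselves come from the fuzzy comparators of Table 4; only their combination is shown here.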
Table 4. Fuzzy comparators for FSQL (Fuzzy SQL): 16 in the Possibility/Necessity family and 2 in the Inclusion family

Possibility | Necessity | Meaning
FEQ or F= | NFEQ or NF= | Possibly/Necessarily Fuzzy Equal to…
FDIF, F!= or F<> | NFDIF, NF!= or NF<> | Possibly/Necessarily Fuzzy Different to…
FGT or F> | NFGT or NF> | Possibly/Necessarily Fuzzy Greater Than…
FGEQ or F>= | NFGEQ or NF>= | Possibly/Necessarily Fuzzy Greater or Equal than…
FLT or F< | NFLT or NF< | Possibly/Necessarily Fuzzy Less Than…
FLEQ or F<= | NFLEQ or NF<= | Possibly/Necessarily Fuzzy Less or Equal than…
MGT or F>> | NMGT or NF>> | Possibly/Necessarily Much Greater Than…
MLT or F<< | NMLT or NF<< | Possibly/Necessarily Much Less Than…

:= BEGIN <Submodel Definition>* <Main Model Definition> <Execution Section> END
The submodel definition is a complete definition or a file containing the model definition. The main model is a special model that is capable of accessing variables from other models. The execution section describes how commands and queries defined in models should be executed.
Model Definition

The syntax for a model definition is shown in Exhibit 1.
Data Interfaces

The two following clauses define the input and output interfaces, respectively. The <Extern name> depends on the interface type; for example, if it is a text file, it is the file name. (See Exhibit 2.)

For linguistic terms, alpha-cuts are used to indicate that some membership values are below a threshold of interest. An alpha-cut value always belongs to [0,1]. In practice, the use of an alpha-cut modifies the membership function curve, creating a kind of plateau. A triangle (0,0.5,1) with alpha-cut 0.5 will actually be the membership function of Figure 4.

:= Create hedge <Hedge Name> AS <Curve Type> <Value>;
The curve types are defined in the previous sections. The keywords are: CONCENTRATOR, INTENSIFICATION, and DILATION.
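The usual textbook forms of these three curve modifiers can be sketched as follows; whether CLOUDS uses exactly these formulas and this parameterization is our assumption, not stated in the text:

```python
# Common textbook (Zadeh-style) hedge curves over a membership value mu.
def concentrator(mu, p=2.0):
    """Squeezes the curve toward 0 (e.g. 'very'): mu ** p with p > 1."""
    return mu ** p

def dilation(mu, p=2.0):
    """Widens the curve (e.g. 'somewhat'): mu ** (1/p)."""
    return mu ** (1.0 / p)

def intensification(mu, p=2.0):
    """Raises contrast: pushes memberships toward 0 or 1."""
    if mu <= 0.5:
        return (2 ** (p - 1)) * mu ** p
    return 1 - (2 ** (p - 1)) * (1 - mu) ** p

# "Very" as a concentrator with exponent 3, as in "Create hedge Very AS
# concentrator 3.0" later in this chapter:
very = lambda mu: concentrator(mu, 3.0)
print(round(very(0.8), 3))  # 0.512
```

With these forms, a value that is "High" to degree 0.8 is "Very High" only to degree about 0.512, which matches the intended stricter reading.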
Exhibit 2.

<Input Interface> := Create Table <Table Name> FROM <Extern name> <Interface Type> [<…>];
<Output Interface> := Create Table <Table Name> AS <Extern name> <Interface Type>;
<Interface Type> := ARQTEXT | ODBC | GIS

Component Definition

<LingTerm Definition> := Create LingTerm <Term Name> AS <Membership Function Curve> [ALPHA <Alpha Value>];
<Membership Function Curve> := TRIANGLE(<Begin>, <Middle>, <End>) | Gauss(<Begin>, <Middle>, <End>) | Trapez(<Begin>, <Middle1>, <Middle2>, <End>) | CUSTOM(<…>)
Figure 4. The membership function for "CREATE LINGTERM average AS TRIANGLE(0,0.5,1) ALPHA 0.5". The original triangular function assumes the shape of a "house" (a pentagon pointing up).
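The alpha-cut clipping can be sketched as follows; this is our own illustration, reading Figure 4 as discarding (zeroing) memberships that fall below alpha:

```python
# Applying an alpha-cut to a triangular linguistic term, as in
# "CREATE LINGTERM average AS TRIANGLE(0,0.5,1) ALPHA 0.5":
# memberships below alpha are set to 0, turning the triangle into
# the "house" shape of Figure 4.
def triangle(x, left, peak, right):
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def with_alpha_cut(mu, alpha):
    return lambda x: mu(x) if mu(x) >= alpha else 0.0

average = with_alpha_cut(lambda x: triangle(x, 0.0, 0.5, 1.0), 0.5)
print(average(0.5))  # 1.0 (peak unchanged)
print(average(0.3))  # 0.6 (above alpha, kept as-is)
print(average(0.2))  # 0.0 (membership 0.4 < alpha, discarded)
```

The vertical "walls" of the house appear exactly where the original membership crosses the alpha threshold.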
Possible types are: Integer, Real, and String.5 The "domain" part defines, optionally, the bounds of the referential set, while "data" refers to the table and attribute to be used as domain. Their syntax is:

<Domain Clause> := DOMAIN <Min Value>, <Max Value>
<Data Clause> := OVER <Table Name> AS <Attribute Name>
Fuzzy Variables

The fuzzy variables may be either integer, real, or alphanumeric values, unless they are fuzzy numbers, in which case alphanumeric values do not apply. The domain indicates the minimum and maximum values and must be supplied if the data interface is not capable of giving this information. On the other hand, if the data interface is not supplied, the system creates a solution variable to be used later in some rule. (See Exhibit 3.)

Connecting Hedges and Linguistic Terms
The USE command associates hedges with terms and terms with variables, respectively:

Use HEDGE <Hedge Name> IN <Term Name>;
Use LingTerm <Term Name> IN <Variable Name>;
Rule Definition

Rules in the if-then form are easily defined. As described above, CLOUDS uses the Mamdani implication and maximum value as default functions. (See Exhibit 4.)
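The default inference can be sketched on a discretized domain as follows (our own simplified illustration with a single input variable; CLOUDS's engine is more general):

```python
# Mamdani-style inference: min implication (clipping), max aggregation,
# centroid defuzzification, on a discretized [0, 1] domain.
N = 101
xs = [i / (N - 1) for i in range(N)]

def tri(x, a, b, c):
    """Triangular membership TRIANGLE(a, b, c); handles a == b or b == c."""
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def infer(rules, crisp_in):
    """rules: list of (antecedent_mu, consequent_term) pairs, where each
    consequent term is discretized over xs. Returns a crisp output."""
    agg = [0.0] * N
    for ant_mu, cons in rules:
        w = ant_mu(crisp_in)                       # firing strength
        for i in range(N):
            agg[i] = max(agg[i], min(w, cons[i]))  # clip (min), aggregate (max)
    num = sum(x * m for x, m in zip(xs, agg))
    den = sum(agg)
    return num / den if den else 0.0               # centroid defuzzification

low  = lambda v: tri(v, 0.0, 0.0, 1.0)             # cf. TRIANGLE(0, 0, 1)
high = lambda v: tri(v, 0.0, 1.0, 1.0)             # cf. TRIANGLE(0, 1, 1)
stand    = [tri(x, 0.0, 0.0, 1.0) for x in xs]
increase = [tri(x, 0.0, 1.0, 1.0) for x in xs]

# Simplified single-input version of the visits rules: a low ratio of
# visited houses suggests increasing visits, a high one suggests standing.
rules = [(low, increase), (high, stand)]
print(round(infer(rules, 0.2), 3))  # a value well above 0.5: increase visits
```

With input 0.2 (few houses visited), the "increase" rule fires strongly and the centroid lands above 0.5; with input 0.9 it lands below 0.5.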
The fuzzy query has the general form:

QUERY SELECT <Attribute List> FROM <Table Name> TO <Output Interface> [WHERE <Proposition>] [WITH <Membership Value>]
Queries can be named so that they can be referenced later, working like a view. We decided not to use the SQL "CREATE VIEW" syntax to avoid misleading the user into thinking we had implemented a full view; a named query is only a handle to an object containing the definition of the query. The main differences from standard SQL are the TO and WITH clauses. The TO clause directs output data to several different output interfaces, such as spatial databases and GIS systems. As an external device of the database system, the traditional cursor approach to querying databases is unsuitable, because the retrieved data do not have the same data format; in other words, when we query a database, the results are fuzzy values and not selected portions of the database. The WITH clause allows us to define a minimum membership value for the proposition evaluated in the WHERE clause.

Execution

EXECMODEL <Model Name>.<Command Name>;
EXECQUERY <Model Name>.<Query Name>;
SELECT <Attribute List> FROM <Table Name> TO <Output Interface>
A Tool for Fuzzy Reasoning and Querying
[WHERE <Proposition>] [WITH <Membership Value>]
The execution part defines which queries should be executed and in which order.
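The effect of the WITH clause can be illustrated with a small sketch (ours; names and data are invented):

```python
# The WITH clause acts as a minimum-membership filter on the rows
# produced by evaluating the fuzzy WHERE proposition.
rows = [("Gurupi", 0.9), ("Peixe", 0.25), ("Palmas", 0.05)]  # (county, membership)

def with_clause(rows, threshold):
    """Keep only rows whose WHERE-proposition membership >= threshold."""
    return [(name, mu) for name, mu in rows if mu >= threshold]

print(with_clause(rows, 0.1))  # [('Gurupi', 0.9), ('Peixe', 0.25)]
```

This mirrors the 10% and 30% thresholds used in the example queries later in the chapter (WITH 0.1, WITH 0.3).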
Example: Enhancing a GIS Application

We coupled our system to an existing GIS, known as GISEpi (Nobre et al., 1997), which was modeling a malaria surveillance system using data from the Tocantins state in Northern Brazil. Malaria is an infectious disease that is widespread in tropical and subtropical regions; it affects 400 million people every year, killing between one and three million. Its symptoms include fever, shivering, joint pain, vomiting, anemia, and convulsions. It can also cause cognitive impairment, especially in children, as a long-term effect. It is considered an enormous public-health problem. Since there is no vaccine available, preventive and prophylactic medicine is the only way to combat it, and surveillance systems become extremely important. Malaria is transmitted by mosquitoes carrying one of four species of the Plasmodium protozoan: falciparum, malariae, vivax, and ovale. The diseases caused by them vary in intensity and form. In our example, we model two of them: falciparum and vivax. The Plasmodium falciparum infection is the most lethal one. It is found in tropical or subtropical zones, and it displays intermittent symptoms. Plasmodium vivax continues reproducing inside the host even after the fever has disappeared, making recurrence after long periods likely. We acquired the data used here in the Tocantins6 state, which is divided into counties. The localities are the source of individual data; they are classified as cities, farms, villages, and so forth. Each one may contain a health post, which is responsible for the treatment of patients and for the data collected by the health agents who go to the other localities.
396
The surveillance and control of malaria in Tocantins follow a methodology starting with data acquisition and passing through analysis and control. There are two kinds of data acquisition: active and passive search. In the former, health agents go to the research area (localities), and a form is filled in for each patient. In the latter, the data are collected from the health posts. Data are aggregated by locality in a form that is analyzed by health agents, who specify suitable control activities, such as increasing the number of visits.
Modeling

To model the system, we use the positive cases of Plasmodium falsiparum, Plasmodium vivax, or a combination of both. In addition, we look at the total number of positive cases, the number of existing houses, and the number of examined houses in the research area. These data are stored in a GISEpi file containing the following columns:

LAMEXA – number of examined patients.
LAMPOSF – number of positive cases of Plasmodium falsiparum.
LAMPOSV – number of positive cases of Plasmodium vivax.
LAMPOSFV – number of positive cases of both Plasmodium falsiparum and vivax.
LAMPOSTOT – total number of positive cases with any of the four Plasmodium species.
CASEXI – total number of houses within the research area.
CASVIS – total number of houses visited within the research area.

In the following text, the variables above are analyzed to evaluate two important surveillance metrics: one that indicates whether the number of visits by health agents (VISITS) must be increased, and another that determines whether there is a possibility of the disease recurring (RETURN).
Figure 5. The partial map of Tocantins (Southern region) with some data superposed. The small map at the bottom left corner shows the position of Tocantins state in Brazil.
Defining Data Interfaces
Program 3. Interface Definition
We define an input data interface (TOCANTINS) that corresponds to the data collected in the Tocantins District. Each record from the input table corresponds to a county. The output data interface (MAPA) defines the table that stores the results. The presentation mode, graphics, or text will be determined by the implementation of the output interface of type GIS. We define two more output tables (Program 3), one for each of the solution variables. These tables will be generated in standard text files (see Box 1).
Variables

We use abbreviated names similar to the variables described above to define the linguistic variables (Program 4). The domain of the solution variables (Program 5) is [0,1] (see Box 2).
Program 4. Definition of Basic Variables

See Box 3.
Box 1.
Create Table Tocantins FROM tab1 GIS;
Create Table Mapa AS mapa.dat GIS;
Create Table Visits AS Visits.dat ARQTEXT;
Create Table Return AS epidem.dat ARQTEXT;
LaminasExaminadas OVER Tocantins AS LAMEXA; LamPosfalsiparum OVER Tocantins AS LAMPOSF; LamPosvivax OVER Tocantins AS LAMPOSV; LamPosMalariae OVER Tocantins AS LAMPOSM; LPfalsiparumvivax OVER Tocantins AS LAMPOSFV; LamPositivasTotal OVER Tocantins AS LAMPOSTOT; CasasExistentes OVER Tocantins AS CASEXI; CasasVisitadas OVER Tocantins AS CASVIS;
Box 3. Create SOLVAR Real Visits DOMAIN 0,1 OVER Tocantins AS VISITS; Create SOLVAR Real Return DOMAIN 0,1 OVER Tocantins AS RETURN;
Box 4.
Create LingTerm Low AS TRIANGLE(0.0, 0.0, 1.0);
Create LingTerm Medium AS TRIANGLE(0.0, 0.5, 1.0);
Create LingTerm High AS TRIANGLE(0.0, 1.0, 1.0);
Create LingTerm Stand AS TRIANGLE(0.0, 0.0, 1.0);
Create LingTerm Increase AS TRIANGLE(0.0, 1.0, 1.0);
Create LingTerm Unlikely AS TRIANGLE(0.0, 0.0, 1.0);
Create LingTerm Likely AS TRIANGLE(0.0, 1.0, 1.0);
Program 5. Definition of Solution Variables
Program 6. Definition of Linguistic Terms
Linguistic Terms
See Box 5.
The common Low, Medium, and High linguistic terms are used to describe all input variables. Two pairs of terms are created to use in the solution variables. One pair indicates whether the number of visits should Stand as it is or Increase, and the other pair indicates the possibility of disease recurrence (see Box 4).
Program 7. Connection Between Variables and Terms
To define stricter terms, we create and assign the hedge “very” to each one of the standard terms (Program 8) (see Box 6.)
Box 5. USe Lingterm Low In LaminasExaminadas; USe Lingterm Medium In LaminasExaminadas ; USe Lingterm High In LaminasExaminadas ; USe Lingterm Low In LamPosfalsiparum ; USe Lingterm Very-Low in LamPositivasTotal;
Box 6. Create hedge Very AS concentrator 3.0; USE hedge Very IN Low; USE hedge Very IN Medium; USe hedge Very IN High;
Box 7. IF CasasVisitadas IS Low THEN Visits IS Increase; IF LamPositivasTotal IS Low THEN Visits IS Stand;
Program 8. Hedges Definition

The Rules Model

Two common-sense rules are enough to describe our solution in each of the two examples.7 In the first case, we should discover whether the number of visits should stand as it is or increase. The rules of thumb for this case are:

"IF the rate of positive cases to the total number of patients IS high, THEN the number of visited houses should increase"

"IF the rate of positive cases to the total number of patients IS low, THEN the number of visited houses should stand"
Program 9. Active Variables Relative to Visited Houses

The second case tries to establish the likelihood of disease recurrence in a county. The rules of thumb for this case are:

"If the number of positive cases of Plasmodium vivax IS high, THEN the disease return is likely"

"If the number of positive cases of Plasmodium vivax IS low, THEN the disease return is unlikely"

See Box 8.
And they should be defined as in Box 7.
Box 8. IF LamPosvivax IS High THEN Return IS likely; IF LamPosvivax IS Low THEN Return IS unlikely;
Box 9.
SELECT GEO, LaminasExaminadas, LamPosfalsiparum, LamPosvivax, LamPosMalariae, LPfalsiparumvivax, LamPositivasTotal FROM Tocantins TO Mapa;

Program 11. Fuzzy Projection
Figure 6. Result for query on Program 11. Fuzzy projection. Due to an implementation need, GISEpi uses the original names of the columns in its interface.
Program 10. Solution Variables Relative to Disease Return

Analysis of Data

The first query command (Program 11) does a projection of all fuzzified variables, allowing the analyst to understand the data at a more abstract level (see Box 9). The map presentation shows the number of cases in levels of gray, with black as the highest. We can see in the result that the same value range, such as [5,9], may be assigned to different linguistic terms with different membership values. Also, counties
belonging to different ranges, like “Gurupi” and “Peixe,” can have the same linguistic term. In the second query, the percentage of cases is related to the presence of a large population, conditions that suggest an area of high epidemic risk. We added the disease return possibility and defined a minimum membership possibility of 10% (Program 12). See Box 10.
Program 12. Selection of High Epidemic Risk Areas

The selected records may be seen in the table that appears in Figure 8. Many records do not assign
Figure 7. GISEpi representation of the answer to the query for column LAMPOSTOT
Box 10.
SELECT LamPositivasTotal, CasasExistentes FROM Tocantins TO Mapa
WHERE ((((LamPositivasTotal IS High) OR (LamPositivasTotal IS Medium)) AND ((CasasExistentes IS High) OR (CasasExistentes IS Medium))) OR (Return IS Likely))
WITH 0.1;
high membership values to the term "High," although the record for "DIVINÓPOLIS" results in 100% membership. One can see that the number of positive cases and houses in the area is "Low," which indicates a high presence of Plasmodium vivax. Data inconsistency is also exposed, because the number of houses (population) is null but there are still positive cases. Other queries could be done to catch other data inconsistencies. In the second analysis, the high number of cases is associated with a large population, and these conditions suggest a
high epidemic risk area. We also added the disease return possibility caused by the Plasmodium vivax. We defined a minimum membership possibility of 10% (Program 13) (see Box 11).
Program 13. Selection of High Epidemic Risk Areas

The selected records may be seen in Figure 8. Many records do not indicate high membership values for the term "High," although the record of
Figure 8. Presentation of high epidemic risk areas
Box 11.
SELECT GEO, LamPositivasTotal, CasasExistentes FROM Tocantins TO Mapa
WHERE ((((PerPositivasTotal IS High) OR (PerPositivasTotal IS Medium)) AND ((CasasExistentes IS High) OR (CasasExistentes IS Medium))) OR (Retorno IS Provavel))
WITH 0.1;
“DIVINÓPOLIS” results in 100% membership. One can see that the number of positive cases and houses in the area is “Low,” which indicates a high presence of Plasmodium vivax and forces the area to be selected. Data inconsistency is also exposed because the number of houses (population) is null, but the number of positive cases is not. Other queries could be done to catch data inconsistencies.
The next analysis concerns the areas with a large number of cases but low surveillance. We test the variable VISITS, the ratio of the number of visited houses to the number of existing ones. We set a minimum relevance of 30% in the query of Program 14. The results are seen in Figure 9. The county "ARAGUACU" clearly shows a high membership value, indicating that it should receive more attention from the health agents (see Box 12).
Box 12. SELECT CasasExistentes, PerVisitadas,Visits FROM Tocantins TO Mapa WHERE ((Visits IS Increase)AND ((CasasExistentes IS Medium)OR (CasasExistentes IS High))) WITH 0.3;
Figure 9. Results of the query to discover where to increase visits
Program 14. Selection of the Areas for Visit Increase See Figure 9.
Future Trends

Fuzzy databases provide a fruitful environment for dealing with imprecise data and imprecise concepts, common to human thinking processes. Although they have not achieved mainstream status, their techniques are mature and ready for application, and we expect a growth in their use. As motivating factors, we see the need to deal with the enormous amount of data available nowadays, the convergence of unstructured and structured data, and also of reliable and unreliable sources of data. Front-ends like CLOUDS provide a fast option for using fuzzy concepts on nonfuzzy databases. However, front-ends lack efficiency and force the user to learn yet another interface. We expect to be able to develop a version of CLOUDS internal to open-source DBMSs. This task, however, is much larger than building an external API, due to the intrinsic complexity of DBMSs. Another area of work is the extension of the INSERT, UPDATE, and DELETE operations to fuzzy equivalents. This was not dealt with in CLOUDS, since it was motivated by a need to analyze data.
Conclusion

In this chapter, we described CLOUDS, a tool that allows the creation of fuzzy reasoning systems over relational databases. CLOUDS' main characteristic is the mix of a fuzzy SQL approach with a traditional Mamdani-like inference engine. In this way, users are able to make fuzzy interpretations of crisp databases, either directly, using linguistic variables and linguistic terms, or through a fuzzy reasoning engine. CLOUDS is implemented in an extensible way, in C++, with an architecture based on abstract classes. This allows users to implement their own classes to substitute operators, hedges, types of linguistic term, membership functions, and so forth. The API is easy to use and extend. It is also possible, although a bit more difficult, to extend the language; to do that, the user is required to know how to manipulate parser generators, such as Bison and YACC. The API was used to implement a fuzzy version of an existing epidemiological geographic information system. In our example application, we showed how important health-care decisions could be modeled in fuzzy queries developed with CLOUDS. CLOUDS is open-source and can be obtained directly from the authors or at SourceForge.8 We are now developing a Java version of CLOUDS, designed to extend JDBC.
References

ANSI: American National Standards Institute. (1992). Database language SQL. ANSI X3.135-1992. New York: American National Standards Institute.
Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17. Bosc, P., & Pivert, O. (1997). Extending SQL retrieval features for the handling of flexible queries. In D. Dubois, R. Yager, & H. Prade (Eds.), Fuzzy information engineering: A guided tour of applications. New York: Wiley Computer Publishing. Boullosa, J. R. F., Cruz, F. C. A., & Xexéo, G. (1999). Incerteza em bancos de dados: Tipo de dados nebulosos no GOA++ [Uncertainty in databases: Fuzzy data types in GOA++]. Paper presented at the XIV Simpósio Brasileiro de Bancos de Dados, Florianópolis. Buckles, B. P., & Petry, F. E. (1982). A fuzzy representation of data for relational databases. Fuzzy Sets and Systems, 7, 213-226. Cao, T. H., & Rossiter, J. M. (2003). A deductive probabilistic and fuzzy object-oriented database language. Fuzzy Sets and Systems, 140, 129-150. Cox, E. (1994). The fuzzy systems handbook: A practitioner's guide to building and maintaining fuzzy systems. Boston: AP Professional. Galindo, J., Medina, J. M., & Aranda, M. C. (1999). Querying fuzzy relational databases through fuzzy domain calculus. International Journal of Intelligent Systems, 14(4), 375-411. Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing. Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1995). Design patterns. Boston: Addison-Wesley.
Bolstad, P. (2005) GIS fundamentals: A first text on geographic information systems (2nd ed.). White Bear Lake, MN: Eider Press.
Jankowski, P., & Nyerges, T. L. (2001). Geographic information systems for group decision making. London: Taylor & Francis.
Bosc, P., Kraft, D., & Petry, F. (2005). Fuzzy sets in database and information systems: Status and opportunities. Fuzzy Sets and Systems, 156, 418-426.
Klir, G. G., & Yuan, B. (1995). Fuzzy sets and fuzzy logic: Theory and applications. Upper Saddle River, NJ: Prentice Hall.
Liu, D., & Li, D. (1990). Fuzzy PROLOG database system. New York: John Wiley & Sons, Inc.

Medina, J. M., Pons, O., & Vila, M. A. (1994). GEFRED: A generalized model of fuzzy relational databases. Information Sciences, 76(1-2), 87-109.

Nobre, F. F., Braga, A. L., Pinheiro, R. S., & Lopes, J. A. S. (1997). GISEpi: A simple geographical information system to support public health surveillance and epidemiological investigations. Computer Methods and Programs in Biomedicine, 53, 34-45.

Pedrycz, W., & Gomide, F. (1998). An introduction to fuzzy sets: Analysis and design. Cambridge, MA: MIT Press.

Petry, F. (1996). Fuzzy databases. Boston: Kluwer Academic Publishers.

Turksen, I. B. (1991). Measurement of membership functions and their acquisition. Fuzzy Sets and Systems, 40(1), 5-38.

Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy relational model of fuzzy data. Journal of Intelligent Information Systems, 3.

Zadeh, L. A. (1973). Outline of a new approach to the analysis of complex systems and decision processes. IEEE Transactions on Systems, Man and Cybernetics, 3(1), 28-44.

Zadeh, L. A. (1975). The concept of linguistic variable and its application to approximate reasoning. Information Sciences, 8, 199-249 (part I); 8, 301-357 (part II); 9, 43-80 (part III).

Key Terms

API: Application Programming Interface. A source-code interface that a software library provides for programmers to call within their programs and access its services.

CLOUDS: C++ Library Organizing Uncertainty in Database Systems. A library and user interface that act as a fuzzy front-end to a relational database management system.

CLOUDS-L: The command language of CLOUDS that can be directly translated to API commands.

CLOUDSQL: The SQL-like interface of CLOUDS.

Design Pattern: A design pattern (Gamma, 1995) is a design structure that solves a recurrent problem in some area, such as programming.

GIS: Geographic Information Systems. Systems that usually allow the user to manipulate georeferenced data and present it using maps.

Linguistic Term: A subjective value that can be attributed to a linguistic variable, such as "young" and "mature" for "age."

Linguistic Variable: A variable, or label, that represents some characteristic of an element, such as "age" for persons or "temperature" for water. It is a variable that takes words, known as linguistic terms, as values. See a formal definition in the first pages of this chapter.

Solution Variable: In CLOUDS, a solution variable is a linguistic variable that appears in the consequent part of an IF-THEN rule.
Endnotes

1 http://www.sf.net/clouds
2 Some authors define fuzzy systems as rule-based systems using fuzzy variables and fuzzy inferences.
3 We thank the authors for kindly providing this table.
4 We use a simplified BNF-like notation. Terms inside square brackets are optional. An asterisk after a term means that zero or many instances of it are allowed. An addition sign after a term means that one or many instances of it are allowed.
5 The STRING type is invalid for a fuzzy number (FZYNUMBER).
6 Tocantins is a Brazilian state with 278,420.7 km², mostly with tropical climate, at the southeast limit of the Amazon Rainforest.
7 This example shows how fuzzy rules can add a new perspective to model uncertainties in health information systems. The rules used here do not fully express the epidemiological problem and represent only part of a complete solution to the problem of sanitary vigilance.
8 http://www.sourceforge.net
407
Chapter XVI
Data Model of FRDB with Different Data Types and PFSQL

Aleksandar Takači
University of Novi Sad, Serbia

Srđan Škrbić
University of Novi Sad, Serbia
Abstract

This chapter introduces a way to extend the relational model with mechanisms that can handle imprecise, uncertain, and inconsistent attribute values using fuzzy logic. It describes in detail how the relational model is extended to include fuzzy capabilities. In addition, we describe a query language called PFSQL for this fuzzy database model. Besides basic fuzzy capabilities, this query language adds the possibility to specify priorities for fuzzy statements; this appears to be the first implementation with such capabilities. We also describe the relationship of FRDB (fuzzy relational database) to PFCSP (priority fuzzy constraint satisfaction problems) and GPFCSP (generalized priority fuzzy constraint satisfaction problems), theoretical concepts vital for the implementation of PFSQL. The authors propose several points at which this research and implementation can be continued and extended, contributing to a better understanding of fuzzy database concepts and techniques and giving numerous possibilities for further development in this area.
Introduction

Related Work

One of the disadvantages of the relational model is its inability to model uncertain and incomplete data. The idea of using fuzzy sets and fuzzy logic to extend existing database models with these possibilities has been pursued since the 1980s. Although this area has been researched for a long time, concrete implementations are rare. The literature contains references to several models of fuzzy knowledge representation in relational databases. One of the early works, the Buckles-Petry model (Buckles & Petry, 1982), is the first model that introduces similarity relations in the relational model. This work gives a structure for representing inexact information in the form of a relational database. The structure differs from ordinary relational databases in two important
aspects: components of tuples need not be single values, and a similarity relation is required for each domain set of the database.

Zvieli and Chen (1986) offered a first approach to incorporating fuzzy logic in the entity-relationship (ER) model. Their model allows fuzzy attributes in entities and relationships, and defines three levels of fuzziness in the ER model. At the first level, entity sets, relationships, and attribute sets may be fuzzy; that is, they have a membership degree to the model. The second level is related to the fuzzy occurrences of entities and relationships, and to the notion of which instances belong to the entity or relationship with different membership degrees. Finally, the third level concerns the fuzzy values of attributes of special entities and relationships.

The GEFRED (generalized model of fuzzy relational databases) model (Medina, Pons, & Vila, 1994) is a possibilistic model that refers to generalized fuzzy domains and admits possibility distributions in domains. This is a fuzzy relational database model with representation capabilities for a wide range of fuzzy information. In addition, it describes a flexible way to handle this information, and it contains the notion of unknown, undefined, and null values. This notion was defined earlier in Umano and Fukami (1994). The GEFRED model experienced subsequent expansions (Galindo, Medina, & Aranda, 1999; Galindo, Medina, Cubero, & Garcia, 2001).

Chen and Kerre (1998) and Kerre and Chen (2000) introduced fuzzy extensions of several major extended entity-relationship (EER) concepts. Fuzzy logic was applied to some of the basic EER concepts connected to the notions of subclass and superclass. Chaudhry, Moyne, and Rundensteiner (1994) proposed a method for designing fuzzy relational databases following the extension of the ER model of Zvieli and Chen.
They also proposed a design methodology for fuzzy relational databases (FRDBs), which contains extensions for representing the imprecision of data in the ER data model and a set of steps for the derivation of an FRDB from this extended ER model. Galindo, Urrutia, and Piattini (2006) describe a way to use the fuzzy EER model to model the
database and represent modeled fuzzy knowledge using a relational database in detail. This work gives insight into some new semantic aspects and extends the EER model with fuzzy capabilities. This model is called the FuzzyEER model. Also, a way to translate the FuzzyEER model to FIRST-2, a database schema that allows representation of fuzzy attributes in relational databases, is given. The FIRST-2 schema introduces a concept of fuzzy metaknowledge base (FMB). For each attribute type, it defines how to represent values and what information about them has to be stored in the FMB. In addition, in this work the authors describe in great detail the specification and implementation of the fuzzy structured query language (FSQL), an extension of the SQL language that allows users to write flexible conditions in queries, using all extensions defined by the FuzzyEER model. In this book, you can find a chapter by Urrutia, Tineo, and Gonzalez, comparing the SQLf and FSQL languages. Another chapter of this book includes a review about flexible querying, and it has been written by Kacprzyk, Zadrożny, de Tré, and de Caluwe.

The concept of the constraint satisfaction problem (CSP) has been known for years. The aim of a CSP is to find a solution that satisfies all the constraints in optimal time (Dubois & Fortemps, 1999). If the satisfaction of a constraint is not a Boolean value, that is, if there can be many levels of constraint satisfaction, it is clear that there is room for inserting fuzzy values and fuzzy logic into CSP. We can model constraints as fuzzy sets over a particular domain. This leads to the fuzzy constraint satisfaction problem (FCSP). Fargier and Lang (1993) interpret the degree of satisfaction of a constraint as the membership degree of its domain value in the fuzzy set that represents it.
In order to obtain the global satisfaction degree, we need to aggregate the values of all constraints. As aggregation operators, we use fuzzy logic operators: t-norms, t-conorms, and fuzzy negation. Priority is generally viewed as the importance level of an object among others, and it is often used in real-time systems. PFCSP is actually a
fuzzy constraint satisfaction problem (FCSP) in which the notion of priority is introduced. Dubois, Fargier, and Prade (1994) propose the idea of how to handle priority in decision making. Later, Luo, Jennings, Shadbolt, Leung, and Lee (2003) and Luo, Lee, Leung, and Jennings (2003) develop the idea and axiomatize PFCSP systems. Finally, Takači (2005, 2006) generalizes PFCSP systems, and Takači and Škrbić (2005) introduce the idea of their use in querying FRDBs.

We conclude that the current state of the art in this area includes mature fuzzy EER model extensions that describe a wide range of modeling concepts for full-flavored fuzzy database modeling. These conceptual models are supported by robust models for fuzzy data representation in relational databases, such as FIRST-2. Ways to translate conceptual models to relational-based ones are also studied in detail. In addition, FSQL represents the first implementation of a fuzzy database query language that incorporates a majority of fuzzy logic concepts.

In this chapter, we focus on three topics: the FRDB data model, the query-database similarity relation, and priority fuzzy SQL (PFSQL). We discuss the implementation aspects of the model and give some pointers on how to implement an FRDB, in particular imprecise, uncertain, and inconsistent values. Our model for fuzzy data representation in relational databases allows attribute values to be any general fuzzy subset of the attribute domain. On the other hand, the most common types of fuzzy sets (triangular, trapezoidal, interval) are implemented separately in order to simplify calculations with them. This model is not intended to be a general model for fuzzy data representation; we rather introduce it as a basis on which we build the PFSQL language and an interpreter for it. It has many similarities with already existing, more general models. In order to include fuzzy capabilities, we have developed the PFSQL query language from the ground up, extending the features of SQL.
The PFSQL language is an extension of the SQL language that allows fuzzy logic concepts to be used in queries.
Among other features described in detail in this chapter, this query language allows priority statements to be specified for query conditions. For calculating the membership degree of query tuples when priorities are assigned to conditions, we use GPFCSP.

In an FRDB, we need to introduce relational operators on the domain, which can be either crisp or fuzzy. If we consider attributes with numerical values, the crisp relational operators are well known (⊆, =, <, >, ≤, and ≥). In order for them to be properly defined in an FRDB, they should be extended from the set of real numbers to the set of all fuzzy subsets of the domain X (denoted F(X)). In Zadeh (1965), the inclusion operator ⊆_F was defined together with the introduction of fuzzy sets. The operator = acts in an FRDB the same way as in an RDB. However, the operators <_F, >_F, ≤_F, and ≥_F demand the introduction of an ordering on the set F(X). Bodenhofer (1998) has introduced an ordering which is an extension of the classical ordering ≤ to the set F(X). From this ordering, we will derive the operators <_F, >_F, and ≥_F that act on F(X). As is well known, fuzzy relational operators return a compatibility degree for each pair of tuples, which is a number from the unit interval. The operator FINCL (fuzzy inclusion) is well known in fuzzy set theory. From the fuzzy inclusion operator, FQ (fuzzy equal) is derived. FLQ (fuzzy less or equal), FGQ (fuzzy greater or equal), FG (fuzzy greater), and FL (fuzzy less) are derived when the ordering ≤_F on F(X) is fuzzified. Similar fuzzy operators have been given in Galindo et al. (2006), and later we compare these two types of operators.
Assume that the conditions in the WHERE clause of a database query have different degrees of importance (e.g., height=180 PRIORITY 1, weight=85 PRIORITY 0.5); then it is necessary to
upgrade the query language with a logic that handles priority. The generalized priority fuzzy constraint satisfaction problem (GPFCSP) is designed for that purpose, which makes it an ideal theoretical background for PFSQL. GPFCSP originates
from constraint satisfaction problems (CSP), and it is introduced axiomatically. A complete description of GPFCSP is given in a separate section of this chapter. Besides the description
of already modeled and implemented features, we propose several ways in which this research and implementation can be continued and extended, contributing to a better understanding of fuzzy database concepts and techniques and giving numerous possibilities for further development in this area.

The structure of the chapter is as follows. In the following subsection, we recall basic notions from fuzzy logic and the types of fuzzy sets that are used later. The subsequent three sections describe the main contributions of this chapter. The second section, entitled Data Model Extensions, describes the fuzzy data model; its subsection is dedicated to the implementation of the SQL extension, PFSQL. The third section, entitled Relational Operators in FRDB, introduces new relational operators in FRDB and compares them with existing ones; it also contains a subsection about handling specific fuzzy data types in PFSQL. The fourth section, entitled Generalized Priority Fuzzy Constraint Satisfaction Problems, introduces the theoretical concepts that PFSQL relies on. The final section concludes the chapter.
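Before the formal treatment in the GPFCSP section, the effect of prioritized conditions can be previewed with a small sketch. This is not the chapter's GPFCSP construction: it uses the classical weighted-minimum aggregation of prioritized fuzzy constraints (Dubois, Fargier, & Prade, 1994), and all names in the code are our own.

```python
# Illustrative sketch (not the chapter's GPFCSP construction): a classical
# way to aggregate prioritized fuzzy conditions is the weighted minimum:
# each condition i has a satisfaction degree s_i in [0,1] and a priority
# p_i in [0,1], and the global degree is
#     min_i max(1 - p_i, s_i),
# so a condition with priority 0 cannot reject a tuple, while a condition
# with priority 1 acts as an ordinary fuzzy conjunct.

def weighted_min(conditions):
    """conditions: list of (satisfaction, priority) pairs."""
    return min(max(1.0 - p, s) for s, p in conditions)

# A tuple satisfying height=180 fully (priority 1) but weight=85 only to
# degree 0.4 (priority 0.5): the low-priority mismatch is softened.
degree = weighted_min([(1.0, 1.0), (0.4, 0.5)])
print(degree)  # 0.5
```

With this scheme, lowering a condition's priority raises the floor of its contribution, which matches the intuition behind the PRIORITY annotations in the example above.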
Fuzzy Sets and Fuzzy Logic

In this section, we give the tools needed to introduce priority logic. For more information, see Lowen (1996). Fuzzy logic works with truth values from the unit interval. Values from the unit interval are used to interpret different priority degrees, thus making fuzzy logic the base for priority logic. As is well known, most logics are based on conjunction, disjunction, and negation operators. In fuzzy logic, t-norms are used for conjunction,
t-conorms (s-norms) are used for disjunction, and, finally, for fuzzy negation, special mappings N : [0,1] → [0,1] are used. Briefly, fuzzy logic is defined as a tuple L = (A, ¬, ∧, ∨, 1, 0) together with the structure Ω = ([0,1], M), where M gives the interpretation of the connectives. Usually, a t-norm is used to interpret conjunction, and a t-conorm is used to interpret disjunction. We give definitions of t-norms, t-conorms, and the properties of the negation operator (Klement, Mesiar, & Pap, 2000).

Definition 1. A mapping T : [0,1]² → [0,1] is called a t-norm if the following conditions are satisfied for all x, y, z ∈ [0,1]:

(C1) T(x, y) = T(y, x);
(C2) T(x, T(y, z)) = T(T(x, y), z);
(C3) if y ≤ z, then T(x, y) ≤ T(x, z);
(C4) T(x, 1) = x.

The most common t-norms are:

• T_M(x, y) = min(x, y);
• T_P(x, y) = xy;
• T_L(x, y) = max(x + y − 1, 0).

Since every t-norm is associative (C2), it can be extended to an n-ary operation:

T(x₁, x₂, …, xₙ) = T(x₁, T(x₂, …, T(xₙ₋₁, xₙ)…)).

In order to generalize disjunction, t-conorms (s-norms) are used.

Definition 2. A mapping S : [0,1]² → [0,1] is called an s-norm, or a t-conorm, if the conditions (C1), (C2), (C3) from Definition 1 and the following condition (C4') are satisfied for all x, y, z ∈ [0,1]:

(C4') S(x, 0) = x.

Every t-conorm is dual to a t-norm and vice versa; that is, we have for each dual pair:

T(x, y) = 1 − S(1 − x, 1 − y),
S(x, y) = 1 − T(1 − x, 1 − y).
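These definitions translate directly into code. The following sketch transcribes the three common t-norms and derives their dual t-conorms from the duality formula above; the helper names are our own.

```python
# Direct transcription of the common t-norms; names T_M/T_P/T_L follow the text.

def t_min(x, y): return min(x, y)                     # T_M
def t_product(x, y): return x * y                     # T_P
def t_lukasiewicz(x, y): return max(x + y - 1.0, 0.0) # T_L

def dual_conorm(t):
    """S(x, y) = 1 - T(1 - x, 1 - y): the t-conorm dual to a t-norm."""
    return lambda x, y: 1.0 - t(1.0 - x, 1.0 - y)

s_max = dual_conorm(t_min)          # S_M(x, y) = max(x, y)
s_prob = dual_conorm(t_product)     # S_P(x, y) = x + y - xy
s_luk = dual_conorm(t_lukasiewicz)  # S_L(x, y) = min(x + y, 1)

print(s_max(0.3, 0.8))              # 0.8
print(round(s_prob(0.3, 0.8), 3))   # 0.86
print(s_luk(0.3, 0.8))              # 1.0
```

Building the conorms through `dual_conorm` rather than writing them out keeps the duality stated in the text explicit in the code.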
The most common t-conorms are:

• S_M(x, y) = max(x, y);
• S_P(x, y) = x + y − xy;
• S_L(x, y) = min(x + y, 1).

The conorms S_M, S_L, and S_P are dual to T_M, T_L, and T_P, respectively. Now, we give the properties of a negation operator in fuzzy logic.

Definition 3. A fuzzy negation is a mapping N : [0,1] → [0,1] satisfying:

(N1) N(0) = 1 and N(1) = 0.
A strong fuzzy negation satisfies, besides (N1):

(N2) N is continuous;
(N3) N is strictly decreasing;
(N4) N ∘ N = id_[0,1].

One of the most common strong fuzzy negations, which will be used later, is the standard negation: N_S(x) = 1 − x.

Now, we recall the notion of a fuzzy set.

Definition 4. A fuzzy subset A of a universe X is determined by its membership function μ_A : X → [0,1], where μ_A(x), x ∈ X, is interpreted as the membership degree of x in the set A. A fuzzy subset of X × X is called a fuzzy relation.

Using the operators T, S, and N, we can introduce operations on fuzzy sets. If A, B are two fuzzy sets, then the union A ∪ B is defined as:

μ_{A∪B}(x) = S(μ_A(x), μ_B(x)),

where S is a t-conorm. Similarly, the intersection of A, B, A ∩ B, is defined as:

μ_{A∩B}(x) = T(μ_A(x), μ_B(x)),

where T is the t-norm dual to S. Finally, the set difference A \ B of two fuzzy sets A, B is defined by:

μ_{A\B}(x) = T(μ_A(x), N(μ_B(x))),

where N is the fuzzy negation operator.

To every fuzzy set A, we assign the following important crisp sets:

• the kernel of a fuzzy set A: ker(A) = {x ∈ X | μ_A(x) = 1};
• the support of A: supp(A) = {x ∈ X | μ_A(x) > 0}.

If a fuzzy set A has a finite support, then its cardinality is calculated as:

card(A) = Σ_{x∈X} μ_A(x).

On the other hand, if supp(A) is an infinite set, then:

card(A) = ∫_X μ_A(x) dx.

If the kernel of A is a non-empty set, then we say that the fuzzy set is normalized. If a fuzzy set A is normalized and has a convex membership function and a bounded kernel, then A is called a fuzzy number. Actually, if A is a fuzzy number, then ∃x₀ ∈ X such that μ_A(x₀) = 1. In particular, if X is the set of real numbers, then the restriction of μ_A to the interval (−∞, x₀], denoted by L_A, is a non-decreasing function. Similarly, the restriction of μ_A to the interval [x₀, +∞), denoted by R_A, is a non-increasing function. Let x_L = min{x ∈ ker(A)} and x_R = max{x ∈ ker(A)}. Then, if L_A is strictly increasing on (−∞, x_L] and R_A is strictly decreasing on [x_R, +∞), then A is called a trapezoidal fuzzy number. Moreover, if L_A and R_A are linear, A is called a linear trapezoidal fuzzy number. Finally, if x_R = x_L, then A is a triangular fuzzy number. Normalized fuzzy sets that have either only increasing or only decreasing continuous membership functions are called fuzzy quantities. Their kernel is bounded from either the left or the right
and unbounded from the other side. The most common relation on fuzzy subsets, ⊆_F, is defined in the following way.

Definition 5. For two fuzzy sets A, B, the following holds:

A ⊆_F B iff (∀x ∈ X)(μ_A(x) ≤ μ_B(x)).

Also, we have:

A =_F B iff (∀x ∈ X)(μ_A(x) = μ_B(x)).

It is easy to see that the following property holds: if A ⊆_F B ∧ B ⊆_F A, then A =_F B.
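On a finite (discretized) domain, Definitions 4 and 5 can be checked mechanically. The sketch below represents fuzzy sets as dictionaries from domain points to membership degrees and uses S = max and T = min, as in the text; all names in the code are our own.

```python
# Sketch of Definitions 4-5 on a discretized domain: fuzzy sets as
# dictionaries mapping domain points to membership degrees.

def union(a, b):
    # mu_{A u B}(x) = S(mu_A(x), mu_B(x)) with S = max
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

def intersection(a, b):
    # mu_{A n B}(x) = T(mu_A(x), mu_B(x)) with T = min
    return {x: min(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

def card(a):
    # Finite-support cardinality: the sum of membership degrees.
    return sum(a.values())

def subset_f(a, b):
    # A is a fuzzy subset of B iff mu_A(x) <= mu_B(x) for every x.
    return all(a.get(x, 0.0) <= b.get(x, 0.0) for x in a)

A = {1: 0.2, 2: 1.0, 3: 0.5}
B = {1: 0.4, 2: 1.0, 3: 0.9, 4: 0.1}
print(subset_f(A, B))                       # True
print(round(card(intersection(A, B)), 6))   # 1.7
```

The same dictionaries can represent fuzzy relations by using pairs as keys, which is how one would discretize a fuzzy subset of X × X.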
Data Model Extensions

Our model allows data values to be any fuzzy subset of the attribute domain. Users only need to specify a membership function of a fuzzy set. As is well known, the membership function characterizes the fuzzy set completely, which allows us to use it as a database attribute value. Hypothetically, for each fuzzy set we would need an algorithm for calculating the values of its membership function. This would lead to a large space complexity of the database. Most often, this is solved by introducing well-known standard types of fuzzy sets (triangular, trapezoidal, etc.) as attribute values. If a type of fuzzy set is introduced, then we only need to store the parameters that are necessary to calculate the value of the membership function. This is the most common way to implement an FRDB, and we have done it in our model also. On the other hand, we did not want to restrict ourselves to these particular fuzzy sets, so we allow users to specify a general membership function for each attribute value. Our idea is to have the most common fuzzy set types implemented, so that attribute values in the FRDB are most often standard fuzzy sets, and only a small percentage of attribute values are generalized
fuzzy sets specified by users, though our model works with general fuzzy sets in every aspect of the FRDB: storing, querying, and so forth.

We introduce one more extension of the attribute value: the linguistic label. Linguistic labels are used to represent the most common and widely used expressions of a natural language (such as "tall people," "small salary," or "mediocre result"). Linguistic labels are in fact named fuzzy values from the domain. In order to use a linguistic label on some domain, we must first define this label. For instance, we can define the linguistic label "tall man" as a fuzzy quantity that has an increasing linear membership function from the point (185, 0) to the point (200, 1). Considering these extensions, we can define the domain of a fuzzy attribute as:

D = DC ∪ FD ∪ LL,
where DC is a classical attribute domain, FD is a set of all fuzzy subsets of the domain, and LL is the set of linguistic labels.
In order to store these values in a relational database, we must extend it with fuzzy metadata. We will use the example from Figure 1 to show how to cope with this problem. There are four attributes in this example that have fuzzy values, namely height and weight in the table Worker, and size and speed in the table Car. In order to represent these fuzzy values in the database, we extend this model with two additional tables, as shown in Figure 2. Table IsFuzzy simply stores whether an attribute is fuzzy or not: all attribute names in the database are stored here, and beside each name we have the information whether the attribute is fuzzy. Table FuzzyTypes stores the concrete fuzzy values. Every fuzzy value in every table is a foreign key that references the attribute typeID, the primary key of the table FuzzyTypes. Thus, we have one record in table FuzzyTypes for every record with a fuzzy value in the database. Attributes name and value store the concrete values for every fuzzy record. Attribute name can have the following values:
• crisp – the value is a crisp, nonfuzzy value;
• interval – the fuzzy value is an interval;
• triangle – the fuzzy value is a triangular fuzzy number;
• trapezoid – the fuzzy value is a trapezoidal fuzzy number;
• general – the fuzzy value is a general fuzzy set;
• fuzzyQuantity – the fuzzy value is a fuzzy quantity;
• lingLabel – the fuzzy value is a linguistic label.
Attribute value has a different structure for every type of fuzzy attribute. In the case of a crisp value, it stores the specified crisp value. For example, Table 1 shows the record that represents the crisp value 15385.33. Attribute forTypeID has a special role that will be described later. In the case of the value interval for attribute name, the attribute value stores the expression [a,b] that represents the interval with boundaries a and b. The interval [123.6, 178.98] is represented by the record shown in Table 2. If the fuzzy value is a triangular fuzzy number,
the expression stored is (c,a,b,lin), where c, a, and b are the values shown in Figure 3. The fourth value is always lin, because we deal only with "linear" triangular fuzzy numbers; this value is introduced for future use. The trapezoidal fuzzy value is represented by five values (a,b,c,d,lin). The meaning of the first four values is shown in Figure 3. The fifth value is always lin, and it is introduced for the same reasons as described earlier. For example, records that represent the trapezoidal fuzzy value (75,77,21,5,lin) and the triangular fuzzy value (180,11,8,lin) are shown in Table 3. The general fuzzy value is a generalization of triangular and trapezoidal fuzzy values. It allows the user to define fuzzy values by giving a sequence of points and a value for every point, like the ones given in Figure 3. If we describe this fuzzy number with the expression general((20,0),(21,1,lin),(22,0,lin),(26,0,lin),(27,1,lin),(28,1,lin)), then a record in the database that represents it could be the one in Table 4. The first value does not have the designation lin, hence it only represents a point with a value. The
Table 5. Linguistic labels and fuzzy quantities

typeID | name | value | forTypeID
1 | lingLabel | tallPeople | 2
2 | fuzzyQuantity | (180,200,inc,lin) | null

Figure 3. Different types of fuzzy values (a trapezoidal fuzzy number with kernel [a, b] and support [a−c, b+d]; a triangular fuzzy number with peak c and support [c−a, c+b]; a general fuzzy number; an increasing fuzzy quantity rising from a to b; and a decreasing fuzzy quantity falling from a to b)

The lin keyword in the other values describes the nature of the line that connects the value with the previous one. In our implementation, it is always lin, but this can be changed in the future. If we look more carefully, we see that the membership function of this fuzzy set represents the statement "21 ± 1 OR 27 ± 1," that is, an inconsistent attribute value. If attribute name has the value fuzzyQuantity, then attribute value has the form (a,b,inc/dec,lin). The meaning of the values a and b is shown in Figure 3. The third value can be inc or dec: in the first case, the fuzzy quantity is increasing, and in the second, it is decreasing. The fourth value is always lin, as described before.

Attribute forTypeID is a foreign key that represents a recursive relationship and references the primary key of the FuzzyTypes table. This attribute is used to represent linguistic labels. It has a value different from null only if the name of the attribute that it represents is lingLabel. As mentioned before, linguistic labels only represent names for previously defined fuzzy values. In this fashion, if attribute name has the value lingLabel, then the value of the attribute value is the name of the linguistic label. In this case, attribute forTypeID holds the typeID of the fuzzy value that this linguistic label represents. We conclude that in order to represent a linguistic label, we need two records. For example, if we wish to represent the linguistic label tallPeople as the fuzzy quantity (180,200,inc,lin), we would do that with the two records shown in Table 5.

Comparing our model to the more general FIRST-2 model (Galindo et al., 2006), we conclude that there are several similarities between them. Although the methods of fuzzy value representation are completely different, functionally, our model is a subset of the FIRST-2 model. Fuzzy attributes of type 1 in the FIRST-2 model are crisp values that our model also supports. The fuzzy types that our model covers are a subset of those represented by fuzzy attributes of types 2 and 3. Null values, intervals, and trapezoidal fuzzy numbers are represented in FIRST-2 by structures with the same name. A subset of the set of triangular fuzzy numbers, the isosceles triangles, is represented by approximate values with explicit margins in the FIRST-2 model. All other types of triangular fuzzy numbers, as well as fuzzy quantities, can be represented by possibility distributions with two and four values in FIRST-2, although these distribution types are more general. The general fuzzy number from our model is known as a fuzzy attribute of type 3 in the FIRST-2 model. Moreover, the FIRST-2 model describes a wider range of other possibilities for fuzzy values. Finally, our scheme represents fuzzy values as character strings, while the FIRST-2 model combines atomic values according to the respective structure. In our model, this feature is yet to be developed.
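The parameter encodings above can be turned into membership functions mechanically. The following hypothetical sketch decodes the triangle, trapezoid, and fuzzyQuantity tuples described in the text and evaluates the tallPeople label from Table 5; the function names are our own and not part of the model.

```python
# Hypothetical sketch of turning FuzzyTypes records into membership
# functions. The encodings follow the text: triangle (c,a,b,lin) peaks at c
# with support [c-a, c+b]; trapezoid (a,b,c,d,lin) has kernel [a,b] and
# support [a-c, b+d]; fuzzyQuantity (a,b,inc,lin) rises linearly from
# (a,0) to (b,1). All names here are ours, not part of the model.

def triangle(c, a, b):
    def mu(x):
        if c - a < x <= c:
            return (x - (c - a)) / a
        if c < x < c + b:
            return ((c + b) - x) / b
        return 0.0
    return mu

def trapezoid(a, b, c, d):
    def mu(x):
        if a <= x <= b:
            return 1.0
        if a - c < x < a:
            return (x - (a - c)) / c
        if b < x < b + d:
            return ((b + d) - x) / d
        return 0.0
    return mu

def fuzzy_quantity(a, b, direction="inc"):
    def mu(x):
        if x <= a:
            return 0.0 if direction == "inc" else 1.0
        if x >= b:
            return 1.0 if direction == "inc" else 0.0
        t = (x - a) / (b - a)
        return t if direction == "inc" else 1.0 - t
    return mu

# The linguistic label tallPeople = (180,200,inc,lin) from Table 5:
tall = fuzzy_quantity(180.0, 200.0, "inc")
print(tall(190.0))  # 0.5
```

A general fuzzy value could be decoded the same way, by linear interpolation between its listed points; that case is omitted here for brevity.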
Priority Fuzzy Structured Query Language (PFSQL)

In order to allow the usage of fuzzy values in SQL queries, we extend classical SQL with several new elements. In addition to the fuzzy capabilities that make a fuzzy SQL (FSQL), we add the possibility to specify priorities for fuzzy statements. We named the query language constructed in this manner priority fuzzy SQL (PFSQL). This appears to be the first implementation that has such capabilities. The basic difference between SQL and PFSQL is in the way the database processes records. In a classical relational database, queries are executed so that a tuple is either accepted into the result set, if it fulfills the conditions given in the query, or removed from the result set if it does not. In other words, every tuple is given a value true (1) or false (0). On the other hand, as the result set, PFSQL returns a fuzzy relation on the database: every tuple considered in the query is given a value from the unit interval, calculated using the operators of fuzzy logic. The question is: which elements of classical SQL should be extended? Because variables can have both crisp and fuzzy values, it is necessary to allow comparisons between different types of fuzzy values as well as between fuzzy and crisp values. In other words, PFSQL has to be able to calculate expressions like height=triangle(180,11,8,lin), regardless of whether the value of height in the database is fuzzy or crisp. Next, we demand the possibility to set conditions like height <_F …, where the operators <_F, >_F, and ≥_F are defined in the same way the operators <, >, and ≥ are defined on the set of real numbers. The classical (crisp) relational operators in PFSQL are listed in Table 7. Besides the classical relational operators, we recall the fuzzy relational operators: FINCL (fuzzy inclusion), FQ (fuzzy equal), FLQ (fuzzy less or equal), FGQ (fuzzy greater or equal), FG (fuzzy greater), and FL (fuzzy less).
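The difference in record processing can be illustrated with a toy example: instead of filtering tuples, a fuzzy condition assigns each tuple a degree. This is only a sketch of the idea, not the PFSQL interpreter; the relation and all names are invented for illustration.

```python
# Sketch of the behavioral difference described above: instead of keeping
# or dropping tuples, a fuzzy query assigns each tuple a degree in [0,1].
# The condition "height is about 180" is modeled by the triangular fuzzy
# set triangle(180,11,8,lin); crisp heights are matched by evaluating its
# membership function (a simplification of what PFSQL does in general).

def about_180(x):
    # triangle(180,11,8,lin): support (169, 188), peak at 180
    if 169.0 < x <= 180.0:
        return (x - 169.0) / 11.0
    if 180.0 < x < 188.0:
        return (188.0 - x) / 8.0
    return 0.0

workers = [("Ana", 180.0), ("Boris", 184.0), ("Ceca", 165.0)]

# The result is a fuzzy relation: every tuple paired with its degree.
fuzzy_result = [(name, h, about_180(h)) for name, h in workers]
for row in fuzzy_result:
    print(row)
```

A classical WHERE clause would return only Ana; the fuzzy result instead ranks Boris at 0.5 and excludes Ceca with degree 0, which is exactly the unit-interval semantics described above.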
ordering which can be applied to a larger class of fuzzy sets is proposed:

A ≤_F B iff A ≤'_F B.

The ordering ≤_F ranks fuzzy numbers depending on their horizontal position on the graph: the more the membership function is to the right on the graph, the larger the fuzzy number. This is not a total ordering; incomparable fuzzy sets can be seen in Figure 4. For incomparable fuzzy sets A and B, it is obvious that neither A ≤_F B nor B ≤_F A holds. The introduction of this ordering into the implementation is not a problem, since we only need to find the maximum to the left and to the right in each point. This can be done in one pass over the domain once a fuzzy set is introduced.

Definition 10. Let A, B be two fuzzy sets. The relation FINCL(A, B) is defined as:

FINCL(A, B) = card(A ∩ B) / card(A),

where card(S) is the cardinality of the fuzzy set S. Some examples for the FINCL relation are given in Table 8. As we have mentioned, the property

if A ⊆_F B ∧ B ⊆_F A, then A =_F B

holds for the relations ⊆_F and =_F. Using this property, we will derive the relation FQ.

Definition 11. Let A, B be two fuzzy sets. The relation FQ(A, B) is defined as:

FQ(A, B) = T(FINCL(A, B), FINCL(B, A)),

where T is a t-norm. If we use T = T_M, then we have:

FQ(A, B) = card(A ∩ B) / max(card(A), card(B)).

Table 7. Crisp operators in PFSQL

Operator | Definition
A ⊆_F B | (∀x ∈ X)(μ_A(x) ≤ μ_B(x))
A =_F B | (∀x ∈ X)(μ_A(x) = μ_B(x))
A ≤_F B | LTR(A) ⊇ LTR(B) ∧ RTL(A) ⊆ RTL(B)
A >_F B | A ≥_F B ∧ ¬(A =_F B)

Table 8. Examples for the FINCL relation

A | B | FINCL(A, B)
tri(170,5,5,lin) | tri(170,10,10,lin) | 1
tri(170,10,10,lin) | tri(170,5,5,lin) | 0.375
tri(170,5,5,lin) | tri(175,5,5,lin) | 0.25
tri(170,5,5,lin) | tri(200,50,50,lin) | 0.635

Now we recall the definition of the operator FEQ as defined by Galindo et al. (2006).

Definition 12. Let A, B be two fuzzy sets. The relation FEQ(A, B) is defined as:

FEQ(A, B) = sup_{x∈X} min(μ_A(x), μ_B(x)).

Obviously, operator FEQ in FSQL is the equivalent of FQ in PFSQL. We compare these two operators in Table 9. The symbol "inc(tri(21,1,1,lin),(27,1,1,lin))" represents an inconsistent attribute value "either around 21 ± 1 or around 27 ± 1" with the membership function defined in Table 4. If we look at rows 1 and 2 of Table 9 more carefully, we see that:

FQ(tri(25,5,1,lin), tri(20,5,5,lin)) ≠ FQ(tri(25,5,25,lin), tri(20,5,5,lin)).

On the other hand, we have:

FEQ(tri(25,5,1,lin), tri(20,5,5,lin)) = FEQ(tri(25,5,25,lin), tri(20,5,5,lin)).

Considering the essence of the similarity relation, it is natural that the value "tri(25,5,1,lin)" is more similar to the value "tri(20,5,5,lin)" than the value "tri(25,5,25,lin)." In rows 3 and 4, we have a similar situation: the relation FQ reacts to the inconsistency of the data while FEQ does not. From this comparison, one might conclude that FQ reflects similarity better than FEQ. On the contrary, the most often used application of fuzzy logic, the fuzzy controller, uses the FEQ relation for calculating similarity degrees. It is our opinion that these two operators should be tested on large amounts of data in order to obtain a justifiable conclusion about the type of similarity they reflect.

As we did with the relation =_F, we can extend ≤_F in order to obtain the FLQ relation.

Definition 13. Let A, B be two fuzzy sets. The relation FLQ(A, B) is defined as in Exhibit D, where T is a t-norm, LTR_M(N) = LTR(N) \ LTR(N ∩ M), and, analogously, RTL_M(N) = RTL(N) \ RTL(N ∩ M).
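Definitions 10 to 12 can be checked numerically by sampling the domain. The sketch below approximates cardinalities by Riemann-style sums, so the degrees are approximate and this particular min-based discretization need not reproduce the chapter's tables exactly; all helper names are our own.

```python
# Numerical sketch of Definitions 10-12 on a sampled domain. Fuzzy sets
# are represented by membership functions; cardinalities are approximated
# by sampling, so the resulting degrees are approximate.

def tri(c, a, b):
    # tri(c,a,b,lin): peak at c, support (c-a, c+b)
    def mu(x):
        if c - a < x <= c:
            return (x - (c - a)) / a
        if c < x < c + b:
            return ((c + b) - x) / b
        return 0.0
    return mu

def grid(lo, hi, n=10000):
    step = (hi - lo) / n
    return [lo + i * step for i in range(n + 1)]

def fincl(mu_a, mu_b, xs):
    # FINCL(A,B) = card(A n B) / card(A), with T = min for intersection.
    inter = sum(min(mu_a(x), mu_b(x)) for x in xs)
    return inter / sum(mu_a(x) for x in xs)

def fq(mu_a, mu_b, xs):
    # FQ(A,B) = T_M(FINCL(A,B), FINCL(B,A))
    return min(fincl(mu_a, mu_b, xs), fincl(mu_b, mu_a, xs))

def feq(mu_a, mu_b, xs):
    # FEQ(A,B) = sup_x min(mu_A(x), mu_B(x))
    return max(min(mu_a(x), mu_b(x)) for x in xs)

xs = grid(150.0, 220.0)
narrow, shifted = tri(170.0, 5.0, 5.0), tri(175.0, 5.0, 5.0)
print(round(fincl(narrow, shifted, xs), 2))  # 0.25
print(round(fq(narrow, shifted, xs), 2))     # 0.25
print(round(feq(narrow, shifted, xs), 2))    # 0.5
```

The last two lines show the contrast discussed in the text: on this pair, the cardinality-based FQ is stricter (0.25) than the supremum-based FEQ (0.5).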
Table 9. FEQ and FQ comparison A tri(25,5,1,lin) tri(25,5,25,lin) tri(20,1,1,lin) tri(20,1,10,lin)
B tri(20,5,5,lin) tri(20,5,5,lin) inc(tri(21,1,1,lin),(27,1,1,lin)) inc(tri(21,1,1,lin),(27,1,1,lin))
FQ ( A, B)
FEQ ( A, B )
0.25 0.1 0.125 0.011364
0.5 0.5 0.5 0.5
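The contrast between FQ and FEQ is easy to reproduce numerically. The sketch below is ours, not the chapter's implementation; the discretised domain, grid size, and helper names are assumptions. It evaluates FINCL, FQ (with T = T_M), and FEQ for the triangular values of rows 1 and 2:

```python
import numpy as np

def tri(m, left, right):
    """Linear triangular fuzzy number tri(m, left, right, lin):
    peak at m, with left and right tolerances."""
    return lambda x: np.maximum(
        0.0, np.where(x <= m, 1 - (m - x) / left, 1 - (x - m) / right))

X = np.linspace(0, 60, 600_001)          # discretised domain
dx = X[1] - X[0]

def card(mu):                            # cardinality = area under mu
    return mu(X).sum() * dx

def fincl(a, b):                         # FINCL(A, B) = card(A ∩ B) / card(A)
    return (np.minimum(a(X), b(X)).sum() * dx) / card(a)

def fq(a, b):                            # FQ with T = T_M
    return min(fincl(a, b), fincl(b, a))

def feq(a, b):                           # FEQ(A, B) = sup min(mu_A, mu_B)
    return np.minimum(a(X), b(X)).max()

A1, A2, B = tri(25, 5, 1), tri(25, 5, 25), tri(20, 5, 5)
print(round(fq(A1, B), 3), round(fq(A2, B), 3))    # FQ tells A1 and A2 apart
print(round(feq(A1, B), 3), round(feq(A2, B), 3))  # FEQ does not: 0.5 0.5
```

The discretised FQ values agree with the closed-form computation up to grid error; FEQ returns 0.5 for both pairs, the height of the crossing point of the membership functions.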
Exhibit D. FLQ(A, B) = T(FINCL(LTR_A(B), LTR_B(A)), FINCL(RTL_B(A), RTL_A(B)))
Note that the use of LTR_M(N) and RTL_M(N) was necessary since card(LTR(M)) = card(RTL(M)) = ∞. Similarly to FLQ, the relations FL, FGQ, and FG are obtained by fuzzification of the relations <_F, ≥_F, and >_F, respectively. Table 10 lists the fuzzy operators in PFSQL.

Table 10. Fuzzy operators in FRDB
Operator | Definition
FINCL(A, B) | card(A ∩ B) / card(A)
FQ(A, B) | T(FINCL(A, B), FINCL(B, A))
FLQ(A, B) | T(FINCL(LTR_A(B), LTR_B(A)), FINCL(RTL_B(A), RTL_A(B)))
FL(A, B) | T(FLQ(A, B), N(FQ(A, B)))
FGQ(A, B) | T(FINCL(LTR_B(A), LTR_A(B)), FINCL(RTL_A(B), RTL_B(A)))
FG(A, B) | T(FGQ(A, B), N(FQ(A, B)))

In Galindo et al. (2006), the relations FLT, FLEQ, FGT, and FGEQ are given. The relation FLEQ (fuzzy less than or equal) is defined, but only on the set of linear trapezoidal fuzzy numbers. Let M = trap(M1, M2, M3, M4, lin) and N = trap(N1, N2, N3, N4, lin) be two linear trapezoidal fuzzy numbers. The relation FLEQ(M, N) is defined as:

FLEQ(M, N) = 1, if M2 ≤ N3;
FLEQ(M, N) = (N4 − M1) / ((M2 − M1) − (N3 − N4)), if M2 > N3 and M1 < N4;
FLEQ(M, N) = 0, otherwise.

In Table 11, some results for the relations FLQ and FLEQ are given. Generally, we can derive some conclusions on the relationship between FLQ and FLEQ. It is clear that if FLQ(A, B) = 1, then also FLEQ(A, B) = 1. Moreover, if FLQ(A(A1, A2, A3, A4), B(B1, B2, B3, B4)) > 0, then B3 ≥ A2, which implies FLEQ(A, B) = 1. Also, if FLEQ(A(A1, A2, A3, A4), B(B1, B2, B3, B4)) < 1, then FLQ(A, B) = 0. This leads us to the conclusion that for all A and B we have FLEQ(A, B) ≥ FLQ(A, B), making FLQ a stricter relation than FLEQ.
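As a sanity check, the trapezoidal FLEQ formula can be coded directly (our sketch; the function and tuple names are assumptions). It reproduces the FLEQ column of Table 11:

```python
def fleq(M, N):
    """FLEQ for linear trapezoids trap(a1, a2, a3, a4, lin):
    degree to which M is 'less than or equal to' N."""
    M1, M2, M3, M4 = M
    N1, N2, N3, N4 = N
    if M2 <= N3:                 # cores already in the right order
        return 1.0
    if M1 >= N4:                 # supports do not even overlap
        return 0.0
    # height of the crossing of M's rising edge and N's falling edge
    return (N4 - M1) / ((M2 - M1) - (N3 - N4))

A = (10, 20, 60, 80)
print(fleq(A, (30, 40, 50, 70)))              # 1.0
print(round(fleq(A, (11, 11.5, 12, 25)), 2))  # 0.65, as in Table 11
print(round(fleq(A, (11, 15, 19, 25)), 2))    # 0.94, as in Table 11
```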
Table 11. FLQ and FLEQ comparison
A | B | FLQ(A, B) | FLEQ(A, B)
trap(10,20,60,80,lin) | trap(30,40,50,70,lin) | 0.5 | 1
trap(10,20,60,80,lin) | trap(30,40,50,100,lin) | 1 | 1
trap(10,20,60,80,lin) | trap(11,11.5,12,25,lin) | 0 | 0.65
trap(10,20,60,80,lin) | trap(11,15,19,25,lin) | 0 | 0.94

Handling Specific Fuzzy Data Types in PFSQL

Although our model allows attribute values to be general fuzzy sets, we will point out how to handle some specific fuzzy data types in PFSQL. Interval values and linguistic labels are an integral part of any FRDB design, but they are not fuzzy in their essence, so a few modifications should be made in order to handle them properly. Linguistic labels are most often defined as fuzzy sets, and thus we should consider their meaning. Our opinion is that when an attribute value in the database is a linguistic label, it should be viewed as a possibility distribution. This is not a problem if the support of the set that represents the linguistic label is finite. On the other hand, when there are sets with infinite support, the membership function cannot accurately represent the possibility distribution of the actual attribute value, as is the case with fuzzy quantities. For example, a person that has the linguistic value "tall" for the height attribute is not likely to be 235 cm tall, yet μ_tall(235) = 1. This means that the fuzzy quantity that represents "tall" does not reflect the possibility distribution of the linguistic label. The distribution is found as a simple transformation of parameters:

Poss(fq(a, b, inc, lin)) → tri(a, b, b + c, lin),
Poss(fq(a, b, dec, lin)) → tri(a − c, a, b, lin).

We see that a triangular fuzzy number represents the possibility distribution of a fuzzy quantity. The parameter c represents the tolerance within the fuzzy quantity. It should be predetermined; we propose c = 2(b − a). However, when we
have linguistic labels as query data values, they should not be transformed, since they accurately reflect the essence of the query. Finally, intervals can be viewed as special possibility distributions (fuzzy sets):

μ_[a,b](x) = 1, if x ∈ [a, b], and 0 otherwise,

which makes calculations with them easy. In Table 12, we give some examples of similarity between intervals and fuzzy numbers. The introduction of a new fuzzy set type is only a question of the imagination and patience of the database administrator. We have already implemented triangular fuzzy numbers, fuzzy quantities, and intervals. In our future work, we hope to implement fuzzy sets that allow inconsistency in data values, that is, multitrapezoidal (triangular) fuzzy sets.
Generalized Prioritized Fuzzy Constraint Satisfaction Problem

Constraint satisfaction problems (CSP) have been developed over a long period of time. Basically, they deal with a set of constraints and the means to find a solution, that is, an evaluation of variables that satisfies all the constraints. Applications of CSPs are found mostly in scheduling problems, for example, bus/plane schedules, school timetables, and so forth. If we cannot precisely determine whether a constraint is satisfied, that is, if there can be many levels
Table 12. Similarity between intervals and fuzzy numbers
A | B | FQ(A, B)
[10,20] | tri(15,10,10,lin) | 0.75
[10,20] | tri(15,15,15,lin) | 0.834
tri(15,10,10,lin) | [10,20] | 0.75
tri(15,15,15,lin) | [10,20] | 0.667
of constraint satisfaction, we can expand CSP by allowing a constraint to have a satisfaction degree from the unit interval. We can model constraints as fuzzy sets over a particular domain. This leads to fuzzy constraint satisfaction problems (FCSP). Obviously, the degree of satisfaction of a constraint is the membership degree of its domain value on the fuzzy set that represents it. In order to obtain the global satisfaction degree, we need to aggregate the values of all the constraints. For the aggregation operator, we can use operators from fuzzy logic: t-norms, t-conorms, and fuzzy negation. Further on, we can add an importance degree to each of the constraints: besides a satisfaction degree, each constraint has its importance value, or priority. Many concepts of priority have been studied. In our chapter, we consider the priority of a constraint as its global importance among the other constraints. The more important the constraint is, the more impact it has on the aggregated output of the PFCSP system. If the value of a more important constraint is increased by some value δ > 0, then the aggregated output should be greater than in the case when the value of the less important constraint is increased by the same δ. To interpret this concept, we can view two constraints as two investments, where the constraint with the larger priority is the investment that results in a bigger profit margin. Obviously, if we have more money to invest, we will invest it in the better earning investment, that is, increase the value of the constraint with the larger priority, which results in a bigger profit margin. We are dealing with a very strict notion of priority, which favors the constraint with the larger priority regardless of its value. A class of systems that satisfy and implement this notion of priority is found. We also present systems that use the most common operators and satisfy this concept. A PFCSP makes decisions that depend not only on the satisfaction degree of each constraint (which is the case in FCSP), but also on the priority that each constraint has. PFCSP systems are introduced by an axiomatic framework. If we interpret the constraints as conditions in the WHERE line of a PFSQL query, PFCSP allows us to add a priority degree to each of the constraints, which eventually leads us to PFSQL. The problem is that PFCSP only deals with the conjunction of constraints; that is, we can permit only the AND operator in the WHERE line of the PFSQL query. Obviously, PFCSP needs to be generalized in order to handle disjunction and negation. The result is that PFCSP systems evolve into a GPFCSP that can handle priorities incorporated into each atomic formula. GPFCSP is the formal ground for PFSQL, a query language which allows different priorities of conditions in the WHERE line. First, we recall the definitions of FCSP and PFCSP.

Definition 14. A fuzzy constraint satisfaction problem (FCSP) is defined as a 3-tuple (X, D, C_f) where:

1. X = {x_i | i = 1, 2, ..., n} is a set of variables.
2. D = {d_i | i = 1, 2, ..., n} is a finite set of domains. Each domain d_i is a finite set containing the possible values for the corresponding variable x_i in X.
3. C_f is a set of fuzzy constraints, that is:

C_f = {C_i | μ_{C_i} : d_i → [0,1], i = 1, 2, ..., n}.
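To make Definition 14 concrete, here is a toy FCSP (our illustration; the variables, domains, and membership functions are invented), aggregated with the minimum t-norm that the text discusses next:

```python
# Toy FCSP: two variables over small finite domains, two fuzzy constraints,
# global satisfaction degree aggregated with the minimum t-norm T_M.
constraints = [
    lambda v: max(0.0, 1 - abs(v["x1"] - 3) / 2),  # "x1 is close to 3"
    lambda v: 1.0 if v["x2"] > v["x1"] else 0.5,   # soft "x2 greater than x1"
]

def alpha(v):                                       # global satisfaction degree
    return min(c(v) for c in constraints)

labels = [{"x1": a, "x2": b} for a in range(1, 6) for b in range(1, 6)]
best = max(labels, key=alpha)
solutions = [v for v in labels if alpha(v) >= 0.5]  # solution threshold 0.5
print(best, round(alpha(best), 2))  # {'x1': 3, 'x2': 4} 1.0
```

Every compound label gets a degree in [0,1]; the solution set is simply the labels whose degree reaches the chosen threshold.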
In an FCSP (X, D, C_f), given a compound label v_X of all the variables in X, the global satisfaction degree for the compound label v_X is defined as:

α(v_X) = ⊕{μ_{C_i}(v_X) | C_i ∈ C_f},
where ⊕ is an aggregation operator on the unit interval. A solution of the FCSP is a compound label v_X such that:

α(v_X) ≥ α_0,

where α_0 is called a solution threshold, usually predetermined. The main question is what type of operator can be used for ⊕. If all constraints in the FCSP have to be satisfied, a t-norm (most often T_M) is used for ⊕ because it acts as the conjunction operator in fuzzy logic. If at least one of the constraints has to be satisfied, a t-conorm is used for similar reasons. If T_M is used for ⊕, then we have:

α(v_X) = min{μ_{C_i}(v_X) | C_i ∈ C_f}.

If we want to determine whether evaluation v_X1 or v_X2 is a better solution for our FCSP, an ordering (≽) on the set of compound labels has to be introduced. One of the obvious orderings is the use of the global satisfaction degree, that is:

v_X1 ≽ v_X2 ⇔ α(v_X1) ≥ α(v_X2).

The concept of priority is introduced into FCSP within an axiomatic framework. More detailed explanations can be found in Luo, Lee, Leung, and Jennings (2003) and Takači (2005).

Definition 15. Let (X, D, C_f, ρ) be given, where X, D, C_f are defined as in Definition 14, ρ : C_f → [0, ∞) is a priority function, v_X is a compound label of all the variables in X, ⊕ : [0,1]^n → [0,1], g : [0, ∞) × [0,1] → [0,1], and the satisfaction degree α_ρ(v_X) is calculated as:

α_ρ(v_X) = ⊕{g(ρ(C_i), μ_{C_i}(v_X)) | C_i ∈ C_f}.

This system is a PFCSP if the following axioms are satisfied:

1. If for the fuzzy constraint C_max we have ρ(C_max) = ρ_max = max{ρ(C_i) | C_i ∈ C_f}, then:

μ_{C_max}(v_X) = 0 ⇒ α_ρ(v_X) = 0.

2. If ∀C_i ∈ C_f, ρ(C_i) = ρ_0, then:

α_ρ(v_X) = T{μ_{C_i}(v_X) | C_i ∈ C_f},

where T is a t-norm.

3. For C_i, C_j ∈ C_f, assume ρ(C_i) ≥ ρ(C_j), δ > 0, and that there are two different compound labels v_X and v′_X such that:

a) μ_C(v_X) = μ_C(v′_X) for every C ≠ C_i, C ≠ C_j;
b) for the constraint C_i, we have μ_{C_i}(v_X) = μ_{C_i}(v′_X) + δ;
c) for the constraint C_j, we have μ_{C_j}(v′_X) = μ_{C_j}(v_X) + δ.

Then the following property holds: if g(ρ(C_i), μ_{C_i}(v_X)) ≤ g(ρ(C_j), μ_{C_j}(v_X)), then α_ρ(v_X) ≥ α_ρ(v′_X).

4. For two different compound labels v_X and v′_X such that ∀C ∈ C_f, μ_C(v_X) ≥ μ_C(v′_X), the following holds:

α_ρ(v_X) ≥ α_ρ(v′_X).

5. If there exists a compound label v_X such that ∀C ∈ C_f, μ_C(v_X) = 1, then α_ρ(v_X) = 1.

The function ρ represents the priority of each constraint: a greater value of ρ(C) means that the constraint C has a larger priority. On the other hand, the function g aggregates the priority of each constraint with the value of that constraint. These aggregated values are then aggregated by the operator ⊕, which results in the satisfaction degree of the valuation. We will now briefly explain each of the axioms. The first axiom states that, if the constraint with the maximum priority has a zero value of the local satisfaction degree, then the global satisfaction degree should also be zero. The second axiom states that, if all priorities are equal, the
PFCSP should become an FCSP. The third axiom is the most important one. It captures the notion of priority: if one constraint has a larger priority, an increase of the value of that constraint should result in a bigger increase of the global satisfaction degree than the same increase applied to a value with a smaller priority. The previous statement is best understood through the investment example. Suppose we have two investments and some money to invest. If the investments have a similar risk degree, it is obvious that we will invest in the higher earning investment. Note that, in this case, the investments represent the constraints, the money represents the increase of the values of the constraints, and the profit is the global satisfaction degree. The fourth axiom is the monotonicity property, and finally, the fifth is the upper boundary condition.

Now let us give two concrete PFCSP systems.

Theorem 1. Let X, D, C_f be defined as in Definition 14, ρ : C_f → [0, ∞), ρ_max = max{ρ(C) | C ∈ C_f}, ρ_norm(C_i) = ρ(C_i) / ρ_max for i = 1, 2, ..., n, and ⊕ = T_M. Then the following system is a PFCSP:

α_ρ(v_X) = T_M{g(ρ(C), μ_C(v_X)) | C ∈ C_f},

where g(ρ(C), v) = S_M(1 − ρ_norm(C), v) is the standard fuzzy implication. The following system is also a PFCSP:

α_ρ(v_X) = T_L{g_P(ρ(C), μ_C(v_X)) | C ∈ C_f},

where g_P(ρ(C), v) = S_P(1 − ρ_norm(C), v). We will call the first system the min-max system and the second the T_L − S_P system. Both of them satisfy Axioms 1-5; the T_L − S_P system satisfies even a stricter version of Axiom 3, which leads to a stricter notion of priority. The detailed explanation and the proof that T_L − S_P satisfies Axioms 1-5 are given in Takači (2005). Similarly, the proof that min-max satisfies Axioms 1-5 is given in Luo, Lee, Leung, and Jennings (2003), but we do not focus on it here since it is not in the scope of this chapter.

We will now describe how a PFCSP works. With the normalization of the priority values, every priority gets a value in the unit interval, and one of the priorities has the value 1. Moreover, with the standard implication (S(1 − p, v), where S is a t-conorm), we aggregate the priority of each constraint with its value. This is done in a way that the larger the priority, the more chance the resulting value has to stay the same as it was before aggregation. If the priority of a constraint is small, then the aggregated value is closer to 1. This leads to greater values for constraints with smaller priorities, which makes sense: when these aggregated values are again aggregated with either T_M or T_L, the smaller values have more impact on the global satisfaction degree.

Let us give an example of a PFCSP. Assume that we have to hire a teaching assistant. We rank our candidates based on three variables, Age, testScore, and PhE (physical exam), with their corresponding domains.
Constraints C_i are given by fuzzy subsets of the domains d_i. Constraint C1 is a linear triangular fuzzy number whose center is at the point 24, with left and right tolerance 6.0. C2 is a linear fuzzy quantity starting at the point 8.0 and increasing up to the point 10.0. Similarly, C3 is a fuzzy quantity starting at the point 2.0 and increasing up to the point 3.0 (see Figure 5). The priorities of the constraints are ρ(C1) = 7, ρ(C2) = 10, ρ(C3) = 4, yielding that test results are most important and fitness is the least important. Each candidate is represented by an evaluation v_X which consists of the values of his Age, testScore, and PhE, respectively. Assume we have data available for four students, given in Table 13.

Figure 5. Membership functions representing constraints C1, C2, and C3

Table 13. Sample test data for testing GPFCSP
no. | Age | testScore | PhE
1 | 25 | 9.8 | 3
2 | 23 | 9.2 | 4
3 | 24 | 9.5 | 2
4 | 28 | 9.4 | 5

We will explain in detail how the value of α is calculated for student 1. First, we normalize the priorities of the constraints C1, C2, and C3 in order to obtain normalized priority values:

ρ_norm(C_i) = ρ(C_i) / ρ_max,

for each i = 1, 2, 3. In this case, ρ_max = ρ(C2) = 10. This leads to the following normalized priority values: ρ_norm(C1) = 0.7, ρ_norm(C2) = 1, and ρ_norm(C3) = 0.4. Next, the satisfaction degree for each condition (constraint) should be calculated. Since the data values of the constraints are all exact, the satisfaction degree for each constraint is obtained as the value of the membership function of that constraint at a particular point. Thus, for the first student, we obtain μ_C1(25) = 0.833333, μ_C2(9.8) = 0.9, and μ_C3(3) = 1 (see Figure 5). Now, we calculate the satisfaction degree for each student using both PFCSPs. The global satisfaction degree obtained using the min-max system is labeled α_i^MM, i = 1, 2, 3, 4, and when the T_L − S_P system is used, we have α_i^LP, i = 1, 2, 3, 4. We calculate the value of α_1^MM as shown in Exhibit E. Similarly, we calculate α_1^LP as shown in Exhibit F. The results for all four students, together with the satisfaction degrees of each constraint, are given in Table 14.

As mentioned earlier, the constraints can be interpreted as conditions in a PFSQL query. Each condition has a priority value assigned to it. Our example can be interpreted as the PFSQL query in Exhibit G, and the results, seen in the last two columns of Table 14, are the satisfaction degrees for each data row, that is, the CDEG value in FSQL. Due to the syntax of PFSQL given in the second section, the value of PRIORITY belongs to the unit interval. For this reason, in PFSQL, instead of the priorities ρ(C), we use the normalized priorities ρ_norm(C) from PFCSP.
Exhibit F.
α_1^LP = T_L(S_P(μ_C1(25), 1 − ρ_norm(C1)), S_P(μ_C2(9.8), 1 − ρ_norm(C2)), S_P(μ_C3(3), 1 − ρ_norm(C3)))
= T_L(S_P(0.83333, 0.3), S_P(0.9, 0), S_P(1, 0.6))
= T_L(0.83333 + 0.3 − 0.25, 0.9 + 0 − 0, 1 + 0.6 − 0.6)
= T_L(0.88333, 0.9, 1) = max(0.88333 + 0.9 + 1 − 2, 0) = 0.78333.

Table 14. The results for all four students
no. | Age | testScore | PhE | α_i^MM | α_i^LP
1 | 0.833 | 0.9 | 1 | 0.833 | 0.783
2 | 0.667 | 0.6 | 1 | 0.6 | 0.367
3 | 1 | 0.75 | 0 | 0.6 | 0.35
4 | 0.333 | 0.7 | 1 | 0.333 | 0.233
In FSQL, we can assign a threshold (THOLD) to each constraint. We will now point out the difference between threshold and priority in order to avoid any confusion. If there is a THOLD quantifier attached to a condition, FSQL automatically discards the data row which does not satisfy the condition with a given threshold. On the other hand, if the value of the PRIORITY exists, PFSQL calculates the satisfaction degree for each data row regardless of its satisfaction degree as it is seen in the previous example. Now, for the same data set, let us process the query in FSQL (see Exhibit H) where the approximate value #24 and the two linguistic labels are defined in Figure 5. Table 15 shows what would have happened if this query was processed by FSQL.
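The contrast between THOLD and PRIORITY can be sketched in a few lines (our illustration; the CDEG values and condition names are hypothetical, and the priority aggregation uses the min-max system):

```python
# THOLD discards rows; PRIORITY only reweights them.
rows = {1: {"age": 1.0, "score": 0.9, "phe": 1.0},   # hypothetical CDEG values
        2: {"age": 0.667, "score": 0.6, "phe": 1.0}}
tholds = {"age": 0.7, "score": 1.0, "phe": 0.4}

# FSQL-style THOLD: a row survives only if every condition meets its threshold.
surviving = [k for k, cdeg in rows.items()
             if all(cdeg[c] >= t for c, t in tholds.items())]
print(surviving)  # [] -- no row reaches score 1.0

# PFSQL-style PRIORITY (min-max system): every row keeps a satisfaction degree.
prios = {"age": 0.7, "score": 1.0, "phe": 0.4}
degrees = {k: min(max(1 - prios[c], cdeg[c]) for c in prios)
           for k, cdeg in rows.items()}
print(degrees)  # {1: 0.9, 2: 0.6}
```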
We see that none of the students satisfies the conditions of the query, as opposed to the query with priorities, where each student has a satisfaction degree higher than 0. This query would not return any record in the result set because none of the records satisfies the second condition. On the other hand, a threshold only removes tuples from the result set; it does not affect the CDEG function. To conclude, this query returns an empty result set, but if we ask for the values of the CDEG function, we get the results shown in the last column of Table 15.

Priorities in PFCSP are often confused with the concept of weights. We can define a WFCSP (weighted fuzzy constraint satisfaction problem) where each constraint C_i has an assigned weight w_i. The global satisfaction degree α_W(v_X) for a valuation v_X in a WFCSP is calculated by a known formula:

α_W(v_X) = T(c_1 · w_1, ..., c_n · w_n),

where c_i = μ_{C_i}(x_i) is the local satisfaction degree of the constraint C_i and T is a t-norm. In order to
Exhibit G.
WHERE Age = "around 24 years" PRIORITY 0.7
AND testScore = "excellent test results" PRIORITY 1
AND PhE = "good fitness" PRIORITY 0.4
Exhibit H.
SELECT * FROM students
WHERE Age FEQ #24 THOLD 0.7
AND testScore FEQ $Excellent_test_results THOLD 1
AND PhE FEQ $Good_fitness THOLD 0.4
Table 15. Results processed by FSQL (CDEG values for students 1-4)
have an adequate comparison between WFCSP and PFCSP, we take T_M and T_L. When T_M is used, we get the global satisfaction degree α_W^TM(v_X); analogously, when T_L is used, we get α_W^TL(v_X). When priorities are substituted with weights and T_M is used, the query in Exhibit I occurs in FSQL. When T_L is used, the SELECT line of the FSQL query is:

SELECT max(CDEG(age)*0.7 + CDEG(testScore)*1 + CDEG(PhE)*0.4 - 2, 0)

If we executed the two queries, the results in Table 16 would occur. The values for α_W^TL(v_X) and α_W^TM(v_X) differ completely from the values α^MM(v_X) and α^LP(v_X). Since a t-norm is used for the aggregation of local satisfaction degrees, the smallest values have the largest impact on the global satisfaction degree. The smaller the weight of a constraint C_i, the smaller the value of w_i · c_i. This leads us to the conclusion that the values of constraints with smaller weights have more impact on the global satisfaction degree, which is completely different from PFCSP, where values with greater priority have the biggest impact on the global satisfaction degree.

In order to achieve the formal ground for PFSQL, we need to expand the axiomatic framework given in the previous section. First, we expand PFCSP by introducing negation and disjunction.
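The weighted aggregation just described can be checked against Table 16 (a sketch in ordinary code rather than FSQL; the variable names are ours):

```python
# Weighted (WFCSP) aggregation for the four students, weights (0.7, 1, 0.4).
students = [            # (Age, testScore, PhE) degrees as in Table 16
    (1.0, 0.9, 1.0),
    (0.666667, 0.6, 1.0),
    (1.0, 0.75, 0.0),
    (0.333333, 0.7, 1.0),
]
weights = (0.7, 1.0, 0.4)

for mus in students:
    wc = [w * c for w, c in zip(weights, mus)]
    a_tm = min(wc)                                # T_M aggregation
    a_tl = max(sum(wc) - (len(wc) - 1), 0.0)      # T_L aggregation
    print(round(a_tm, 3), round(a_tl, 3))
# 0.4 0.0 / 0.4 0.0 / 0.0 0.0 / 0.233 0.0, matching Table 16
```

Note how the smallest weighted term always dominates, the opposite of the priority semantics.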
Generalized PFCSP is also defined by an axiomatic framework. For the conjunction, we can choose any t-norm; for the disjunction, we must choose the dual t-conorm, and the negation must be the standard one, as in any kind of fuzzy logic. Let us review the axiomatic framework. Axioms 1, 2, and 5 address only conjunctions of constraints, thus they do not have to be generalized. Axiom 3 must be slightly altered in order to handle disjunction. This is not a problem since the conjunction operators in fuzzy logic (t-norms) are dual to the disjunction operators (t-conorms). For more details on priority fuzzy logic, see Takači (2006). Axiom 4 should also be generalized in the sense that monotonicity should hold only when the negation connective is absent from the formula.

Theorem 2. The system (X, D, C_f, ρ, g, ∧, ∨, ¬, ∗), where ∧ = T_L, ∨ = S_L, ¬ = N_S, and finally p_i ∗ c_i = S_P(1 − p_i, c_i), is a GPFCSP. The global satisfaction degree of a valuation v_X for a formula F is obtained as:

α_F(v_X) = F{ρ_norm(C) ∗ μ_C(v_X) | C ∈ C_f^F},

where C_f^F is the set of constraints of the formula F, ρ_max = max{ρ(C) | C ∈ C_f^F}, ρ_norm(C) = ρ(C) / ρ_max, and F is the interpretation of the formula F in the GPFCSP. Also, when ∧ = T_M, ∨ = S_M, ¬ = N_S, and p_i ∗ c_i = S_M(1 − p_i, c_i), we obtain another GPFCSP.

In the example from the previous section, one can imagine that we are choosing students to
Exhibit I.
SELECT min(CDEG(age)*0.7, CDEG(testScore)*1, CDEG(PhE)*0.4)
FROM students
WHERE Age FEQ #24
AND testScore FEQ $Excellent_test_results
AND PhE FEQ $Good_fitness
Table 16. Calculation of WFCSP satisfaction degrees
no. | Age | testScore | PhE | α_W^TM(v_X) | α_W^TL(v_X)
1 | 1 | 0.9 | 1 | 0.4 | 0
2 | 0.666667 | 0.6 | 1 | 0.4 | 0
3 | 1 | 0.75 | 0 | 0 | 0
4 | 0.333333 | 0.7 | 1 | 0.233 | 0
Exhibit J. α^MM = T_M(S_M(S_M(μ_C1(v), 1 − ρ_norm(C1)), S_M(μ_C2(v), 1 − ρ_norm(C2))), S_M(μ_C3(v), 1 − ρ_norm(C3)))
Exhibit K. α^LP = T_L(S_L(S_P(μ_C1(v), 1 − ρ_norm(C1)), S_P(μ_C2(v), 1 − ρ_norm(C2))), S_P(μ_C3(v), 1 − ρ_norm(C3)))

promote the university. The criteria used to judge the candidates can be: "(student should have good appearance (PhE) PRIORITY 0.8 OR a good GPA PRIORITY 0.8) AND (should be around 24 years old PRIORITY 1)." If v represents a valuation for each student, then using the min-max GPFCSP the satisfaction degree is calculated as shown in Exhibit J. Analogously, when the T_L − S_P GPFCSP is used, see Exhibit K. The results for the students from the example given in the previous section are given in Table 17. This example can be interpreted as the query in Exhibit L.

Table 17. Query results
no. | α_i^MM | α_i^LP
1 | 0.9 | 1
2 | 0.666 | 1
3 | 0.6 | 0.6
4 | 0.7 | 1

Conclusion

In this chapter, we have presented three topics: the data model of FRDB, relational operators
on FRDB, and GPFCSP. A detailed description of the fuzzy logic enriched relational data model is given. This data model extends the relational model with capabilities to store fuzzy values and supports the execution of PFSQL queries. In addition, we give a comparison between this model and the more general FIRST-2 model. It is our conclusion that this model is a functional subset of the FIRST-2 model, although the methods for fuzzy value representation are different. Comparing the PFSQL query processing mechanism with existing solutions, we point out differences in this area as well. Our query processing mechanism is completely independent of the database implementation. It connects to the database using a JDBC driver, and provided this driver exists, any implementation can be used. Queries are processed in the middle tier between the client and the DBMS. In this way, we introduce the idea of placing the fuzzy query processing mechanism on some middleware component or an application server.

In order to evaluate the conditions in each query relation, relational operators that act on the data domain X must be introduced. Since the data domain is the set of all of its fuzzy subsets (denoted F(X)), relations must be introduced on F(X). We have opted for crisp and fuzzy relations. The crisp relations that act on F(X), subset (⊆_F) and equality (=_F), are well known in fuzzy set theory.
Exhibit L.
SELECT * FROM students
WHERE (Age = "around 24 years" PRIORITY 0.8
OR testScore = "excellent test results" PRIORITY 0.8)
AND (PhE = "good fitness" PRIORITY 1)
The ordering ≤_F is the generalization of the classical ordering ≤, meaning that the further the membership function lies to the right of the graph, the "greater" the fuzzy set. A complete theoretical background of ≤_F is presented. From this ordering, using well known tautologies, the relations ≥_F, <_F, and >_F are derived. Fuzzy relations on F(X) assign a value from the unit interval to each tuple of the domain. First, we recall the well known relation of fuzzy inclusion as a fuzzification of the ⊆_F relation. Using a tautology from classical logic with fuzzy logic connectives (∧ interpreted as a t-norm), we derive the similarity relation on the data domain, FQ. If we use T_M as the conjunction operator, the calculation of FQ is very simple. The more similar two values are, the higher the value of FQ; also, FQ(A, B) = 1 iff A =_F B. The equivalent of the FQ relation in many FRDB systems is the FEQ relation. The relations FQ and FEQ are essentially different, since FQ depends on the mutual fuzzy inclusion (FINCL) of the operands, whereas FEQ is determined by the maximum of the overlapping values of the membership functions. In some future FRDB systems, both similarity relations should be incorporated. Moreover, a detailed analysis of both relations needs to be done in order to clarify the difference between them, which will result in guidelines on which relation to choose as the similarity relation on different domains. This is one of our future goals.
Similarly, the relation FLQ (fuzzy less than or equal) is obtained as the fuzzification of the ≤_F ordering. In the definition of FLQ, ⊆_F is replaced with FINCL and ∧ is replaced with the t-norm T_M. The operator FLQ has equivalent operators used in other FRDB systems, FLEQ and NFLEQ. The main advantage of FLQ is that it can be calculated for any two fuzzy sets, whereas FLEQ and NFLEQ are defined only for trapezoidal ones. We have shown that FLQ is stricter than FLEQ, and the same conclusion can be derived for NFLEQ. Let us emphasize that there is still work to be done on the comparison of these relations. Moreover, when the relations ≥_F, <_F, and >_F are fuzzified, one obtains the relations FGQ, FL, and FG. As mentioned earlier, the FLQ, FEQ, FGQ, FL, and FG relations can be calculated for any two fuzzy sets. Another argument for incorporating these relations into future FRDB systems is that, for certain types of fuzzy sets, the calculation of their values can be optimized to the same computational complexity as the relations previously used in FRDB.

Adding priorities to conditions in queries and implementing them in PFSQL is one of the main contributions of this chapter. The concept of priority presented here is essentially different from the concepts of weighted queries and thresholds, which can be seen through many examples; a typical example is presented here. GPFCSP systems represent the theoretical background for priority querying. We have presented in detail the build-up towards GPFCSP. Starting from CSP and fuzzifying the constraints, we obtain FCSP. In FCSP, priorities are added in order to reach PFCSP systems, which are introduced axiomatically. The priority of a constraint is interpreted as its importance among the other constraints. PFCSP systems favor higher satisfaction of all constraints in order to obtain a better solution. They are interpreted as a fuzzy conjunction of fuzzy constraints. PFCSP can be used in database querying, but prioritized conditions can only be used
with an AND operator acting on them. We have presented two different PFCSP systems, depending on the operators chosen: the T_L − S_P and the min-max system. In order to expand the concept of priority querying to general logical formulas, GPFCSP systems are introduced. The axioms for GPFCSP systems are generalized PFCSP axioms. If we interpret the constraints as conditions in a PFSQL query and assign priorities to them, the algorithms of GPFCSP systems are directly incorporated into PFSQL. This makes PFSQL a suitable language for handling queries with prioritized conditions, since GPFCSP is an axiomatically defined system.

In the future, we hope to test our model and PFSQL on some real database systems. This will give us a better perspective on the implementation problems that we addressed here and hopefully lead to some sort of a solution. We have a working version of our model, which can be downloaded from our Web page: http://www.is.im.ns.ac.yu/fuzzydb. However, as is the case for most FRDB systems, the performance is not satisfactory; that is, the system is very time consuming. One of our goals will be to optimize the performance of the system without losing much capability. We conclude that PFSQL is a useful add-on for querying FRDB. It can be used for decision making wherever priority is needed, and since the addition of PFSQL to an FRDB system is possible, we hope it will be used in many database systems.
Acknowledgment

The authors would like to thank the editor and the anonymous referees for useful remarks and comments that enhanced the quality of this chapter. The authors would also like to acknowledge the support of the Serbian Ministry of Science and Environmental Protection, project "Mathematical Models of Non-linearity, Uncertainty and Decision Making," No. 144012, and project "Abstract Methods and Applications in Computer Science," No. 144017A, as well as the support of the Ministry of Science, Technology and Environmental Protection of Vojvodina.

References

Bodenhofer, U. (1998). A similarity-based generalization of fuzzy orderings. Doctoral thesis, Johannes Kepler University, Linz, Austria.

Buckles, B. P., & Petry, F. E. (1982). A fuzzy representation of data for relational databases. Fuzzy Sets and Systems, 7, 213-226.

Chaudhry, N., Moyne, J., & Rundensteiner, E. (1994). A design methodology for databases with uncertain data. In Proceedings of the 7th International Working Conference on Scientific and Statistical Database Management (pp. 32-41). Charlottesville, VA: IEEE Computer Society.

Chen, G. Q., & Kerre, E. E. (1998). Extending ER/EER concepts towards fuzzy conceptual data modeling. In Proceedings of the IEEE International Conference on Fuzzy Systems (pp. 1350-1325). Beijing: Tsinghua University.

Dubois, D., Fargier, H., & Prade, H. (1994). Possibility theory in constraint satisfaction problems: Handling priority, preference and uncertainty. In R. Yager & L. Zadeh (Eds.), Fuzzy sets, neural networks and soft computing (pp. 166-187). London: Thomson Learning.

Dubois, D., & Fortemps, P. (1999). Computing improved optimal solutions to max-min flexible constraint satisfaction problems. European Journal of Operational Research, 118, 95-126.

Dubois, D., Kerre, E., Mesiar, R., & Prade, H. (2000). Fuzzy interval analysis. In D. Dubois & H. Prade (Eds.), Fundamentals of fuzzy sets: The handbooks of fuzzy sets series (Vol. 7, pp. 483-582). Kluwer Academic Publishers.

Fargier, H., & Lang, J. (1993). Uncertainty in constraint satisfaction problems: A probabilistic approach. In Proceedings of the European Conference on Symbolic and Qualitative Approaches to Reasoning and Uncertainty (LNCS 747, pp. 97-104). Berlin: Springer.

Galindo, J., Medina, J. M., & Aranda, M. C. (1999). Querying fuzzy relational databases through fuzzy domain calculus. International Journal of Intelligent Systems, 14(4), 375-411.

Galindo, J., Medina, J. M., Cubero, J. C., & Garcia, M. T. (2001). Relaxing the universal quantifier of the division in fuzzy relational databases. International Journal of Intelligent Systems, 16(6), 713-742.

Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group.

Gebhart, A. (1995). A note on t-norm based addition of fuzzy intervals. Fuzzy Sets and Systems, 75, 73-76.

Kerre, E. E., & Chen, G. Q. (2000). Fuzzy data modeling at a conceptual level: Extending ER/EER concepts. In O. Pons (Ed.), Knowledge management in fuzzy databases (pp. 3-11). Heidelberg: Physica-Verlag.

Klement, E., Mesiar, R., & Pap, E. (2000). Triangular norms (Trends in Logic Series 8). Dordrecht: Kluwer Academic Publishers.

Takači, A. (2006). Handling priority within a database scenario. ETF Journal of Electrical Engineering, 17, 130-134.

Takači, A., & Škrbić, S. (2005). How to implement FSQL and priority queries. In Proceedings of the 3rd Serbian-Hungarian Joint Symposium on Intelligent Systems, Subotica, Serbia (pp. 261-267). Budapest Tech Polytechnical Institution.

Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-27.

Zadeh, L. A. (1965). Fuzzy sets. Information Control, 8, 338-353.

Zvieli, A., & Chen, P. (1986). ER modeling and fuzzy databases. In Proceedings of the 2nd International Conference on Data Engineering (pp. 320-327). Los Angeles: IEEE Computer Society.
Key Terms
Luo, X., Jennings, N. R., Shadbolt, N., Leung, H., & Lee, J. H. (2003). A fuzzy constraint based model for bilateral multi-isssue negotiations in semi competitive enviroments. Artificial Intelligence, 148, 53-102.
FCSP (Fuzzy Constraint Satisfaction Problem): An expansion of CSP (Constraint Satisfaction Problem) by allowing a constraint to have a satisfaction degree from the unit interval, that is, allowing many levels of constraint satisfaction. Constraints can be modeled as fuzzy sets over a particular domain and the degree of satisfaction of a constraint is the membership degree of its domain value on the fuzzy set that represents it.
Luo, X., Lee, J. H., Leung, H., & Jennings, N. R. (2003). Prioritized fuzzy constraint satisfaction problems: Axioms, instantiation and validation. Fuzzy Sets and Systems, 136, 151-188.
FRDB (Fuzzy Relational Database): A relational database with model extended by the mechanism that can handle imprecise, uncertain, and inconsistent attribute values using fuzzy logic.
Medina, J. M., Pons, O., & Vila, M. A. (1994). GEFRED: A generalized model of fuzzy relational databases. Information Sciences, 76, 87-109.
Fuzzy Quantity: Normalized fuzzy set with either increasing or decreasing membership function; that is, their kernel is bounded from either the left or the right, and unbounded from the other side. Linear fuzzy quantities are used to describe notions like “tall people,” “small salary,” and so forth.
Lowen, R. (1996). Fuzzy set theory. Kluwer Academic Publishers.
Takači, A. (2005). Schur-concave triangular norms: Characterization and application in PFCSP. Fuzzy Sets and Systems, 155(1), 50-64.
433
Data Model of FRDB with Different Data Types and PFSQL
GPFCSP (Generalized Priority Fuzzy Constraint Satisfaction Problem): An expansion of the PFCSP introduced in order to achieve the formal ground for PFSQL. PFCSP axiomatic framework is expanded by introducing negation and disjunction. For the conjunction, a t-norm and for disjunction the dual of the t-norm is chosen. Negation has to be a standard one. Linguistic Label: Linguistic labels are named fuzzy values from the domain. They are used to represent most common and widely used expressions of a natural language (such as “tall people,” “small salary,” or “mediocre result”).
434
PFCSP (Priority Fuzzy Constraint Satisfaction Problem): An expansion of FCSP by adding an importance degree to each of the constraints. In that manner, besides a satisfaction degree, each constraint has its importance value, its priority. They make decisions that depend not only on the satisfaction degree of each constraint (which is the case in FCSP), but also on the priority that each constraint has and are introduced by an axiomatic framework. Priority Fuzzy SQL (PFSQL): A variation of SQL language extended with fuzzy capabilities with an option to assign priorities to query conditions.
435
Chapter XVII
Towards a Fuzzy Object-Relational Database Model

Carlos D. Barranco, Pablo de Olavide University, Spain
Jesús R. Campaña, University of Granada, Spain
Juan M. Medina, University of Granada, Spain
Abstract

This chapter introduces a fuzzy object-relational database model including fuzzy extensions of the basic object-relational database constructs: the user-defined data types and the collection types. The fuzzy extensions of these constructs focus on two main flexible aspects: a way to flexibly compare complex data types, and an extension of collection types allowing partial membership of their elements. Collection operators are also adapted to consider flexibly comparable domains for their elements. Such a fuzzy object-relational database model, and its implementation in a fuzzy object-relational database management system, provides an easy and effective way to manage large amounts of complex fuzzy data in object-relational databases for emerging fuzzy applications. As a sample of the proposal's advantages, an application for dominant-color-based image retrieval, built on an object-relational database management system implementing the proposed fuzzy database model, is introduced.
Introduction

The introduction of fuzzy set theory by Zadeh (1965) has provided the database community with a very useful tool for representing imprecise, uncertain, and inapplicable data. This approach eases and makes flexible the way in which real-world data can be represented and managed in databases. Fuzzy databases are databases able to represent and retrieve fuzzy data; some of them are also able to process flexible queries including weakly defined conditions. Fuzzy databases are a very convenient data storage and retrieval system for dealing with classical and nonclassical problems in which real-world data, user perception, and/or natural language concepts and descriptions are involved.

Research in the field of fuzzy databases has led to a significant number of fuzzy database models. The aim of these models has always been to extend the current, widespread, and accepted database models in order to make them suitable for fuzzy data storage and retrieval. Fuzzy models have evolved along with the conventional, or crisp, database models to answer data processing needs. During the apogee of the relational model, several fuzzy relational database models appeared. When the object database model appeared, addressing some of the shortcomings of the relational one, fuzzy database research focused on fuzzy object databases. In recent years, a new kind of database, the object-relational database, has progressively broken through into the database mainstream. This kind of database was conceived to overcome the shortcomings of the relational model by enriching it with some features of object databases. Object-relational databases aim to merge the good qualities of both the well-known relational model and the object-oriented database paradigm, while neutralizing their drawbacks. Currently, object-relational databases are well accepted by database professionals and manufacturers, and their most important concepts are gradually being incorporated into recent SQL standards. As object-relational databases are gradually conquering the database market, it seems natural to work on their extension to allow fuzzy data storage and retrieval. In fact, this kind of database is very suitable for such an extension, as one of its most important features is its extensibility. This chapter proposes a model that extends object-relational databases for fuzzy data representation and querying.
This model makes the database able to represent a number of new types of fuzzy data derived from the extension of the basic object-relational database constructs: the user-defined complex data types and the multivalued attributes. Additionally, the model supports all fuzzy data types that were considered in earlier models.
The chapter is organized as follows. First, an introduction to object-relational databases is given. Second, a brief background on fuzzy databases is presented. Afterwards, the proposed fuzzy object-relational database model is depicted. Next, the basis of an implementation of a fuzzy object-relational database management system (FORDBMS), based on the proposed model, is described. Then, an application of the introduced FORDBMS and some examples of its queries are given. Finally, some concluding remarks and future research directions are proposed.
Object-Relational Databases

Even though the relational model is nowadays the most commonly used model in database theory and practice, as computer applications started to manage large amounts of complex data, it was noticed that this database model is not very suitable for managing complex data. The relational model faces special difficulties in managing complex data resulting from the composition of other data elements, which is very common in computer-aided design, geographical information systems, and multimedia applications. A relational database stores the data elements of an entity in a relation and relates a complex data element with its components by foreign keys. When the application requires retrieving a complex data element, a number of join operations have to be performed to gather all its atomic components, which leads to severe performance reductions. The aim to seamlessly represent and manage complex data in databases led the database community to propose the concept of the object-oriented DBMS (OODBMS). An OODBMS represents data elements as objects. The objects are uniquely identified by an object identifier, or oid, which substitutes the primary key concept of the relational model. Data are modeled using classes of objects and following the object-oriented principles: encapsulation, inheritance, and polymorphism. Moreover, the procedures related to the manipulation of the data,
which are named methods in the object-oriented context, are bound to the data objects and stored in the database, which meant a great paradigm shift in data modeling conception. Despite the fact that the OODBMS model has more powerful modeling capabilities than the relational model and is more appropriate for representing complex data, this kind of DBMS suffers from low performance due to complex, or sometimes practically impossible, query optimization, and from an inability to support large-scale systems. Moreover, there was a lack of a formal model and of standard Data Definition Language (DDL) and Data Manipulation Language (DML) until the publication by the Object Data Management Group (ODMG) of the ODMG report 1.0 in 1993. The latest revision of the ODMG report is release 3.0 (Cattell & Barry, 2000), which was published shortly before the group was disbanded. OODBMSs were not very well received by the DBMS market for two main reasons. First, they were unsuccessful for technical reasons: while trying to solve the problems observed in complex data handling in relational DBMSs, they failed at the same point, low performance. Second, they were not accepted for economic reasons: the transition from existing systems relying on relational DBMSs (RDBMSs) to OODBMSs required new investments for training personnel (programmers had to learn new DDL and DML languages that were proprietary in most cases), for redefining the database schemas, and for testing and tuning the database software again. In the mid-1990s, a new kind of DBMS, the object-relational DBMS (ORDBMS) (Stonebraker & Moore, 1996), was proposed. This new concept was conceived to gather the benefits of relational and object-oriented DBMSs without suffering from the drawbacks of these models. The aim of the ORDBMS proposal was to enhance RDBMSs to allow richer data type support, so that complex data types, along with their methods, could be seamlessly managed.
All of this is accomplished without losing the desirable features of traditional RDBMSs and while maintaining compatibility with legacy systems.
User-Defined Data Types

The ORDBMS concept was first partially integrated into SQL standards in the SQL:1999 (Eisenberg & Melton, 1999) standard revision. This standard mandates that a table column can be of a User-Defined Data Type (UDT). A UDT is a structure of named fields, along with a set of associated methods. The data type of a field can be any of the built-in data types or a UDT. This makes very complex UDT definitions possible. The standard includes some object-oriented features, such as inheritance and polymorphism, to ease the definition of UDTs and to empower their usage flexibility. For instance, a UDT for gathering data describing current weather conditions could be defined. This UDT, which can be named WEATHER, can be composed of the following fields:

• Issuer: This field contains the name of the organization that publishes the weather conditions, as a VARCHAR data type.
• Temperature: This field contains, as a value of the NUMBER data type, the current temperature.
• Cloudiness: This field describes the current cloud coverage as a value from the set {clear, sunny, cloudy, overcast}.
• Precipitation: This field describes the current precipitation conditions as a value of the UDT PRECIPITATION, or a null value in case of no precipitation. The UDT PRECIPITATION is composed of the following fields:
  • Intensity: This field describes the intensity of the phenomenon as a value from the set {light, moderate, heavy}.
  • Kind: This field describes the kind of precipitation. The value for this field is taken from the set {rain, hail, snow}.
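The nesting of these UDTs can be sketched outside SQL, for instance with Python dataclasses; this is only an illustration of the structure described above, and the Python names and types are assumptions, not part of the SQL standard:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Precipitation:
    intensity: str  # one of {"light", "moderate", "heavy"}
    kind: str       # one of {"rain", "hail", "snow"}

@dataclass
class Weather:
    issuer: str          # organization publishing the report (VARCHAR)
    temperature: float   # current temperature (NUMBER)
    cloudiness: str      # one of {"clear", "sunny", "cloudy", "overcast"}
    # None plays the role of the null value used when there is no precipitation
    precipitation: Optional[Precipitation] = None

# A sample value of the WEATHER type with nested PRECIPITATION data
report = Weather("MetOffice", 12.5, "overcast", Precipitation("moderate", "rain"))
```

Note how a field of one UDT (Precipitation) is embedded in another (Weather), mirroring the way a UDT field may itself be of a UDT.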
Collection Data Types

A collection data type is a user-defined data type able to hold multiple values of another type, which is called the base type. SQL:1999 introduced the first collection type, an array of values where each element is accessed by a numerical index. This data type construct is very limited, as each array type has a fixed maximum cardinality and the standard does not allow the definition of an array data type of another array data type. Collection types are generally known as nested tables, but arrays do not behave like real tables. Later on, a new revision of the SQL standard was published as SQL:2003 (Eisenberg, Melton, Kulkarni, Michels, & Zemke, 2004). This revision maintains UDTs and arrays, and includes a new complex data type called multiset. This data type is able to contain an unordered collection of values, all of them of the same data type, including UDTs. It is possible to define a multiset of multisets, and a multiset does not have a fixed cardinality limit. This new type construct fits the nested table concept, so it has become the collection type of reference, relegating arrays to very particular applications. For instance, a collection type of the WEATHER UDT defined earlier can be used to store in a field the current weather conditions of a city as issued by different organizations. An example table with a column of this type is shown in Table 1.
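The semantics that distinguish a multiset from an array (unordered, duplicates allowed, no fixed cardinality limit) can be illustrated with a small sketch; Python's collections.Counter is used here merely as a stand-in for an SQL multiset, which is an analogy rather than SQL syntax:

```python
from collections import Counter

# Two multisets are equal iff they contain the same elements with the
# same multiplicities, regardless of the order in which they were built.
reports_a = Counter(["sunny", "sunny", "cloudy"])
reports_b = Counter(["cloudy", "sunny", "sunny"])
assert reports_a == reports_b  # order is irrelevant, duplicates count

# Unlike an SQL:1999 array, there is no fixed maximum cardinality:
# elements can keep being added without declaring a limit up front.
reports_a.update(["overcast"] * 1000)
assert reports_a["overcast"] == 1000
```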
Table 1. Weather conditions of world cities (first column: City, of type VARCHAR)

Background on Fuzzy Databases

A fuzzy database (Galindo, Urrutia, & Piattini, 2006; Petry, 1996) can be defined as a database containing imperfect data, which is generally modeled as fuzzy sets. The term imperfect data encompasses data that are uncertain, imprecise, vague, or inapplicable. The work (Bosc & Prade, 1996) includes an excellent discussion of the meaning of the characteristics of imperfect data, and we refer the reader to it for more details and references. The following, which is based on and extends that discussion, is a brief definition of the characteristics of imperfect data:

• Uncertain data: Data that are not totally trustworthy, for which an estimation of their reliability is available. For instance, if we ask the age of Prudence of a neighbor, a co-worker, a friend, and her mother, and each one replies with a different value, we can assign a reliability degree according to the strength of the relationship between Prudence and the asked person.
• Imprecise data: The data are not available at their finest granularity; we have an approximation but not a precise value. In this case, a set or a range of values could be available, among which the actual precise value is unknown. For instance, we know that Prudence is between 25 and 28 years old, but we cannot give an exact age for her.
• Vague data: This kind of data is defined by a gradual predicate. Usually, vague data correspond to linguistic terms of natural language. For instance, we could know that Prudence is middle-aged. The linguistic term middle-aged corresponds to a gradual predicate that is completely incompatible with age values below 30 and above 60, and completely compatible with ages between 35 and 55. The predicate is partially compatible with ages between 30 and 35, and with ages between 55 and 60, where the compatibility degree gradually ascends in the former case and gradually descends in the latter.
• Inapplicable data: There may be some entities for which a piece of data relating to one of their properties cannot be acquired due to a lack of the property. For instance, if Prudence is not married, the data related to her spouse are inapplicable.
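The middle-aged predicate described above can be written as a trapezoidal compatibility function. The sketch below follows exactly the figures given in the text (fully incompatible below 30 and above 60, fully compatible between 35 and 55, linear transitions in between); the function name is an illustrative assumption:

```python
def middle_aged(age: float) -> float:
    """Compatibility degree of an age with the vague term 'middle-aged'."""
    if age <= 30 or age >= 60:
        return 0.0               # completely incompatible
    if 35 <= age <= 55:
        return 1.0               # completely compatible
    if age < 35:
        return (age - 30) / 5    # ascending slope between 30 and 35
    return (60 - age) / 5        # descending slope between 55 and 60

# A crisp age is mapped to a degree of compatibility with the predicate
print(middle_aged(45))    # 1.0
print(middle_aged(32.5))  # 0.5
```

This is the sense in which a gradual predicate generalizes a Boolean one: instead of true/false, it returns a degree in [0, 1].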
Fuzzy logic and fuzzy set theory have proven their great ability to model this kind of data. Even though the great majority of proposals for modeling and managing imperfect data take advantage of the fuzzy paradigm, there are some early non-fuzzy proposals. The proposal in (Codd, 1979), which introduces the idea of null values for representing unknown or inapplicable data, is probably the best known approach. Other widely known non-fuzzy approaches make use of statistical inference (Wong, 1982) and probability distributions (Barbara, Garcia-Molina, & Porter, 1992).
Fuzzy Relational Databases

Most of the initial fuzzy database proposals aim to extend the relational model in order to make it suitable for handling imperfect data. These approaches take advantage of the convenient modeling capabilities of fuzzy sets and fuzzy logic in several ways, by generalizing different aspects of the relational model to make them more flexible:

• Relations including partially belonging tuples: This is the basic fuzzy database model. In this model, the set of tuples of a relation is replaced by a fuzzy set. Each tuple has a membership degree to its relation, so the tuple can partially belong to it. This idea has been adopted in (Mouaddib, 1993). If a flexible relational operator is included in a query on vague data, the query result could contain tuples partially satisfying the query condition. The degree to which a tuple satisfies the query conditions can naturally be used to measure its membership to the result set. This approach has been included in many studies (Buckles & Petry, 1982; Prade, 1984; Umano & Fukami, 1994; Zemankova & Kandel, 1984). In this book, the reader can find a chapter by Belohlavek and Vychodil studying this kind of ranked tables and data dependencies. An extended relational algebra for operating with this kind of fuzzy relations is proposed in (Umano & Fukami, 1994). In Bosc and Pivert (1995), relation-oriented operators for nested queries are extended so they can be employed on fuzzy relations.
• Imprecise attribute values: The proposal (Buckles & Petry, 1982) introduces a way to model imprecise attribute values. Every attribute value can be a subset of the attribute domain. The subset has a disjunctive meaning; thus only one of the subset members is the actual value of the attribute.
• Vague attribute values: Once again, the previous proposal makes it possible to model vague data by taking advantage of fuzzy numbers to represent vague numerical data. Furthermore, possibility distributions defined on basic attributes are used to model arbitrary vague and imprecise data in the works (Prade, 1984; Tahani, 1977; Umano & Fukami, 1994).
The first one also studies the definition of flexible queries using vague predicates in a relational database. The representation of unknown attribute values is made possible by means of a possibility distribution in which every domain element is totally possible. This possibility distribution is defined as shown in Equation 1, where D is the attribute domain.

π = {1/d : d ∈ D}  (1)

• Inapplicable attribute values: The proposals (Prade, 1984; Umano & Fukami, 1994) also include approaches for inapplicable data handling. The former introduces a special value, denoted as e, which is added to each attribute domain. The e value represents the inapplicability of the attribute and can be part of possibility distributions representing an attribute value. The inclusion of this special value in possibility distributions makes possible the definition of attribute values that are partially inapplicable. Such a possibility distribution takes the general form shown in Equation 2. The representation of total ignorance about the applicability of an attribute and about the value of the attribute (in case of applicability) is also possible; this case is described by the possibility distribution shown in Equation 3. The latter proposal studies these extreme cases more deeply and introduces the special values unknown (unknown attribute value), undefined (inapplicable attribute), and null (total ignorance) to represent them.

π = {μ_e/e, μ_1/d_1, …, μ_n/d_n}, d_i ∈ D  (2)

π = {1/d : d ∈ D ∪ {e}}  (3)
• Flexible equivalence relations: In Buckles and Petry (1982), the idea of using similarity relations (Zadeh, 1971) to soften the classical equivalence relations in databases is introduced. Every query must specify a similarity threshold which determines the indistinguishability of domain members for that query. A similar proposal, based on proximity relations, was introduced in Shenoi and Melton (1989). When a database includes imprecise and vague attribute values that are modeled using possibility distributions, the equivalence operator is substituted by possibility and necessity measure based equality operators, as proposed in Prade (1984).
• Flexible relational operators: The proposal (Zemankova & Kandel, 1984) introduces the idea of fuzzy relational operators. These fuzzy extensions of the classical relational operators make flexible the order relation between possibility distributions of numbers and scalars. Later proposals, such as Medina, Pons, and Vila (1994), adapt the concept of fuzzy relational operators for fuzzy numbers. These adapted operators soften the classical order relations on numbers, whose result can only be true or false, by returning a degree to which the compared fuzzy numbers are related. Another chapter of this book includes a review of flexible querying written by Kacprzyk, Zadrożny, de Tré, and de Caluwe.
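For discrete possibility distributions, the possibility-based equality operator mentioned above can be sketched as a sup-min computation, Π(A = B) = sup_d min(π_A(d), π_B(d)). The dict-based representation below is an assumption made for illustration, not the notation of any of the cited proposals:

```python
def poss_eq(pi_a: dict, pi_b: dict) -> float:
    """Possibility degree that two imprecisely known values are equal:
    Pi(A = B) = sup over d of min(pi_a(d), pi_b(d))."""
    common = set(pi_a) & set(pi_b)
    # Elements outside a distribution's support have possibility 0,
    # so only the common support can contribute to the supremum.
    return max((min(pi_a[d], pi_b[d]) for d in common), default=0.0)

# 'about 26' compared with 'between 25 and 28' (Prudence's age example)
about_26 = {25: 0.5, 26: 1.0, 27: 0.5}
age_range = {25: 1.0, 26: 1.0, 27: 1.0, 28: 1.0}
print(poss_eq(about_26, age_range))  # 1.0
```

The operator returns a degree rather than a Boolean, which is exactly what flexible querying needs when ranking answers.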
Generalized Model for Fuzzy Relational Databases

The work (Medina et al., 1994) proposes a fuzzy relational database model, the generalized model of fuzzy relational databases (GEFRED), which aims to gather the main aspects of the relational model that had been made flexible in previous proposals. The integration of these proposals in the same framework results in a generalized model with a wide ability to represent and handle fuzzy data, based on a fuzzy extended relational database model. The model introduces the concept of generalized fuzzy domain. A generalized fuzzy domain D_G is defined as shown in Equation 4, where D is the basic domain which is being generalized, P̃(D) is the set of all possibility distributions that can be defined on D, and NULL is a special value meaning total ignorance in the sense of Umano and Fukami (1994), as described previously.

D_G ⊆ P̃(D) ∪ {NULL}  (4)
A wide variety of fuzzy data and, of course, nonfuzzy data can be represented by using a generalized fuzzy domain. In particular, any arbitrary possibility distribution can be represented, and hence imprecise, vague, unknown, and undefined values can be handled. For each generalized fuzzy domain, a set of linguistic labels, each one representing a fixed possibility distribution, can be defined to ease data representation. Every generalized fuzzy domain has at least the linguistic label UNKNOWN, defined as the possibility distribution representing unknown values, and UNDEFINED, corresponding to the possibility distribution used to represent inapplicable values. The special value NULL can be informally described, for illustrative purposes, by the hypothetical possibility distribution {1/UNKNOWN, 1/UNDEFINED}. This possibility distribution indicates that both cases, the value being unknown and the value being not applicable, are totally possible. A generalized fuzzy relation can be built on top of generalized fuzzy domains. The model defines a generalized fuzzy relation as a pair (H, B), where H is the head and B the body of the generalized relation. The head of a relation is defined as Equation 5 shows, where each attribute A_Gj has an associated fuzzy domain D_Gj (j = 1, 2, …, n), and CA_Gj is a compatibility attribute whose domain is [0,1]. The body of a relation is defined as Equation 6 shows, where i = 1, 2, …, m, m is the number of tuples belonging to the relation, d̃_ij ∈ D_Gj is the value of the j-th attribute of the i-th tuple, and c_ij is the compatibility degree of the j-th attribute of the i-th tuple. The square brackets in these formulae denote an optional element. The compatibility degrees are optional because they are not allowed in the base relations of the database, as the model does not consider partial membership for the tuples of base relations. The compatibility degrees are exclusively used in relations representing the result of a query, where their value represents the compatibility of the attribute value with respect to the query conditions. In order to manipulate the data in a fuzzy database, the model defines the generalized fuzzy relational algebra, with union, intersection, difference, Cartesian product, projection, and join operators extended for generalized fuzzy relations. The selection and join operations base their criteria on fuzzy compatibility measures for fuzzy data. Further work on the GEFRED model has addressed the fuzzy domain calculus (Galindo, Medina, & Aranda, 1999) and the division operator on fuzzy relations (Galindo, Medina, Cubero, & García, 2001).
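The distinction between base relations (no compatibility degrees) and query results (a degree attached to each attribute value) can be sketched as follows; the (value, degree) pair representation is an illustrative assumption, not GEFRED notation:

```python
# A generalized fuzzy relation sketched as a head plus a body of tuples.
# Each attribute value is paired with an optional compatibility degree:
# None in base relations, a degree in [0, 1] in query results.
head = ("name", "age")

# Base relation: compatibility degrees are absent (None)
base_body = [
    {"name": ("Prudence", None), "age": ("middle-aged", None)},
]

# Result of a hypothetical query such as "age around 40": each attribute
# value now carries its compatibility with the query condition.
result_body = [
    {"name": ("Prudence", 1.0), "age": ("middle-aged", 0.8)},
]

value, degree = result_body[0]["age"]
print(value, degree)
```

The degree 0.8 here is invented for illustration; in the model it would be produced by the fuzzy compatibility measure used by the selection operator.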
Object-Oriented Fuzzy Databases

With the appearance of the first object-oriented databases in the early 1990s, research on fuzzy databases focused on the incorporation of fuzziness into all of their concepts. Fuzziness has been considered for attribute values, for the behavior (methods) of objects, for the structure definition (i.e., set of attributes) of classes and objects, for the relationships between objects and their classes, and
Equation 5. H = {(A_G1 : D_G1 [, CA_G1]), …, (A_Gn : D_Gn [, CA_Gn])}

Equation 6. B = {(A_G1 : d̃_i1 [, c_i1]), …, (A_Gn : d̃_in [, c_in])}, i = 1, 2, …, m
for the inheritance relationships between classes as well. This chapter only focuses on aspects related to the flexible comparison of objects, so we refer the reader to the book (de Caluwe, 1997), an excellent compilation of early work on fuzzy object-oriented databases, for further study. In this book, the reader can find a chapter by Kacprzyk, Zadrożny, de Tré, and de Caluwe, which summarizes the main fuzzy querying approaches, including the object-oriented proposals.
A Fuzzy Object-Relational Database Model

Previous efforts in the fuzzy database research field aimed at extending the well-studied and widespread database models of the moment, that is, the relational and object-oriented database models, in order to make them able to represent and handle fuzzy data. This work, as a natural continuation of those efforts, proposes an extension of object-relational databases. The proposal takes advantage of the extension mechanisms of ORDBMSs to make them able to represent, store, and retrieve fuzzy data, even of complex data types. Even though object-relational databases adopt an object-oriented model for data type definition and integrate it into the relational framework, the purpose of this work is not to incorporate fuzziness into the definition and hierarchy of data types. Most fuzzy object-oriented model proposals aim to soften the relationships between objects and classes, and between subclasses and superclasses. The proposal of this chapter, in contrast, aims to provide a way to flexibly compare complex fuzzy data. This section proposes a model to create fuzzy extensions of the basic data type constructs of ORDBMSs, the UDTs and the collection types, in order to make their equivalence relations flexible and to allow a seamless integration and representation of fuzzy data in the object-relational framework.
For this chapter’s purposes, the proposed model is named SDSDM as an acronym of soft data server database model.
Flexibly Comparable Types

As stated previously, the axis of the proposal is making the equivalence relations of complex data elements flexible. This section proposes a fuzzy extension of UDTs which allows a flexible comparison of their objects. In the previous section, devoted to a brief fuzzy database background, the idea of substituting an equivalence relation by a similarity relation (Buckles & Petry, 1982) or by a proximity relation (Shenoi & Melton, 1989) was introduced. For this chapter's purposes, a proximity or similarity relation substituting an equivalence relation is named a flexible equivalence relation. The application of this idea requires the definition of a flexible equivalence relation for each domain whose elements are going to be flexibly compared. These flexible equivalence relations are modeled by a fuzzy relation, which is typically described by a membership function. Relational databases and their fuzzy extensions do not provide a way to attach a user-defined flexible equivalence relation to a domain. To fill this gap, the implementations of the early fuzzy extensions stored the definitions of these relations as discrete membership functions in tables of the data dictionary (or metadatabase). This way, users are unable to define an arbitrary flexible equivalence relation, especially a continuous one or a flexible equivalence relation for a domain with high cardinality. As the underlying database model of this proposal is an object-relational one, it provides a seamless and natural way to attach the definition of the flexible equivalence relation of a domain (a UDT in the object-relational context) along with the structure and behavior definition of the domain. SDSDM proposes that the flexible equivalence relation of each flexibly comparable UDT should be specified as part of its behavior.
We define a flexibly comparable type (FCT) as a UDT which encloses, in its own definition, its flexible equivalence relation specification as one of its methods. In other words, a FCT is a class whose objects have a common method whose implementation defines the membership function of the flexible equivalence relation associated with the UDT. Such a method is named feq (fuzzy equal) and its functional definition is shown in Equation 7, where D is the domain defined as a FCT, and is the membership function of the flexible equivalence relation defined for the FCT.
(7)  feq: D × D → [0, 1], feq(a, b) = μ_S(a, b)
The domains (UDTs and built-in types) of a FORDBMS implementing the proposed model are divided into two separate groups: those implementing the feq method, and those not implementing it. The domains of the former group are named FCT types, and the domains of the latter group are named non-FCT (NFCT) types. The strength of this proposal lies in the natural way in which flexible equivalence relations are bound to their domains, at the database model level, and in the freedom with which these equivalence relations can be defined: virtually any flexible equivalence relation that can be implemented as a method is allowed.
A Flexibly Comparable Type for a Scalar Discrete Domain

One of the most immediate and simple FCTs is the one designed for modeling a scalar domain on which a flexible equivalence relation is defined. Each object of this FCT represents one scalar or, less formally, a label of the domain. The implementation of the feq method of the class returns a value corresponding to the application of the membership function of the flexible equivalence relation defined on the domain, which has been specified by the FCT designer. For instance, a FCT of this kind is useful to allow the flexible comparison of the cloudiness attribute domain values defined in the previous example. This domain is defined as {clear, sunny, cloudy, overcast}. A flexible equivalence relation for this domain, which in this case is a proximity relation, could be the one shown in Table 2. The designer of the FCT can implement it so that each object represents one of the scalars of the domain, and implement the feq method to return a value according to the previously defined flexible equivalence relation.

Table 2. Flexible equivalence relation on the cloudiness domain

          overcast   cloudy   sunny
  clear      0          0      0.75
  sunny      0        0.25
  cloudy    0.75
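A minimal sketch of such a scalar FCT can clarify the idea. The class name and structure below are illustrative assumptions (the chapter's actual implementation is a SQL UDT); only the proximity degrees are taken from Table 2.

```python
# Hypothetical sketch of a scalar FCT for the cloudiness domain.  Unlisted
# pairs have proximity 0; reflexive pairs have proximity 1.

class CloudinessFCT:
    DOMAIN = {"clear", "sunny", "cloudy", "overcast"}
    _PROXIMITY = {
        frozenset({"clear", "sunny"}): 0.75,
        frozenset({"sunny", "cloudy"}): 0.25,
        frozenset({"cloudy", "overcast"}): 0.75,
    }

    def __init__(self, label):
        if label not in self.DOMAIN:
            raise ValueError(f"{label!r} is not a cloudiness label")
        self.label = label

    def feq(self, other):
        """Membership function of the flexible equivalence relation."""
        if self.label == other.label:
            return 1.0  # a proximity relation is reflexive
        return self._PROXIMITY.get(
            frozenset({self.label, other.label}), 0.0)
```

For example, comparing `clear` with `sunny` yields 0.75, while comparing `clear` with `overcast` yields 0.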
A Flexibly Comparable Type for Fuzzy Numbers

Another good example of a possible FCT is the one designed to represent and flexibly compare fuzzy numbers. This FCT can be designed to model a domain whose elements are fuzzy numbers defined as trapezoidal possibility distributions. A trapezoidal possibility distribution is noted by its four characteristic values (a, b, c, d), with a ≤ b ≤ c ≤ d, and it is defined by the membership function shown in Equation 8.
Equation 8.

μ(x) = 0 if x ≤ a; (x − a)/(b − a) if a < x < b; 1 if b ≤ x ≤ c; (d − x)/(d − c) if c < x < d; 0 if x ≥ d
The proposed FCT includes an attribute of the numerical built-in type for each of the four values defining a trapezoidal possibility distribution. The resemblance of two fuzzy numbers can be calculated by means of their possibility measure. Therefore, the feq method of this FCT can be implemented as shown in Equation 9, where a and b are two fuzzy numbers and ⊗ is a t-norm.
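A numeric sketch can illustrate this comparison. This is not the chapter's implementation: the grid-search approximation of the supremum, the choice of the minimum t-norm, and all names below are our assumptions.

```python
# feq for trapezoidal fuzzy numbers via the possibility measure
# sup_x T(mu_a(x), mu_b(x)), assuming the minimum t-norm and a < b <= c < d.

def trapezoid(a, b, c, d):
    """Membership function of the trapezoidal distribution (a, b, c, d)."""
    def mu(x):
        if b <= x <= c:
            return 1.0
        if a < x < b:
            return (x - a) / (b - a)
        if c < x < d:
            return (d - x) / (d - c)
        return 0.0
    return mu

def feq_fuzzy_numbers(mu_a, mu_b, lo, hi, steps=10001):
    """Approximate sup over [lo, hi] of min(mu_a(x), mu_b(x)) on a grid."""
    best = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        best = max(best, min(mu_a(x), mu_b(x)))
    return best
```

Two identical trapezoids compare to 1; disjoint ones to 0; two trapezoids whose slopes cross at height 0.5 compare to roughly 0.5.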
(9)  feq(a, b) = sup_x ⊗(μ_a(x), μ_b(x))

A Generic Flexible Equivalence Relation for User Defined Complex Flexibly Comparable Types

Even though the flexible equivalence relation of a user defined FCT can be any relation designed by the user to meet particular requirements, this section introduces a general purpose flexible equivalence relation for user defined complex FCTs. The proposed flexible equivalence relation is a fuzzy resemblance measure for complex objects whose attributes are of either FCT or NFCT types. For the sake of simplicity, the proposed flexible equivalence relation is named the complex object resemblance measure (CORM). The original idea of CORM was first proposed in Marín, Medina, Pons, Sánchez, and Vila (2003) for the OODBMS context, and later adapted to the object-relational paradigm in Cubero, Marín, Medina, Pons, and Vila (2004). The way CORM determines the resemblance between two objects of the same type is sketched in Figure 1, where o1 and o2 are objects of the same FCT, a1, a2, …, an are attributes of these objects, and vij is the value of the attribute aj for the object oi. This procedure is divided into the following steps:

1. A resemblance degree between the pair of values of each attribute of the compared objects is calculated. For the i-th attribute, the function Sai(o1, o2), which is detailed later, is used.
2. The resemblance degrees of each pair of attribute values are aggregated. For this purpose, the VQ aggregator, which is described later, is employed.
Attribute Resemblance Measure

One of the two main components of CORM is the function that calculates the resemblance degree between a pair of values, of the same attribute, from two objects of the same FCT. This resemblance measure is based on the flexible equivalence relation for the attributes whose type is a FCT, and on the classical built-in equivalence for the attributes whose type is a NFCT. This resemblance measure is described by the function shown in Equation 10, where o1 and o2 are two objects of the same class, v1i and v2i are the values of the i-th attribute of the objects o1 and o2, ti is the class or type of the i-th attribute of these objects, feq is the membership function of the flexible equivalence relation for the type ti, and δ is the Kronecker delta function defined as Equation 11 shows.

Figure 1. Sketch of the process to calculate the resemblance of two objects as proposed in CORM
(10)  Sai(o1, o2) = feq(v1i, v2i) if ti is a FCT; δ(v1i, v2i) if ti is a NFCT

(11)  δ(x, y) = 1 if x = y; 0 otherwise

Attribute Resemblance Aggregation

Once the resemblance degree between the pair of values of both compared objects is calculated for each attribute, these degrees are aggregated to produce a global resemblance value for the objects. In CORM, this aggregation is calculated by obtaining the degree of truth of the vague sentence “Most of the important attributes of the class present similar values in both objects.” This vague sentence involves linguistic quantifiers (Zadeh, 1983). CORM allows the FCT designer to indicate a degree of relevance for each attribute, so the resemblance measure can be adapted to the type specificity. This makes CORM focus on the most relevant attributes of the type. The CORM proposal is based on Vila’s approach (Vila, Cubero, Medina, & Pons, 1995) to calculate the degree of truth of the previous sentence. This approach is based on the concept of a coherent family of quantifiers to interpret the fuzzy quantifier “Most”. In this approach, this quantifier is interpreted as a weighted combination of the degrees of truth of the existential ∃ and universal ∀ quantifiers. The degree of truth of the existential quantifier corresponds to the degree of truth of the sentence “There exists an important attribute which presents similar values in both objects.” This degree is calculated as shown in Equation 12. In this equation, A is the set of the attributes of the compared objects’ data type, and P is the fuzzy set of relevant attributes, whose membership function is defined as Equation 13 shows, where pi is the relevance weight of the i-th attribute given by the type designer. Finally, R is the fuzzy set of the attributes whose values resemble for both objects. Equation 14 shows the membership function of R.

(12)  Θ∃ = max_{ai ∈ A} min(μ_P(ai), μ_R(ai))

(13)  μ_P(ai) = pi

(14)  μ_R(ai) = Sai(o1, o2)
Likewise, the degree of truth of the universally quantified sentence “All important attributes present similar values in both objects” is calculated as shown in Equation 15, where the implication has been substituted as shown in Equation 16, and ⊕ represents a t-conorm.
(15)  Θ∀ = min_{ai ∈ A} ((1 − μ_P(ai)) ⊕ μ_R(ai))
(16)  x → y ≡ (1 − x) ⊕ y
As stated before, Vila’s approach combines the previous degrees of truth to obtain the degree of truth of the fuzzy quantifier Most. This degree of truth is calculated by the function defined in Equation 17, where α is a factor in the interval [0, 1] for modeling the fuzzy quantifier Most. The value of this factor for the existential quantifier is 1, and for the universal quantifier it is 0.
(17)  Θα = α·Θ∃ + (1 − α)·Θ∀
The result of the previous function corresponds to the resemblance degree of a pair of compared objects. More formally, the resemblance degree of two compared objects is defined as Equation 18 shows.
(18)
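The aggregation described above can be sketched in a few lines. This is a hedged illustration, not the chapter's implementation: the max/min interpretation of the quantifiers, the max(1 − x, y) implication, and all names are our assumptions.

```python
# Sketch of the CORM aggregation (Equations 12-17).

def corm(o1, o2, attrs, alpha=0.5):
    """attrs: list of (p_i, S_i) pairs, where p_i is the relevance weight
    of the i-th attribute and S_i(o1, o2) is its resemblance function
    (feq for FCT attributes, Kronecker delta for NFCT ones)."""
    degrees = [(p, s(o1, o2)) for p, s in attrs]
    # "There exists an important attribute with similar values"
    theta_exists = max(min(p, r) for p, r in degrees)
    # "All important attributes have similar values", with the
    # implication x -> y modeled as max(1 - x, y)
    theta_forall = min(max(1.0 - p, r) for p, r in degrees)
    # Weighted combination modeling the fuzzy quantifier "Most"
    return alpha * theta_exists + (1.0 - alpha) * theta_forall

# Kronecker delta for NFCT attributes (Equation 11)
delta = lambda x, y: 1.0 if x == y else 0.0
```

For instance, with two attributes of weights 1.0 and 0.5 whose value pairs resemble to degrees 0.8 and 0.2, and α = 0.5, the aggregated resemblance is 0.65.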
Fuzzy Collections

Previously, it has been stated that the proposed SDSDM model aims to create fuzzy extensions of the basic data type constructs of ORDBMSs. In the previous section, a fuzzy extension of UDTs has been proposed, so that the objects of the extended UDTs can be flexibly compared. In this section, a fuzzy extension of collection types is proposed. The proposed extension aims to add flexibility to the collection types by allowing partial membership of their elements. If a collection is extended to allow partial membership, there is no direct transposition from it to a fuzzy set, as it might seem at first glance. A fuzzy set, in its classical sense, is defined on a classical domain, where the classical equality can be applied. This is not the general case of the fuzzy sets defined in the SDSDM model: the base type of a fuzzy extension of a collection type can be a FCT, where the classical equality is substituted by a flexible equivalence relation. For this reason, we consider an extension of a collection type as a fuzzy set whose associated operators are modified to take into account the possible flexibility of the equivalence relation of the base type of the collection. Such an extension is named a fuzzy collection.
Basic Operators on Fuzzy Collections

A useful concept for the redefinition of the basic fuzzy collection operators is the similarity driven extension of a fuzzy collection. We define the similarity driven extension of a fuzzy collection A as a fuzzy set A^S containing the elements of the original fuzzy collection A and the domain elements which are similar, to a degree greater than zero, to the elements of A. Formally, A^S is defined as shown in Equation 19, where S is the flexible equivalence relation defined for the base type of the fuzzy collection A.
(19)  A^S = A ∪ {x ∈ D | ∃y ∈ support(A): S(x, y) > 0}
The membership function of a similarity driven extension of a fuzzy collection is defined in Equation 20, where D is the underlying domain (i.e., the base type) on which A is defined.
(20)  μ_{A^S}(x) = max_{y ∈ D} min(μ_A(y), S(x, y))
In the case of a fuzzy collection whose elements are of a NFCT, the flexible equivalence relation is the classical identity. In this case, a fuzzy collection behaves as a fuzzy set in the classical sense; thus, a fuzzy set is a particular case of a fuzzy collection. With the help of the previous concept, the intersection of two fuzzy collections is a fuzzy collection defined as shown in Equation 21. This operation is noted ∩S in order to distinguish it from the fuzzy set intersection. This definition can be reformulated in a set oriented view as Equation 22 shows.
(21)  A ∩S B = (A ∩ B^S) ∪ (B ∩ A^S)
(22)  x ∈ A ∩S B ⟺ (x ∈ A ∧ x ∈ B^S) ∨ (x ∈ B ∧ x ∈ A^S)
The previous definition takes into account the partial equivalence of FCT objects by correlating the elements of each intersected set with the similarity driven extension of its counterpart. The corresponding membership function is shown in Equation 23.
(23)  μ_{A ∩S B}(x) = max(min(μ_A(x), μ_{B^S}(x)), min(μ_B(x), μ_{A^S}(x)))
The union and complement operations of a fuzzy collection remain as they are defined for classical fuzzy sets. These operators do not need to correlate the members of their operands (i.e., fuzzy collections), and therefore the flexible equivalence relation is not needed to perform them. The membership function of the fuzzy collection resulting from the union of two fuzzy collections A and B is shown in Equation 24.
(24)  μ_{A ∪ B}(x) = max(μ_A(x), μ_B(x))
In the same way, the membership function of the complement of a fuzzy collection A is the following:
(25)  μ_{¬A}(x) = 1 − μ_A(x)
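These operators can be sketched over a small finite domain. The encoding below is an assumption of ours: a fuzzy collection is a dict mapping elements to membership degrees, S is the flexible equivalence relation of the base type, and min/max stand in for the t-norm/t-conorm.

```python
# Sketch of the basic fuzzy collection operators.

def extension(A, S, domain):
    """Similarity driven extension A^S (Equations 19-20)."""
    return {x: max((min(mu, S(x, y)) for y, mu in A.items()), default=0.0)
            for x in domain}

def intersection_s(A, B, S, domain):
    """Flexible intersection: correlate each operand with the similarity
    driven extension of its counterpart (Equations 21-23)."""
    AS, BS = extension(A, S, domain), extension(B, S, domain)
    result = {}
    for x in domain:
        m = max(min(A.get(x, 0.0), BS[x]), min(B.get(x, 0.0), AS[x]))
        if m > 0.0:
            result[x] = m
    return result

def union(A, B):
    """Classical fuzzy union (Equation 24)."""
    return {x: max(A.get(x, 0.0), B.get(x, 0.0)) for x in set(A) | set(B)}

def complement(A, domain):
    """Classical fuzzy complement (Equation 25)."""
    return {x: 1.0 - A.get(x, 0.0) for x in domain}
```

With the cloudiness proximity relation of Table 2, the crisp collections {clear} and {sunny} have a non-empty flexible intersection, even though their classical intersection is empty.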
Fuzzy Collection Inclusion

One of the most common relational operators used in queries in an ORDBMS, when they involve a collection type field, is the inclusion operator. In its classical version, this operator returns true when all the elements of the left operand are elements of the right operand, and false otherwise. Likewise, the classical inclusion operator for fuzzy sets returns a Boolean value indicating whether the fuzzy set corresponding to the left operand is included in the fuzzy set corresponding to the right operand. Beyond the classical conception of the inclusion operator for fuzzy sets, there is a large variety of proposals which make this operator flexible by defining a degree of inclusion instead of a Boolean value. One of the reference works on flexible inclusion operators for fuzzy sets is Sinha and Dougherty (1993). A direct adoption for fuzzy collections of the classical fuzzy version of the inclusion operator is not coherent. As argued in the previous subsection, an equivalent operator for fuzzy collections should take into account the peculiarities of the domains allowed in the SDSDM model, as the model allows the definition of domains where a flexible equivalence relation substitutes the classical equality. In order to calculate the degree to which a fuzzy collection A is included in a fuzzy collection B, the SDSDM model makes use of the resemblance driven inclusion degree. This degree was originally proposed in Marín et al. (2003) and
later adapted for its use in a FORDBMS context in Cubero et al. (2004). If the original function is adapted to the notation proposed in this work, by means of the similarity driven extension of a fuzzy set, the resemblance driven inclusion degree of a fuzzy collection A in a fuzzy collection B is calculated taking into account the reasoning shown in Equation 26.
(26)  A is included in B ⟺ ∀x ∈ D: x ∈ A → x ∈ B^S
The functional expression to calculate the degree of truth of the previous reasoning is shown in Equation 27, where I is a fuzzy implication operator.
(27)  Inc(A, B) = min_{x ∈ D} I(μ_A(x), μ_{B^S}(x))
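A hedged sketch of this inclusion degree follows. The Kleene-Dienes implication I(a, b) = max(1 − a, b) is an assumed choice of fuzzy implication; names and the finite-domain encoding are ours.

```python
# Resemblance driven inclusion degree (Equation 27): the minimum over the
# domain of I(mu_A(x), mu_{B^S}(x)).

def extension(A, S, domain):
    """Similarity driven extension of A (Equations 19-20)."""
    return {x: max((min(mu, S(x, y)) for y, mu in A.items()), default=0.0)
            for x in domain}

def inclusion_degree(A, B, S, domain,
                     implication=lambda a, b: max(1.0 - a, b)):
    BS = extension(B, S, domain)
    return min(implication(A.get(x, 0.0), BS[x]) for x in domain)
```

With the classical identity as S, a subset is included to degree 1, while the reverse inclusion is limited by the membership of the extra elements.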
A Resemblance Measure for Fuzzy Collections

Another possible relational operator for collection types is the equality operator. In the classical context, this operator is applied to check whether two collections contain the same elements: if so, true is returned; otherwise, false is returned. There is also a version of this operator for fuzzy sets, which still returns true if the two compared fuzzy sets contain the same elements and false otherwise. Once again, a direct adoption for fuzzy collections of the classical fuzzy set equality operator is not possible. As stated before, this operator was conceived for fuzzy sets defined on an underlying domain where the classical equality holds. Thus, a fuzzy collection equality operator that takes into account the flexible equivalence relation defined on the base type of the collection is needed. Besides, the equality operator must take into account the semantics of the fuzzy collections. As with fuzzy sets, the semantics of the membership of an element in a fuzzy collection affects the way the resemblance degree is determined. On the one hand, the basic set oriented view of a fuzzy set traditionally calls for a conjunctive semantics. If the value of an attribute is a fuzzy set with conjunctive semantics, all the fuzzy set elements are meant to be actual values of the attribute. On the other hand, fuzzy sets can be used to model exclusive expressions (for instance, possibility distributions), where the semantics is clearly disjunctive. An attribute whose value is a possibility distribution represented as a fuzzy set with disjunctive semantics is meant to actually have only one value. The actual value of the field is unknown, but it must be an element of the possibility distribution.
A Resemblance Measure for Fuzzy Collections with Conjunctive Semantics
The SDSDM model adopts the Generalized Resemblance Degree between Fuzzy Sets proposed in Marín et al. (2003), which once more was adapted to the FORDBMS context in Cubero et al. (2004) for measuring the resemblance of two fuzzy collections with conjunctive semantics. This operator is devised relying on the concept of double inclusion. Two sets A and B are equal if, and only if, A is included in B and B is included in A, or more formally as shown in Equation 28.
(28)  A = B ⟺ A ⊆ B ∧ B ⊆ A
The previous expression makes use of the classical inclusion operator. When A and B are two fuzzy collections, the resemblance driven inclusion degree depicted in the previous section is applied instead; thus, the flexible equivalence relation defined for the base types of the fuzzy collections is taken into account. This resemblance measure for fuzzy collections with conjunctive semantics is formally defined as Equation 29 shows.
(29)  A resembles B ⟺ A is included in B ∧ B is included in A
The degree of truth for the previous expression is calculated as shown in Equation 30.
(30)  Θc(A, B) = min(Inc(A, B), Inc(B, A))
A Resemblance Measure for Fuzzy Collections with Disjunctive Semantics

In order to determine the resemblance of two fuzzy collections with disjunctive semantics, the SDSDM model adopts the classical possibility compatibility measure of possibility distributions, adapted so that the flexible equivalence relation of a FCT base type is taken into account. The classical possibility compatibility measure of two possibility distributions A and B corresponds to the truth value of the expression shown in Equation 31.
(31)  Π(A, B) ⟺ ∃x: x ∈ A ∧ x ∈ B ⟺ A ∩ B ≠ ∅
As the intersection operator for fuzzy collections has been defined previously, the second expression in the previous sentence is used to adapt the classical possibility measure to the peculiarities of fuzzy collections. The resemblance measure for fuzzy collections with disjunctive semantics is defined as Equation 32 shows.
(32)  A resembles B ⟺ A ∩S B ≠ ∅
Therefore, the functional definition of the resemblance measure results as shown in Equation 33.
(33)  Θd(A, B) = max_{x ∈ D} μ_{A ∩S B}(x)
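The two resemblance measures can be sketched together. The min/max choices for t-norm, implication, and conjunction are illustrative assumptions, as are all names.

```python
# Conjunctive semantics via double resemblance driven inclusion
# (Equations 28-30); disjunctive semantics via the supremum of the
# flexible intersection (Equations 31-33).

def extension(A, S, domain):
    return {x: max((min(mu, S(x, y)) for y, mu in A.items()), default=0.0)
            for x in domain}

def inclusion_degree(A, B, S, domain):
    BS = extension(B, S, domain)
    return min(max(1.0 - A.get(x, 0.0), BS[x]) for x in domain)

def resemblance_conjunctive(A, B, S, domain):
    """Double inclusion: A resembles B iff A is included in B and B in A."""
    return min(inclusion_degree(A, B, S, domain),
               inclusion_degree(B, A, S, domain))

def resemblance_disjunctive(A, B, S, domain):
    """Possibility-style measure: sup of the membership of the flexible
    intersection of A and B."""
    AS, BS = extension(A, S, domain), extension(B, S, domain)
    return max(max(min(A.get(x, 0.0), BS[x]),
                   min(B.get(x, 0.0), AS[x])) for x in domain)
```

With the classical identity as S, both measures return 1 for identical crisp collections and 0 for disjoint ones.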
Fuzzy Collections as Flexibly Comparable Types

In the previous subsections, a way to calculate the resemblance of fuzzy collections, either with conjunctive or with disjunctive semantics, has been introduced. These measures provide a way to flexibly compare fuzzy collections depending on their semantics. The only requirement for a database type to be considered a FCT is the availability of a way to calculate the degree to which two values of the type are equivalent. If the previously introduced resemblance measures are used for this purpose, a fuzzy collection type can be considered a FCT, and therefore it can be nested in other complex FCTs as any other FCT.
User Defined Linguistic Labels

The usage of linguistic labels is a common practice in fuzzy database querying. Actually, linguistic labels play for fuzzy databases the role of what is commonly known as symbolic constants. Linguistic labels are an easy way to specify common domain values in queries. They are especially convenient for those values of a FCT which are large fuzzy collections or complex objects, as they provide a shortcut for these values. In this case, the usage of a linguistic label improves query clarity, because it avoids complex constant expressions and, of course, saves query writing effort. In the SDSDM model, a user defined set of linguistic labels can be attached to every FCT. The relation between a FCT and its linguistic labels, together with the value which each linguistic label represents, is maintained in a table of the data dictionary (or metadatabase).
Special Linguistic Labels

In the SDSDM model, every FCT has a predefined set of special linguistic labels. These linguistic labels represent special values which are used to model the ignorance of a field value (the UNKNOWN label), the inapplicability of a field (the UNDEFINED label), and the ignorance about the applicability of the field and about its value if the field were applicable (the NULL label). These special linguistic labels are directly taken from the GEFRED model (Medina et al., 1994). The way the special linguistic labels are defined in the SDSDM model differs slightly from the GEFRED model. In the latter, the special linguistic labels are defined as possibility distributions, which are the GEFRED model basic constructs. As SDSDM includes some more type constructs, such as conjunctive fuzzy collections and flexibly comparable types that are not possibility distribution oriented, the special linguistic labels are defined in terms of their resemblance values when they are compared to other domain values. When the resemblance of the linguistic label UNKNOWN to another domain value is computed, the expression shown in Equation 34 is applied.

(34)  feq(UNKNOWN, x) = 1

If the special linguistic label UNDEFINED is compared to another domain value or to the label UNKNOWN, the resemblance degree is computed by applying the expression shown in Equation 35.

(35)  feq(UNDEFINED, x) = 0

When the special linguistic label NULL is involved in a resemblance comparison with another domain value or with the special linguistic labels UNKNOWN or UNDEFINED, the resemblance degree is computed as shown in Equation 36.

(36)  feq(NULL, x) = null

where null is the native null value of object-relational databases. The reader should not confuse this null value, named native null for the sake of clarity, with the special linguistic label NULL. The former is a value of the three-valued logic used in relational and object-relational databases, where null means ignorance of the truth value of a proposition. It is used by the SDSDM model to indicate ignorance about a resemblance degree value, because it cannot be determined with the current data in the database. The latter, the special linguistic label NULL, as stated earlier, represents the case of total ignorance about the applicability of a field, and the ignorance of its value if it were applicable.
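The handling of the special labels during a comparison can be sketched as follows. The wrapper function is a hypothetical helper of ours, assuming the GEFRED-style rules described above (UNKNOWN compares to 1, UNDEFINED to 0, NULL to the native null, with Python's None standing in for the native null).

```python
# Illustrative dispatch for special linguistic labels in feq comparisons.

UNKNOWN, UNDEFINED, NULL = "UNKNOWN", "UNDEFINED", "NULL"

def feq_with_special_labels(x, y, base_feq):
    if NULL in (x, y):
        return None   # native null: the resemblance degree is unknowable
    if UNDEFINED in (x, y):
        return 0.0    # an inapplicable field resembles nothing
    if UNKNOWN in (x, y):
        return 1.0    # an unknown value may resemble anything
    return base_feq(x, y)
```

Note the order of the checks: NULL dominates UNDEFINED, which in turn dominates UNKNOWN, matching the precedence implied by Equations 34-36.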
Expression Capabilities of SDSDM

With the previously defined elements, the SDSDM model reaches the expressive capabilities of the preceding fuzzy database models. The GEFRED model stands out for gathering in one model the contributions of the most important previous proposals. As a result, GEFRED is able to manage all the kinds of fuzzy and nonfuzzy data which are supported by most of the previous models. These kinds of fuzzy and nonfuzzy data are:

1. A single scalar. This kind of value is represented as a discrete possibility distribution including one element. The GEFRED model supports the definition of a fuzzy equivalence relation between the elements of the scalar domain.
2. A single number. This kind of value is represented as a built-in numerical type or as a discrete possibility distribution with only one element.
3. Possibility distributions on scalar domains. This kind of value is represented as discrete fuzzy sets with disjunctive semantics. The GEFRED model supports the definition of a fuzzy equivalence relation between the scalars of the domain on which the possibility distribution is defined.
4. Possibility distributions on numerical domains. This kind of value is represented as discrete fuzzy sets with disjunctive semantics. This kind of data includes the particular case of fuzzy numbers.
5. Special values representing the ignorance about a field value. This kind of value is represented by the special linguistic label UNKNOWN.
6. Special values representing the inapplicability of a field. This kind of value is represented by the special linguistic label UNDEFINED.
7. Special values representing the ignorance about the field applicability and about its value if it were applicable. This kind of value is represented by the special linguistic label NULL.
The SDSDM model is able to represent the previous kinds of values. These are represented by means of the following constructs:

1. A single scalar. This kind of value is represented by using the previously described FCT for discrete scalar domains. As GEFRED, the SDSDM model supports the definition of a fuzzy equivalence relation between the scalars of the domain.
2. A single number. In the SDSDM model, this kind of value is represented as the ORDBMS built-in numerical type.
3. Possibility distributions on scalar domains. This kind of value is represented in the SDSDM model as a combination of FCTs. The values are represented as a fuzzy collection with disjunctive semantics of values of the previously defined FCT that models the scalar domain.
4. Possibility distributions on numerical domains. In the SDSDM model, this kind of value is represented, as the previous kind, as a combination of data types. The values are represented as fuzzy collections with disjunctive semantics whose elements are values of the host ORDBMS built-in type for numbers. Fuzzy numbers are represented by the previously described FCT for fuzzy numbers.
5. Special values representing the ignorance about a field value. The special linguistic label UNKNOWN is also available in the SDSDM model for symbolizing the ignorance of the attribute or field value.
6. Special values representing the inapplicability of a field. The special linguistic value UNDEFINED is inherited by the SDSDM model from the GEFRED model in order to symbolize this kind of value.
7. Special values representing the ignorance about the field applicability and about its value if it were applicable. Once again, the SDSDM model adopts the special linguistic label NULL from the GEFRED model to symbolize this kind of value.

In addition to the previous kinds of values, the SDSDM model is able to represent the following:

8. Possibility distributions on any kind of FCT and NFCT domains. Fuzzy collections with disjunctive semantics are not limited to numerical or scalar values. In the SDSDM model, it is possible to define a fuzzy collection with disjunctive semantics on every domain of the database, which includes every FCT or NFCT type.
9. Fuzzy sets on any kind of FCT and NFCT domains. The SDSDM model supports the representation of fuzzy sets defined on every kind of database domain, particularly including FCT and NFCT types. This kind of data is represented by fuzzy collections with conjunctive semantics.
10. Complex data types including FCT and NFCT attribute values. As an object-relational database model, SDSDM allows the user to define complex data types as a structure of fields, which can include user defined methods to encapsulate their behavior. The SDSDM model includes an additional and optional feature to enable the flexible comparison of complex data elements by making use of a flexible equivalence relation. This flexible equivalence relation is totally user definable, as it is defined as a special method of the type, so the user can programmatically specify the relation. Of course, the SDSDM model also allows the definition and usage on each model construct of classical complex data types that are not flexibly comparable.
Soft Data Server: A Fuzzy Object-Relational Database Management System

The previously depicted model is implemented in an experimental prototype named soft data server (SDS) (Cubero et al., 2004). SDS is an extension of a well-known and widespread commercial ORDBMS, and it creates a FORDBMS on top of the underlying ORDBMS. SDS extends the host ORDBMS by taking advantage of the extension mechanisms included in the latest SQL standards, SQL:1999 and SQL:2003, as the underlying ORDBMS is compliant with the parts of these standards that are useful for SDS purposes. SDS mainly defines a group of UDTs which hold the representation and manipulation details of fuzzy data in the database, and help users to create their own FCTs. These UDTs, and their supertype/subtype relations, are depicted in Figure 2. The figure includes a pair of abstract data types that do not correspond to real database types, but help to clarify the SDS type structure. These abstract data types are DatabaseDataTypes, which models a root type for every database data type, and BuiltInTypes, which is the common ancestor for the built-in data types of the ORDBMS. In Figure 2, these abstract data types have a dark background in order to differentiate them from non-abstract UDTs.

Figure 2. SDS data types

The UDTs for fuzzy data representation and manipulation included in SDS are the following:

• FlexiblyComparableTypes: This UDT is the common ancestor for all the UDTs included in SDS. Its main purpose is to encapsulate all the common and compulsory behavior for FCTs. One of these compulsory methods is an abstract method for flexible comparison, which is redefined in each subtype in order to implement its corresponding flexible equivalence relation. This redefinition is particularly important in user defined FCTs, so the user attaches an especially designed flexible equivalence relation to the data type. Another common behavior encapsulated in this UDT is the set of methods that allow the definition, update, and deletion of linguistic labels for the type.
• AtomicFCT: This UDT acts as the common ancestor for those SDS UDTs designed to represent non complex or set oriented data, namely atomic fuzzy data, such as fuzzy numbers and scalars.
• FuzzyNumbers: This UDT is designed to represent and manage fuzzy numbers in SDS. This UDT is a FCT, as it implements the method feq. It encapsulates the flexible equivalence relation proposed in the subsection devoted to describing the FCT for fuzzy numbers. Additionally, this type includes methods implementing fuzzy relational comparators for fuzzy numbers. These fuzzy relational comparators are similar to those included in the FSQL server (Galindo, Medina, Pons, & Cubero, 1998; Galindo et al., 2006), where they are named fuzzy comparators.
• FCScalars: This UDT is designed as a common ancestor for the FCTs representing discrete scalar domains on which a flexible equivalence relation is defined. The type encloses helper methods for the definition and removal of FCTs representing flexibly comparable scalar domains along with their associated flexible equivalence relations.
• FuzzyCollections: This UDT is the common ancestor of every type included in SDS which represents an extension of the collection types of the ORDBMSs.
• ConjunctiveFC: This UDT gathers all the necessary functionality related to fuzzy collections with conjunctive semantics. This functionality includes helper methods for the definition and deletion of fuzzy collections of a base type determined by the user. In addition to these DDL methods, the type includes a default implementation of the feq method, which implements the flexible equivalence relation. This implementation follows the flexible equivalence relation for fuzzy collections with conjunctive semantics described in the subsection devoted to this kind of data.
• DisjunctiveFC: This UDT is analogous to the previously described UDT, but it is conceived for being the common ancestor of fuzzy collections with disjunctive semantics. As the previous UDT, this data type includes helper methods for the definition and deletion of fuzzy collections whose elements are of a user determined base type. Likewise, the type includes a default implementation of the feq method that is based on the functional specification of the flexible equivalence relation described in the SDSDM model for this kind of type.
• ComplexFCT: This UDT is a common ancestor for every user defined FCT designed to represent and manage complex data organized as a structure of fields. It includes a set of methods that encapsulate the user defined behavior for the data type. This supertype includes a specific implementation of the feq method that is based on the generic flexible equivalence relation for flexibly comparable complex data types described in a previous section.
A Fuzzy Object-Relational Database Application on Image Retrieval

This section introduces a sample application of a FORDBMS implementing the SDSDM model. The example focuses on querying an image database by dominant color criteria, where each dominant color is described by linguistic labels. In this example, the previously proposed data types are used to model a complex data type for representing the dominant colors of an image, and flexible equivalence relations are used for retrieving images with a similar set of dominant colors.
Dominant Fuzzy Color Descriptors

The basis of this example has been previously published (Barranco, Medina, Chamorro-Martínez, & Soto-Hidalgo, 2006; Chamorro-Martínez, Medina, Barranco, Galán-Perales, & Soto-Hidalgo, 2007). In these papers, an algorithm for extracting dominant fuzzy colors from an image is described. For the purpose of this example, the only relevant detail is that the output of the algorithm is a fuzzy set representing the dominant fuzzy colors of an image.
We refer the reader to the previous references for the particular details of this algorithm. A fuzzy color is described as a composition of three linguistic labels describing the hue, saturation, and illumination components of the HSI color space. A simplified set of these linguistic labels is illustrated in Figure 3, where each label is associated with its corresponding trapezoidal possibility distribution defined on the domain of a color component. Each image is related to a fuzzy set of dominant fuzzy colors, where the membership degree of each dominant fuzzy color is equivalent to its degree of dominance in the image. An example of the fuzzy set of dominant fuzzy colors of an image is shown in Equation 37.

{ 0.7/(red, lowSaturation, bright), 0.5/(blue, highSaturation, veryHighIllumination), 0.3/(yellow, mediumSaturation, highIllumination) } (37)
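The trapezoidal labels of Figure 3 and the fuzzy set in Equation 37 can be sketched as follows. The breakpoints of the two saturation labels are hypothetical values chosen for illustration, not the chapter's actual definitions:

```python
# Trapezoidal possibility distributions for linguistic labels (Figure 3) and
# the fuzzy set of dominant fuzzy colors of Equation 37. The breakpoints of
# the saturation labels are hypothetical, illustrative values.

def trapezoid(a, b, c, d):
    """Trapezoidal distribution with support [a, d] and core [b, c]."""
    def mu(x):
        if x < a or x > d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu

low_saturation = trapezoid(0.0, 0.0, 0.2, 0.4)    # illustrative breakpoints
high_saturation = trapezoid(0.6, 0.8, 1.0, 1.0)

# The fuzzy set of Equation 37: (hue, saturation, illumination) -> dominance.
dominant_colors = {
    ("red", "lowSaturation", "bright"): 0.7,
    ("blue", "highSaturation", "veryHighIllumination"): 0.5,
    ("yellow", "mediumSaturation", "highIllumination"): 0.3,
}

print(low_saturation(0.3))  # ~0.5, on the descending slope of the label
```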
Figure 3. Linguistic labels for HSI color components
At first glance, an evident advantage for the user can be noticed. Traditional image databases work with numerical quantities, which makes it harder for users to understand the stored values and to write queries. The present approach makes color definition and comparison flexible and, at the same time, makes query definitions and query results easier for users to understand.
Flexible Comparable Types for Representing Dominant Fuzzy Color Descriptors
The previous subsection described a descriptor for the set of dominant colors of an image. When this data is handled by an image retrieval system supported by a DBMS, a UDT representing this complex piece of data can ease the storage, handling, and querying of these image descriptors. As these descriptors are a complex combination of fuzzy data constructs, a FORDBMS based on the SDSDM model is very convenient for storing and querying this kind of data. The data type DominantColorSet is the FCT that models the proposed fuzzy color descriptor for the image domain. This data type is modeled as shown in Figure 4. Let us describe the definition of the DominantColorSet data type using a bottom-up approach. A fuzzy color is basically composed of linguistic labels representing its hue, saturation, and illumination components. These linguistic labels, in turn,
Figure 4. UML diagram for the DominantColorSet data type (showing DominantColorSet, FuzzyColor, FHue, FSaturation, FIllumination, ConjunctiveFC, ComplexFCT, AtomicFCT, and FlexiblyComparableType)
correspond to trapezoidal possibility distributions defined on the crisp numerical hue, saturation, and illumination domains, as shown in Figure 3. Each component of a fuzzy color can be modeled as an FCT based on the FuzzyNumber data type of SDS. Thus, the FCTs FHue, FSaturation, and FIllumination are defined to model the domains of the components of a fuzzy color. The linguistic labels shown in Figure 3 are then defined on their corresponding data types, which makes them valid constant values. As described previously, a fuzzy color is a composition of three linguistic labels, each one representing a fuzzy value of the hue, saturation, and illumination components. In the database, a fuzzy color is represented as a complex object. The data type FuzzyColor, a subtype of the ComplexFCT SDS data type, is a complex data type composed of three fields whose values are respectively of the FHue, FSaturation, and FIllumination data types. Finally, the DominantColorSet data type is defined as a fuzzy collection with conjunctive semantics whose elements are of the FuzzyColor FCT. The membership degree of each element of the fuzzy collection corresponds to the degree of dominance of the represented fuzzy color.
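The structure of Figure 4 can be mirrored in a few lines. Component comparison is reduced to exact label matching for brevity (the actual FHue, FSaturation, and FIllumination types compare trapezoidal distributions), so the classes below are a structural sketch only:

```python
# Structural sketch of Figure 4: FuzzyColor as a three-field complex type,
# DominantColorSet as a conjunctive fuzzy collection of FuzzyColor values.
# Label equality stands in for the real trapezoid-based component feq.

class FuzzyColor:
    def __init__(self, hue, saturation, illumination):
        self.hue, self.saturation, self.illumination = hue, saturation, illumination

    def feq(self, other):
        degrees = [
            1.0 if self.hue == other.hue else 0.0,
            1.0 if self.saturation == other.saturation else 0.0,
            1.0 if self.illumination == other.illumination else 0.0,
        ]
        return min(degrees)  # the strict default inherited from ComplexFCT

    def key(self):
        return (self.hue, self.saturation, self.illumination)

class DominantColorSet:
    """Conjunctive fuzzy collection: fuzzy color -> degree of dominance."""
    def __init__(self, *pairs):  # pairs of (degree, FuzzyColor)
        self.members = {color.key(): degree for degree, color in pairs}

descriptor = DominantColorSet(
    (0.7, FuzzyColor("red", "lowSaturation", "bright")),
    (0.5, FuzzyColor("blue", "highSaturation", "veryHighIllumination")),
)
print(descriptor.members[("red", "lowSaturation", "bright")])  # 0.7
```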
Flexible Operators for Dominant Color Based Retrieval
Once the dominant color descriptors are stored in the fuzzy database along with their corresponding images, it is necessary to provide the user with a way of defining queries that retrieve images using the color descriptors as criteria. As the dominant color descriptors of images are fuzzy collections of fuzzy colors, the previously defined operators for fuzzy collections are useful here. The previously defined resemblance-driven inclusion is a flexible way of retrieving from a database the images that include a given set of dominant fuzzy colors. For instance, the user could ask the system for images including highly saturated bright red and medium saturated, highly illuminated blue as dominant colors. If the inclusion operator is used to define a condition on
the set of dominant colors in a query, the result is composed of the images that include the set of fuzzy dominant colors defined in the condition. Moreover, as the proposed inclusion operator is based on resemblance measures, the results also contain images that include a similar set of fuzzy dominant colors. Each returned image is related to a fulfillment degree, which increases with the similarity between the image's set of dominant fuzzy colors and the set defined in the condition. This fulfillment degree makes it possible to order the results by their similarity to the query conditions. Another useful element for query definition in the proposed image database is the previously defined fuzzy collection resemblance measure, which is modeled as the fuzzy collection resemblance operator. If this operator is used in queries, the user can retrieve images whose dominant fuzzy color descriptor is similar to a given one. This kind of query is useful to retrieve from an image bank those images whose set of dominant colors is similar to that of a given sample image.
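One plausible formulation of a resemblance-driven inclusion degree (not the chapter's exact operator) is: for each color of the query collection, find its best-supported counterpart in the image descriptor, then take the worst such support over the query. A Łukasiewicz implication compares the membership degrees:

```python
def luk_impl(a, b):
    """Lukasiewicz residuated implication."""
    return min(1.0, 1.0 - a + b)

def inclusion_degree(query, descriptor, feq):
    """Degree to which `query` is flexibly included in `descriptor`.
    Both arguments map elements to membership degrees; `feq` is the flexible
    equivalence relation on elements. An illustrative formulation only."""
    degree = 1.0
    for q, mu_q in query.items():
        best = max(
            (min(feq(q, d), luk_impl(mu_q, mu_d)) for d, mu_d in descriptor.items()),
            default=0.0,
        )
        degree = min(degree, best)  # every query color must be supported
    return degree

exact = lambda a, b: 1.0 if a == b else 0.0
image = {"red": 0.7, "blue": 0.5}
print(inclusion_degree({"red": 0.9}, image, exact))   # ~0.8: red slightly less dominant
print(inclusion_degree({"green": 0.5}, image, exact)) # 0.0: no resembling color
```

With a graded feq in place of exact matching, images containing merely similar colors also obtain positive degrees, which is the behavior the resemblance-driven operator provides.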
Flexible Equivalence Relation for Fuzzy Color Instances
As described previously, fuzzy color instances are values of the FuzzyColor data type. This data type is a subtype of the ComplexFCT data type, and therefore its default flexible equivalence relation is the previously proposed generalized resemblance measure for flexibly comparable types. During tests with users, it was found that the resemblance degree between two fuzzy colors computed with this default flexible equivalence relation turns out to be very strict. As the SDSDM model is designed with openness in mind, the deliberate inheritance relation between the user-defined FCT representing fuzzy colors (FuzzyColor) and its ancestor (ComplexFCT) makes it possible to redefine the flexible equivalence relation of the domain elements, so it can be adapted to the user's needs and the specificities of the domain. When the FuzzyColor data type is defined, its flexible equivalence relation is defined to match the
expression shown in Equation 38, where o1 and o2 are two objects of the FuzzyColor data type, o1.ai and o2.ai are the values of the i-th attribute of the objects o1 and o2, and feq is the membership function of the flexible equivalence relation of each corresponding attribute type.
(38)
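The exact expression of Equation 38 is not reproduced here, but its purpose can be illustrated: replace the strict, minimum-based combination of component degrees inherited from ComplexFCT with a less strict one. Taking the arithmetic mean of the component degrees is one such relaxation; this is an assumption for illustration, not necessarily the relation actually adopted:

```python
# Two ways of combining component-wise flexible equivalence degrees of a
# fuzzy color (hue, saturation, illumination). The mean-based variant is an
# illustrative relaxation; Equation 38 may use a different expression.

exact = lambda a, b: 1.0 if a == b else 0.0  # stand-in component feq

def feq_min(c1, c2, comp_feq=exact):
    """Generic (strict) relation: one mismatching component zeroes it."""
    return min(comp_feq(a, b) for a, b in zip(c1, c2))

def feq_mean(c1, c2, comp_feq=exact):
    """Relaxed relation: mismatches lower the degree gradually."""
    return sum(comp_feq(a, b) for a, b in zip(c1, c2)) / len(c1)

c1 = ("red", "highSaturation", "bright")
c2 = ("red", "highSaturation", "dark")
print(feq_min(c1, c2))   # 0.0: strict, one component differs
print(feq_mean(c1, c2))  # ~0.667: relaxed, two of three components match
```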
Image Retrieval by Dominant Fuzzy Color Criteria
Dominant-fuzzy-color-based image retrieval can be easily performed by taking advantage of the previously defined FCTs, operators, and the fuzzy color resemblance measure. These elements make the FORDBMS able to answer queries including conditions defined on the set of dominant colors. SDS takes advantage of the latest SQL standards, so a flexible query for this server can be expressed in standard SQL in which UDT methods and user-defined operators give access to the flexible query features of the FORDBMS. A dominant color based condition is, in fact, a flexible condition defined on the fuzzy collection of fuzzy colors describing the dominant colors of each image in the database. As stated before, the inclusion and resemblance operators of fuzzy collections can be used to create this kind of condition. Each user-defined fuzzy color constant used in a query can be defined by using the linguistic labels previously defined for the FCTs that represent the HSI color components. A requirement on a color component can be omitted by using the special linguistic label UNKNOWN, which fully resembles any value. An example of a condition using the fuzzy inclusion operator is "Retrieve all images including bright, very highly saturated red". This condition is defined using the fuzzy collection inclusion operator between the image descriptor and a user
Exhibit A.
SELECT image, cdeg(1)
FROM images
WHERE FCond(
        FInclusion(
          ColorDescriptor,
          DominantColorSet(1.0,
            FuzzyColor(FHue('red'), FSaturation('veryhighsat'),
                       FIllumination('bright')))),
        1) > 0
ORDER BY 2 DESC;
Figure 5. Color inclusion query results
defined fuzzy collection constant of fuzzy colors. A query applying this condition is expressed in SQL, using the previously defined data types and operators, as the sentence in Exhibit A, where cdeg is an ancillary operator (Murthy, Sundara, Agarwal, Hu, Chorma, & Srinivasan, 2003) that returns the fulfillment degree of the flexible condition marked by the same numerical value in the WHERE clause. FCond is a function that encloses flexible conditions and returns their fulfillment degree. An ancillary operator is a special operator that makes it possible to transfer the result of an operator in the WHERE section of a query to the SELECT section. In this case, FCond has the value 1 as its last argument to link it to the cdeg operator in the SELECT section, which receives the same value as its argument. Figure 5 shows the results of the previous query, along with some more complex query examples combining several color inclusion conditions of this type, applied on a database of 160 flag images. In this figure, the first column shows the fuzzy colors which must be included in the resulting images and, for each fuzzy color, a sample crisp color which fits it. The second column shows the top five most relevant images. Another example of this kind of condition is the requirement "Retrieve all the images including bright colors." In this case, the inclusion operator must ensure the inclusion of the fuzzy color whose hue is UNKNOWN, whose saturation is UNKNOWN, and whose illumination is bright. The label UNKNOWN avoids constraining the hue and saturation components. This query, applied on a database of about 700 color images, obtains the results shown in Figure 6. In the same figure, the results of a similar query for retrieving images with dark dominant colors are also included. Finally, an example of a query using the resemblance operator to retrieve the set of images of a database with a dominant color pattern similar to the one associated with a sample image is shown
Figure 6. Color inclusion query results for the illumination component
Exhibit B.
SELECT a.image, cdeg(1)
FROM images a, images b
WHERE b.id = # AND
      FCond(FEQ(a.ColorDescriptor, b.ColorDescriptor), 1) > 0
ORDER BY 2 DESC;
in Exhibit B, where # is a numeric constant corresponding to the numerical identifier (i.e., primary key) of the sample image. The results of several examples of this kind of query, applied on a database of about 700 color images, are shown in Figure 7. In each example, the first column shows the sample image, and the second column shows the set of images, ordered by relevance, with a dominant color pattern similar to the sample image. The sample image is excluded from the results for the sake of brevity, as it is obviously a perfect match.
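The behavior of Exhibit B can be emulated outside the DBMS: compute a resemblance degree between the sample image's descriptor and every other descriptor, keep the positive ones, and sort in descending order. The Jaccard-like resemblance below is a stand-in for the real FEQ operator, which compares colors flexibly rather than by identity:

```python
def resemblance(d1, d2):
    """Stand-in FEQ between two dominant-color descriptors (dicts mapping
    colors to dominance degrees): ratio of min- to max-memberships."""
    keys = set(d1) | set(d2)
    num = sum(min(d1.get(k, 0.0), d2.get(k, 0.0)) for k in keys)
    den = sum(max(d1.get(k, 0.0), d2.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0

images = {  # image id -> dominant color descriptor (toy data)
    1: {"red": 0.7, "blue": 0.5},
    2: {"red": 0.6, "blue": 0.4},
    3: {"green": 0.9},
}
sample_id = 1  # plays the role of the # constant in Exhibit B
ranked = sorted(
    ((resemblance(images[sample_id], d), img)
     for img, d in images.items() if img != sample_id),
    reverse=True,
)
ranked = [(round(deg, 3), img) for deg, img in ranked if deg > 0]
print(ranked)  # image 3 shares no colors and is filtered out
```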
Conclusion and Future Research Directions
At the beginning of this chapter, the current trends in the database world and in the fuzzy database research field have been discussed. On the one hand,
the most recent answer of crisp DBMS practitioners and manufacturers to the problem of complex data management is the object-relational database model. Object-relational databases are being well accepted by the database market due to their combination of the good features of the relational and object-oriented database models. As a result of this success and acceptance, the object-relational paradigm is gradually being incorporated into recent SQL standards. On the other hand, the aim of the fuzzy database models proposed in the literature has been to extend current database models to make them able to represent and retrieve fuzzy data. As a result, the relational model and the object-oriented database model have been extended in several ways.

Figure 7. Image resemblance query results

In this chapter, we propose the SDSDM model. This is an extension of the object-relational model that makes it able to represent fuzzy data. The model keeps the modeling power of previous fuzzy relational models, as it is able to represent the kinds of fuzzy data that can be represented in previous proposals. Moreover, the SDSDM model is able to represent new kinds of fuzzy data resulting from the extension of the basic data constructs of ORDBMSs, the UDTs and the collection data types. These new fuzzy data types make the SDSDM model able to represent complex data composed of fuzzy and crisp values, as well as fuzzy collections of data with conjunctive and disjunctive semantics. The model is designed with openness in mind, so the user can customize each default aspect of the model. Additionally, the chapter has introduced the basis of an implementation of the SDSDM model in a FORDBMS. The proposed data types can be seamlessly integrated in an ORDBMS by making use of its native extension mechanisms, that is, UDTs. As the fuzzy data representation features are integrated as native extensions, there is no need for a special bridge overlying the underlying DBMS to process the specificities of the fuzzy data. Actually, every component for fuzzy data handling
is fully integrated in the host ORDBMS, and the query language extension for fuzzy querying is fully SQL compliant. A sample application of the resulting FORDBMS and its query language has been presented. One of the main weaknesses of fuzzy databases is that the flexible conditions in queries greatly increase the number of candidate results. This reduces query processing performance, which makes fuzzy databases less competitive than crisp systems. Future research will focus on providing indexing mechanisms for fuzzy data in order to improve the performance of FORDBMS query processing. Additionally, the optimization of fuzzy queries should also be addressed, as it could contribute a further increase in fuzzy query processing performance. The reader can find in this book a chapter by Mouaddib, Raschia, Ughetto, and Voglozin studying the user requirements in fuzzy queries and the evaluation strategies, including index structures.
References
Barbara, D., Garcia-Molina, H., & Porter, D. (1992). The management of probabilistic data. IEEE Transactions on Knowledge and Data Engineering, 4(5), 487-502.
Barranco, C., Medina, J., Chamorro-Martínez, J., & Soto-Hidalgo, J. (2006, June). Using a fuzzy object-relational database for colour image retrieval. In H. L. L. G. Pasi (Ed.), Flexible Query Answering Systems, 7th International Conference (LNAI 4027, pp. 307-318). Springer.
Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3(1), 1-17.
Bosc, P., & Prade, H. (1996). An introduction to the fuzzy set and possibility theory-based treatment of flexible queries and uncertain or imprecise databases. In A. Motro & Ph. Smets (Eds.), Uncertainty management in information systems: From needs to solutions (pp. 285-324). Kluwer Academic.
Buckles, B. P., & Petry, F. E. (1982, May). A fuzzy representation of data for relational databases. Fuzzy Sets and Systems, 7(3), 213-226.
Cattell, R., & Barry, D. (2000). The object data standard: ODMG 3.0. Morgan Kaufmann Publishers Inc.
Chamorro-Martínez, J., Medina, J., Barranco, C., Galán-Perales, E., & Soto-Hidalgo, J. (2007, February). Retrieving images in fuzzy object-relational databases using dominant color descriptors. Fuzzy Sets and Systems, 158(3), 312-324.
Codd, E. F. (1979). Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4), 397-434.
Cubero, J. C., Marín, N., Medina, J. M., Pons, O., & Vila, M. A. (2004). Fuzzy object management in an object-relational framework. In Proceedings of the X International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU) (pp. 1767-1774).
de Caluwe, R. (Ed.). (1997). Fuzzy and uncertain object-oriented databases (concepts and models) (Vol. 13). World Scientific.
Eisenberg, A., & Melton, J. (1999). SQL:1999, formerly known as SQL3. SIGMOD Record, 28(1), 131-138.
Eisenberg, A., Melton, J., Kulkarni, K., Michels, J.-E., & Zemke, F. (2004). SQL:2003 has been published. SIGMOD Record, 33(1), 119-126.
Galindo, J., Medina, J. M., & Aranda, M. C. (1999). Querying fuzzy relational databases through fuzzy domain calculus. International Journal of Intelligent Systems, 14, 375-411.
Galindo, J., Medina, J. M., Cubero, J. C., & García, M. T. (2001). Relaxing the universal quantifier of the division in fuzzy relational databases. International Journal of Intelligent Systems, 16(6), 713-742.
Galindo, J., Medina, J., Pons, O., & Cubero, J. (1998). A server for fuzzy SQL queries. In T. Andreasen, H. Christiansen, & H. Larsen (Eds.), Flexible query answering systems (LNAI 1495, pp. 164-174). Springer.
Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing.
Marín, M., Medina, J., Pons, O., Sánchez, D., & Vila, M. (2003). Complex object comparison in a fuzzy context. Information and Software Technology, 45(7), 431-444.
Medina, J. M., Pons, O., & Vila, M. A. (1994). GEFRED: A generalized model of fuzzy relational databases. Information Sciences, 76(1-2), 87-109.
Mouaddib, N. (1993). Fuzzy identification in fuzzy databases: The nuanced relational division. In Proceedings of the 2nd International Symposium on Uncertainty Modeling and Analysis (pp. 455-462).
Murthy, R., Sundara, S., Agarwal, N., Hu, Y., Chorma, T., & Srinivasan, J. (2003). Supporting ancillary values from user defined functions in
Oracle. In Proceedings of the 19th International Conference on Data Engineering (pp. 151-162).
Petry, F. E. (1996). Fuzzy databases: Principles and applications. Boston: Kluwer Academic Publishers.
Prade, H. (1984). Lipski's approach to incomplete information data bases restated and generalized in the setting of Zadeh's possibility theory. Information Systems, 9(1), 27-42.
Shenoi, S., & Melton, A. (1989, July). Proximity relations in the fuzzy relational database model. Fuzzy Sets and Systems, 31(3), 285-296.
Sinha, D., & Dougherty, E. R. (1993, April). Fuzzification of set inclusion: Theory and applications. Fuzzy Sets and Systems, 55(1), 15-42.
Stonebraker, M., & Moore, D. (1996). Object-relational DBMSs: The next great wave. Morgan Kaufmann.
Tahani, V. (1977). A conceptual framework for fuzzy query processing: A step toward very intelligent database systems. Information Processing & Management, 13(5), 289-303.
Umano, M., & Fukami, S. (1994, February). Fuzzy relational algebra for possibility-distribution-fuzzy-relational model of fuzzy data. Journal of Intelligent Information Systems, 3(1), 7-27.
Vila, M., Cubero, J., Medina, J., & Pons, O. (1995). The generalized selection: An alternative way for the quotient operations in fuzzy relational databases. In B. Bouchon-Meunier, R. Yager, & L. Zadeh (Eds.), Fuzzy logic and soft computing (pp. 214-250). World Scientific.
Wong, E. (1982). A statistical approach to incomplete information in database systems. ACM Transactions on Database Systems, 7(3), 470-488.
Zadeh, L. (1965). Fuzzy sets. Information and Control, 8, 338-353.
Zadeh, L. (1971, April). Similarity relations and fuzzy orderings. Information Sciences, 3(2), 177-200.
Zadeh, L. (1983). A computational approach to fuzzy quantifiers in natural languages. Computers and Mathematics, 9(1), 149-184.
Zemankova, M., & Kandel, A. (1984). Fuzzy relational databases: A key to expert systems. TÜV Rheinland.

Key Terms
Flexible Equivalence Relation: A fuzzy relation between the values of a database data type that is a flexible replacement of the classical equivalence relation.
Flexible Query: A query whose restrictions, or conditions, are weakly defined. Usually, the restrictions of this kind of query are modeled as fuzzy sets. The results of the query are allowed to partially match its conditions.
Flexibly Comparable Type: A user-defined type which encapsulates an implementation of its flexible equivalence relation as a special method. The values of this type can be flexibly compared, so the result of the comparison is a numerical resemblance degree rather than the Boolean value returned by the classical equality comparator.
Fuzzy Collection: A fuzzy set whose elements are of a domain on which a flexible equivalence relation is defined. The operators on this kind of fuzzy set take this flexible equivalence relation into account when correlating domain elements.
Fuzzy Database: A database able to store and handle imperfect information, which is modeled by taking advantage of fuzzy set theory.
Fuzzy Object-Relational Databases: An extension of object-relational databases to allow them to store and handle fuzzy data.
Object Data Management Group (ODMG): A group of database vendors and practitioners founded with the aim of increasing the portability of customer software across object-oriented data management products.
Object-Relational Databases: A database whose model is the relational database model but enriched to allow entity attributes to be of complex data type.
Soft Data Server Database Model (SDSDM): A fuzzy object-relational database model that supports fuzzy versions of the basic object-relational database type constructs, the user-defined data types and the collection data types.
Chapter XVIII
Relational Data, Formal Concept Analysis, and Graded Attributes
Radim Belohlavek
Binghamton University – SUNY, USA and Palacky University, Czech Republic
Abstract
Formal concept analysis is a particular method of analysis of relational data. In addition, formal concept analysis provides elaborate mathematical foundations for relational data. Over the last decade, several attempts have appeared to extend formal concept analysis to data with graded (fuzzy) attributes. Among these attempts, an approach based on residuated implications plays an important role. This chapter presents an overview of the foundations of formal concept analysis of data with graded attributes, with a focus on the approach based on residuated implications and on its extensions and particular cases. It covers both of the main parts of formal concept analysis, namely, concept lattices and attribute implications, as well as the underlying foundations and related methods. In addition, the chapter contains an overview of topics for future research.
INTRODUCTION
Tabular Data, Formal Concept Analysis, and Related Methods
Tables, that is, two-dimensional arrays, represent perhaps the most popular way to describe data. Table rows usually correspond to objects of our interest, table columns correspond to some of their attributes, and table entries contain values of attributes on the respective objects. As an example, consider patients as objects and the patients' names,
weight, gender, and so forth as attributes. Table rows and columns are usually labeled by objects' and attributes' names. A particular case arises when all the attributes are logical attributes (presence or absence attributes) like male, headache, left-handed, and so forth. A patient either is a male or not, and, in general, either has a logical attribute or not. In this case, a table entry corresponding to object x and attribute y contains × or is blank depending on whether object x has or does not have attribute y.
Many methods of various kinds have been and are being developed for representation, processing, and analysis of tabular data. This chapter is concerned with formal concept analysis (FCA), which is a particular method of knowledge extraction from tabular data. Although some previous attempts exist (see Barbut, 1965), FCA was initiated by Wille’s (1982) seminal paper. Since then, significant progress has been made in theoretical foundations, algorithms, and methods. Applications of FCA can be found in many areas of human affairs, including engineering, sciences, economics, information processing, mathematics, psychology, and education; see, for example, Carpineto and Romano (2004b) and Koester (2006) for applications in information retrieval; Snelting and Tip (2000) for applications in object-oriented design; Ganapathy, King, Jaeger, and Jha (2007) for applications in security; Pfaltz (2006) for applications in software engineering; Zaki (2004) for how concept lattices can be used to mine nonredundant association rules; and Ganter and Wille (1999) and Carpineto and Romano (2004a) for further applications. Two monographs on FCA are available: Ganter and Wille (1999, mainly mathematical foundations) and Carpineto and Romano (2004a; mainly algorithms and applications). There are three international conferences devoted to FCA, namely, ICFCA (International Conference on Formal Concept Analysis), CLA (Concept Lattices and their Applications), and ICCS (International Conference on Conceptual Structures). In addition, further papers on FCA can be found in journals and proceedings of other conferences.
A table with logical attributes can be represented by a triplet ⟨X, Y, I⟩ where I is a binary relation between X and Y. Elements of X are called objects and correspond to table rows, elements of Y are called attributes and correspond to table columns, and for x ∈ X and y ∈ Y, ⟨x, y⟩ ∈ I indicates that object x has attribute y while ⟨x, y⟩ ∉ I indicates that x does not have y. For instance, Figure 1 (left) depicts a table with logical attributes. The corresponding triplet ⟨X, Y, I⟩ is given by X = {x1, x2, x3, …}, Y = {y1, y2, y3, …}, and we have ⟨x1, y1⟩ ∈ I, ⟨x2, y3⟩ ∉ I, and so forth. Since representing tables with logical attributes by triplets is common in FCA, we say "table ⟨X, Y, I⟩" instead of "triplet ⟨X, Y, I⟩ representing a given table." FCA aims at obtaining two outputs out of a given table. The first one, called a concept lattice, is a partially ordered collection of particular clusters of objects and attributes. The second one consists of formulas, called attribute implications (AIs), describing particular attribute dependencies that are true in the table. The clusters, called formal concepts, are pairs ⟨A, B⟩ where A ⊆ X is a set of objects and B ⊆ Y is a set of attributes such that A is the set of all objects that have all attributes from B, and B is the set of all attributes that are common to all objects from A. For instance, ⟨{x1, x2}, {y1, y2}⟩ and ⟨{x1, x2, x3}, {y2}⟩ are examples of formal concepts of the (visible part of the) left table in Figure 1. An attribute implication is an expression A ⇒ B with A and B being sets of attributes. A ⇒ B is true in table ⟨X, Y, I⟩ if each object having all attributes from A has all attributes from B as well. For instance, {y3} ⇒ {y2} is true in the (visible part of the) left table in Figure 1, while {y1, y2} ⇒ {y3} is not (x2 serves as a counterexample).
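The two derivation operators and the validity check for attribute implications fit in a few lines. The context below is a hypothetical table consistent with the examples just mentioned (the actual entries of Figure 1 are not reproduced here):

```python
def intent(A, I, Y):
    """All attributes shared by every object in A."""
    return {y for y in Y if all((x, y) in I for x in A)}

def extent(B, I, X):
    """All objects having every attribute in B."""
    return {x for x in X if all((x, y) in I for y in B)}

def ai_holds(A, B, I, X):
    """A => B holds iff every object with all of A has all of B."""
    return extent(A, I, X) <= extent(B, I, X)

# Hypothetical context consistent with the examples in the text.
X = {"x1", "x2", "x3"}
Y = {"y1", "y2", "y3"}
I = {("x1", "y1"), ("x1", "y2"), ("x1", "y3"),
     ("x2", "y1"), ("x2", "y2"),
     ("x3", "y2"), ("x3", "y3")}

A = extent({"y1", "y2"}, I, X)
print(sorted(A), sorted(intent(A, I, Y)))    # the concept <{x1,x2},{y1,y2}>
print(ai_holds({"y3"}, {"y2"}, I, X))        # True
print(ai_holds({"y1", "y2"}, {"y3"}, I, X))  # False: x2 is a counterexample
```

A pair ⟨A, B⟩ is a formal concept exactly when intent(A) = B and extent(B) = A, which the first print illustrates.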
Graded Attributes and Extensions of Formal Concept Analysis
Contrary to classical (two-valued) logic, fuzzy logic uses intermediate truth degrees in addition to 0 (false) and 1 (true). Fuzzy logic thus allows us to assign truth degrees like 0.8 to propositions like "Customer C is satisfied with service s." In this example, assigning 0.8 to the above proposition means that customer C was quite satisfied but not completely. This way, fuzzy logic attempts to deal with fuzzy attributes (graded attributes) like being tall, being satisfied (with a given service), and so forth. An example of a table with fuzzy attributes is presented in the right part of Figure 1. A table entry corresponding to object x and attribute y contains the truth degree of "object x has attribute y." For instance, object x1 has attribute y1 to degree 1, x2 has attribute y1 to degree 0.8, x2 has attribute y3 to degree 0.1, and so forth. If objects are patients and y1 is intensive headache, then the table says that patient x2 has a rather severe headache. Needless to say, dealing with fuzzy attributes by means of classical logic, that is, using only 0 and 1, and forcing a user to decide whether or not a given customer was satisfied, is not appropriate. Using intermediate truth degrees in addition to 0 and 1 has become known under the term fuzzy approach (graded approach). There are two basic ways to deal with formal concept analysis of tables with fuzzy attributes. The first one is to use so-called conceptual scaling (Ganter & Wille, 1999) to transform an input table with fuzzy attributes into a table with bivalent (yes/no) attributes and to use ordinary FCA to analyze the table with bivalent attributes. The second one, which is the topic of this chapter, is to extend ordinary FCA into a setting that enables us to deal with fuzzy attributes directly, that is, to extend FCA to a fuzzy setting. The first paper attempting to extend FCA to a fuzzy setting is Burusco and Fuentes-Gonzáles (1994). Due to technical difficulties, this approach did not prove successful. A different approach, based on the use of a residuated implication, was proposed independently in Pollandt (1997) and Belohlavek (1998).
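The residuated-implication approach can be illustrated with the Łukasiewicz structure on [0, 1]. The degrees below are illustrative; the operators compute, for a fuzzy set A of objects, the fuzzy set A↑ of attributes shared by A, and dually A↓, in the spirit of the Pollandt and Belohlavek line of work:

```python
def luk_impl(a, b):
    """Lukasiewicz residuum: a -> b = min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)

def up(A, I, X, Y):
    """A-up(y): degree to which every object of A has attribute y."""
    return {y: min(luk_impl(A[x], I[(x, y)]) for x in X) for y in Y}

def down(B, I, X, Y):
    """B-down(x): degree to which object x has every attribute of B."""
    return {x: min(luk_impl(B[y], I[(x, y)]) for y in Y) for x in X}

# Illustrative fuzzy context with degrees from the unit interval.
X, Y = ["x1", "x2"], ["y1", "y2"]
I = {("x1", "y1"): 1.0, ("x1", "y2"): 0.8,
     ("x2", "y1"): 0.8, ("x2", "y2"): 0.1}

A = {"x1": 1.0, "x2": 0.5}
B = up(A, I, X, Y)  # {'y1': 1.0, 'y2': 0.6}
print(B)
print(down(B, I, X, Y))  # returns A again: <A, B> is a formal fuzzy concept
```

Fuzzy formal concepts are exactly the pairs ⟨A, B⟩ fixed under this pair of operators, generalizing the crisp definition given earlier.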
Currently, this approach, its extensions, and its particular cases represent the mainstream in formal concept analysis of data with fuzzy attributes. A comprehensive overview of this mainstream is, however, not available. This chapter attempts to provide such overview.
Relational Data, Formal Concept Analysis, and Graded Attributes

Aim and Outline of This Chapter

We present an overview of formal concept analysis of data with fuzzy attributes. We focus on the approach based on residuated implication and comment on related approaches. The chapter covers two main parts of FCA, namely, concept lattices and attribute implications. In a sense, the chapter can be seen as an answer to the following question: Is it feasible to extend formal concept analysis in a way that naturally handles fuzzy attributes? The following are the main points we try to emphasize.

1. We present a sound generalization of the mathematical foundations of FCA. This concerns mainly concept lattices and attribute implications, that is, the two main outputs of FCA, but also mathematical structures directly related to FCA like closure operators, closure systems, Galois connections, and complete lattices. We use complete residuated lattices as a general structure of truth degrees. The ordinary (i.e., nonfuzzy) results on FCA turn out to be a particular case of our results when the complete residuated lattice is the two-element Boolean algebra of classical logic.
2. Although the computational aspects (design of efficient algorithms) are of secondary interest in this chapter, we present algorithms with the same order of complexity as those known from ordinary FCA (computation of fixed points of the fuzzy closure operators involved, computation of systems of pseudointents, computation of nonredundant bases of fuzzy attribute implications [FAIs]).
3. Our approach is based on following closely fuzzy logic in a narrow sense (see, e.g., Hájek, 1998). Note that fuzzy logic in a narrow sense, sometimes called mathematical fuzzy logic, denotes logical calculi aimed at reasoning with propositions that can take intermediate truth degrees, such as 0.7, in addition to 0 and 1. Briefly speaking, our definitions result from considering appropriate formulas and evaluating these formulas according to the principles of fuzzy logic. This has an important effect: the meanings of notions such as a formal concept, a concept lattice, the validity of an attribute implication, and so forth are essentially the same as in the ordinary setting. Furthermore, when developing fuzzy attribute logic, that is, a logical calculus for reasoning with rules A ⇒ B, we present both ordinary-style as well as Pavelka-style logics.
4. We present various results (representation results, reduction results) on relationships between the new structures that result in our approach, such as fuzzy concept lattices, fuzzy Galois connections, fuzzy attribute implications, and so on, and the ordinary structures, that is, concept lattices, Galois connections, attribute implications, and so forth.
5. We demonstrate that in a fuzzy setting, new phenomena arise. These phenomena are hidden in the ordinary setting but are interesting and important in a fuzzy setting. Two examples are presented in detail. First is the factorization of concept lattices by similarity, which allows us to consider a simplified version of the original concept lattice, namely, its factor lattice. Second is the usage of hedges (truth functions of the connective very true) to parameterize the underlying Galois connections. Hedges enable us to control the size of the resulting concept lattice. In addition, by setting hedges in an appropriate way, we obtain approaches proposed by other authors as particular cases of our approach.
6. Some of the results we present, although developed in a fuzzy setting, are new even for the ordinary setting. The method of reducing the size of concept lattices by closure operators is an example.
7. We present a survey of recent developments, extensions of the basic approach based on residuated implication, and directions for future research in the FCA of tables with fuzzy attributes.
PRELIMINARIES

Formal Concept Analysis in Ordinary Setting

Let ⟨X, Y, I⟩ be a data table with crisp attributes; that is, X and Y are finite sets (of objects and attributes) and I ⊆ X × Y is a binary relation between X and Y (see the introduction). ⟨X, Y, I⟩ is also called a formal context in FCA. We introduce the operators ↑: 2^X → 2^Y and ↓: 2^Y → 2^X by putting, for each A ⊆ X and B ⊆ Y,

A↑ = {y ∈ Y | for each x ∈ A: ⟨x, y⟩ ∈ I},
B↓ = {x ∈ X | for each y ∈ B: ⟨x, y⟩ ∈ I}.
A formal concept in ⟨X, Y, I⟩ is a pair ⟨A, B⟩ of A ⊆ X and B ⊆ Y such that A↑ = B and B↓ = A. Put B(X, Y, I) = {⟨A, B⟩ | A↑ = B, B↓ = A}; that is, B(X, Y, I) is the set of all formal concepts in ⟨X, Y, I⟩. Introduce a partial order ≤ on B(X, Y, I) by ⟨A₁, B₁⟩ ≤ ⟨A₂, B₂⟩ iff A₁ ⊆ A₂ (iff B₂ ⊆ B₁). The set B(X, Y, I) equipped with ≤ is called a concept lattice of ⟨X, Y, I⟩. Note that A↑ is the set of all attributes shared by all objects from A; dually, B↓ is the set of all objects sharing all attributes from B. Therefore, ⟨A, B⟩ is a formal concept iff A is the set of all objects sharing all attributes from B and, vice versa, B is the set of all attributes shared by all objects from A. A and B are called an extent and an intent of ⟨A, B⟩, respectively; an extent (intent) is thought of as a collection of objects (attributes) to which the concept ⟨A, B⟩ applies. ⟨A₁, B₁⟩ ≤ ⟨A₂, B₂⟩ means that ⟨A₂, B₂⟩ is more general than ⟨A₁, B₁⟩ since it applies to a larger collection of objects (or, equivalently, applies to a smaller collection of attributes). ≤ therefore models the subconcept-superconcept hierarchy. This way, FCA captures a traditional approach to concepts and conceptual hierarchy (Arnauld & Nicole, 1662). Alternatively, formal concepts can be defined as maximal rectangles in the table ⟨X, Y, I⟩ that are full of ×s. The following assertion is called the main theorem of concept lattices.
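As a quick illustration (an illustrative toy context of our own, not one from the chapter), the operators ↑ and ↓ and a brute-force enumeration of all formal concepts can be written in a few lines of Python; practical implementations use Ganter's NextClosure instead of enumerating all subsets:

```python
from itertools import combinations

def up(A, Y, I):
    """A↑: attributes shared by all objects in A."""
    return frozenset(y for y in Y if all((x, y) in I for x in A))

def down(B, X, I):
    """B↓: objects having all attributes in B."""
    return frozenset(x for x in X if all((x, y) in I for y in B))

def concepts(X, Y, I):
    """All formal concepts ⟨A, B⟩ with A↑ = B and B↓ = A (brute force)."""
    result = set()
    objs = list(X)
    for r in range(len(objs) + 1):
        for A in combinations(objs, r):
            B = up(frozenset(A), Y, I)
            result.add((down(B, X, I), B))  # ⟨A↑↓, A↑⟩ is always a concept
    return result
```

Since every pair ⟨A↑↓, A↑⟩ is a concept and every extent arises this way, the set `result` is exactly B(X, Y, I).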
Theorem 1. (Wille, 1982) (1) B(X, Y, I) equipped with ≤ is a complete lattice with infima and suprema given by:

⋀_{j∈J} ⟨A_j, B_j⟩ = ⟨⋂_{j∈J} A_j, (⋃_{j∈J} B_j)↓↑⟩,
⋁_{j∈J} ⟨A_j, B_j⟩ = ⟨(⋃_{j∈J} A_j)↑↓, ⋂_{j∈J} B_j⟩.

(2) Moreover, an arbitrary complete lattice V = ⟨V, ≤⟩ is isomorphic to B(X, Y, I) iff there are mappings γ: X → V and μ: Y → V such that γ(X) is supremally dense in V, μ(Y) is infimally dense in V, and γ(x) ≤ μ(y) iff ⟨x, y⟩ ∈ I. A subset K of a complete lattice V is called infimally (supremally) dense if each element of V is an infimum (supremum) of some elements of K.

An attribute implication A ⇒ B over a set Y of attributes, that is, A, B ⊆ Y (see the introduction), is true (valid) in a set M ⊆ Y of attributes iff A ⊆ M implies B ⊆ M.
If M is a set of attributes shared by an object x, then A ⇒ B being true in M means that if x has all attributes from A, then x has all attributes from B. A ⇒ B is true in X , Y , I iff A ⇒ B is true in each {x}↑, that is, in each row of table X , Y , I . A nonredundant basis of X , Y , I is a minimal set T of attribute implications such that every attribute implication A ⇒ B is true in X , Y , I iff A ⇒ B follows from T in that A ⇒ B is true in each M in which every attribute implication from T is true. An important nonredundant basis, a computationally tractable one, is a so-called Guigues-Duquenne basis (Ganter & Wille, 1999). For further details, we refer the reader to Ganter and Wille and also to Carpineto and Romano (2004a).
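Both notions of validity are directly executable; the following sketch (Python, with an illustrative toy context of our own) checks an implication in a single attribute set and then row by row in a whole table:

```python
def true_in(A, B, M):
    """A ⇒ B is true in an attribute set M iff A ⊆ M implies B ⊆ M."""
    return not A <= M or B <= M

def true_in_table(A, B, X, Y, I):
    """A ⇒ B holds in ⟨X, Y, I⟩ iff it is true in every row {x}↑."""
    return all(true_in(A, B, {y for y in Y if (x, y) in I}) for x in X)
```

A nonredundant basis would then be a minimal set of such implications entailing exactly the implications for which `true_in_table` returns `True`.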
Fuzzy Sets and Fuzzy Logic

We now recall basic notions of fuzzy logic and fuzzy set theory (for details, see, e.g., Belohlavek, 2002; Gerla, 2001; Hájek, 1998; Klir & Yuan, 1995). We pick so-called complete residuated lattices as our basic structures of truth degrees (i.e., sets of truth degrees equipped with fuzzy logic operations like implication). A complete residuated lattice is a structure L = ⟨L, ∧, ∨, ⊗, →, 0, 1⟩ where L is a set of truth degrees; ∧, ∨, ⊗, → are operations on L; and 0, 1 are two designated truth degrees from
L. As an example, we can have L = [0, 1]; that is, L is the real unit interval, but in general, elements of L need not be numbers. ∧ and ∨ are infimum and supremum on L. Note that if L = [0, 1], ∧ and ∨ coincide with the minimum and maximum. L equipped with ∧ and ∨ is required to form a complete lattice. This is needed because of the semantics of the universal and existential quantifiers in fuzzy logic. ⊗ and → are truth functions of fuzzy conjunction and fuzzy implication. Although we have many choices of ⊗ and → (see below), the choice of ⊗ and → cannot be arbitrary. ⊗ and → need to satisfy certain properties, and certain relationships, such as the adjointness property (see below), need to be satisfied between ⊗ and →. These properties enable us to properly extend to a fuzzy setting various results from a crisp setting. Note also that the properties and relationships imposed by the concept of a complete residuated lattice are quite natural and not restrictive. Formally, a complete residuated lattice is an algebra L = ⟨L, ∧, ∨, ⊗, →, 0, 1⟩ such that ⟨L, ∧, ∨, 0, 1⟩ is a complete lattice with 0 and 1 being the least and greatest elements of L, respectively; ⟨L, ⊗, 1⟩ is a commutative monoid (i.e., ⊗ is commutative and associative, and a ⊗ 1 = 1 ⊗ a = a for each a ∈ L); and ⊗ and → satisfy a ⊗ b ≤ c iff a ≤ b → c (adjointness property) for each a, b, c ∈ L. Moreover, we use the following concept of a (truth-stressing) hedge (Hájek, 1998, 2001). A hedge on a complete residuated lattice L is a mapping ∗: L → L satisfying 1∗ = 1, a∗ ≤ a, (a → b)∗ ≤ a∗ → b∗, and a∗∗ = a∗ for each a, b ∈ L. A biresiduum on L is a derived operation ↔ defined by a ↔ b = (a → b) ∧ (b → a). Elements a of L are called truth degrees; ⊗ and → are (truth functions of) fuzzy conjunction and fuzzy implication; the hedge ∗ is the (truth function of) logical connective very true; and ↔ is a (truth function of) fuzzy equivalence.
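The hedge axioms can be machine-checked on a finite chain; the sketch below (Python, our own illustration) verifies them for the two borderline hedges, identity and globalization, over the Gödel residuum:

```python
godel_res = lambda a, b: 1.0 if a <= b else b  # Gödel residuum as sample →

def is_hedge(star, L, res):
    """Check 1* = 1, a* ≤ a, (a → b)* ≤ a* → b*, and a** = a* on a finite L."""
    return (star(1.0) == 1.0
            and all(star(a) <= a for a in L)
            and all(star(res(a, b)) <= res(star(a), star(b))
                    for a in L for b in L)
            and all(star(star(a)) == star(a) for a in L))

identity = lambda a: a                               # weakest hedge
globalization = lambda a: 1.0 if a == 1.0 else 0.0   # strongest hedge
```

These two hedges are the extreme cases between which all hedges lie; both satisfy the four axioms above.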
A common choice of L is a structure with L = [0, 1] (real unit interval), with ∧ and ∨ being the minimum and maximum, and ⊗ being a left-continuous t-norm with the corresponding → . The three most important pairs of adjoint operations on the unit interval are Łukasiewicz, a ⊗ b = max(a + b − 1, 0) , a → b = min(1 − a + b,1) ;
Gödel (minimum): a ⊗ b = min(a, b); a → b = 1 if a ≤ b and a → b = b if a > b. Goguen (product): a ⊗ b = a · b; a → b = 1 if a ≤ b and a → b = b/a if a > b. Other examples are finite chains, for example, L = {a₀ = 0, a₁, …, aₙ = 1} ⊆ [0, 1] (a₀ < ⋯ < aₙ) with ⊗ and → given by aₖ ⊗ aₗ = a_max(k+l−n, 0) and aₖ → aₗ = a_min(n−k+l, n) (finite Łukasiewicz chain), or with ⊗ and → being the restrictions of the above Gödel operations on [0, 1] to L. A special case is the two-element Boolean algebra, which we will denote by 2. An L-set (fuzzy set) A in a universe U is a mapping A: U → L, with A(u) being interpreted as the degree to which u belongs to A. If U = {u₁, …, uₙ}, then A can be denoted by A = {a₁/u₁, …, aₙ/uₙ}, meaning that A(uᵢ) equals aᵢ; we write {u, 0.5/v} instead of {1/u, 0.5/v, 0/w}, and so forth. L^U denotes the collection of all L-sets in U; basic operations with L-sets are defined componentwise. An L-set A ∈ L^U is called crisp if A(u) ∈ {0, 1} for each u ∈ U. Crisp L-sets can be identified with ordinary sets. For a crisp set A, we also write u ∈ A for A(u) = 1 and u ∉ A for A(u) = 0. An L-set A ∈ L^U is called empty (denoted by ∅) if A(u) = 0 for each u ∈ U. For a ∈ L and A ∈ L^U, the ordinary set ᵃA = {u ∈ U | A(u) ≥ a} is called the a-cut of A. Given A, B ∈ L^U, we define a degree S(A, B) to which A is contained in B and a degree A ≈ B to which A is equal to B by:
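The three standard adjoint pairs and the adjointness property are easy to verify computationally; a sketch (Python, checking adjointness on a finite grid of truth degrees — our own illustration):

```python
# The three standard adjoint pairs ⟨⊗, →⟩ on L = [0, 1]
luk = (lambda a, b: max(a + b - 1.0, 0.0),
       lambda a, b: min(1.0 - a + b, 1.0))           # Łukasiewicz
godel = (lambda a, b: min(a, b),
         lambda a, b: 1.0 if a <= b else b)          # Gödel
goguen = (lambda a, b: a * b,
          lambda a, b: 1.0 if a <= b else b / a)     # Goguen

def adjoint(t, r, grid):
    """Adjointness: a ⊗ b ≤ c iff a ≤ b → c, checked on all triples from grid."""
    return all((t(a, b) <= c) == (a <= r(b, c))
               for a in grid for b in grid for c in grid)
```

On a grid of quarter steps, all three pairs pass the adjointness check, as the definition of a complete residuated lattice requires.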
S(A, B) = ⋀_{u∈U} (A(u) → B(u)),
A ≈ B = ⋀_{u∈U} (A(u) ↔ B(u)).

In particular, we write A ⊆ B iff S(A, B) = 1. A binary L-relation ≈ on U is called an L-equivalence if for any u, v, w ∈ U we have u ≈ u = 1 (reflexivity), u ≈ v = v ≈ u (symmetry), and (u ≈ v) ⊗ (v ≈ w) ≤ (u ≈ w) (transitivity). An L-equality is an L-equivalence satisfying u = v whenever u ≈ v = 1. Throughout this chapter, we use the following convention: if we want to emphasize the structure L of truth degrees, we say L-set, L-Galois connection, and so forth instead of fuzzy set, fuzzy Galois connection, and so on, which we use if L is not important or clear from context.

For further details, we refer the reader to Belohlavek (2002c), Gottwald (2001), Hájek (1998), and Klir and Yuan (1995).

CONCEPT LATTICES OF TABLES WITH FUZZY ATTRIBUTES

Concept Lattices

Data tables with fuzzy attributes. A data table with fuzzy attributes, or a formal fuzzy context, is a triplet ⟨X, Y, I⟩ where X and Y are sets, and I: X × Y → L is a binary fuzzy relation between X and Y that takes values in the support L of L. X and Y are usually assumed to be finite; elements of X and Y are called objects and attributes, respectively. A degree I(x, y) ∈ L is interpreted as a degree to which object x ∈ X has attribute y ∈ Y. The notion of a data table with fuzzy attributes is our formal counterpart to tables such as the one in Figure 1 (right), with an obvious correspondence: objects x ∈ X and attributes y ∈ Y correspond to table rows and columns, respectively; I(x, y) is the table entry in the row corresponding to x and the column corresponding to y.

Arrow operators, formal concepts, and concept lattices. Each table ⟨X, Y, I⟩ with fuzzy attributes induces a pair of operators ⇑: L^X → L^Y and ⇓: L^Y → L^X defined by:

A⇑(y) = ⋀_{x∈X} (A(x) → I(x, y)),
B⇓(x) = ⋀_{y∈Y} (B(y) → I(x, y)),     (1)

for each A ∈ L^X and B ∈ L^Y, and x ∈ X and y ∈ Y. A formal (fuzzy) concept of ⟨X, Y, I⟩ is a pair ⟨A, B⟩ of fuzzy sets A ∈ L^X and B ∈ L^Y satisfying A⇑ = B and B⇓ = A. Introduce the following sets:

B(X, Y, I) = {⟨A, B⟩ ∈ L^X × L^Y | A⇑ = B, B⇓ = A},     (2)

Ext(X, Y, I) = {A ∈ L^X | ⟨A, B⟩ ∈ B(X, Y, I) for some B},     (3)

Int(X, Y, I) = {B ∈ L^Y | ⟨A, B⟩ ∈ B(X, Y, I) for some A}.     (4)

That is, B(X, Y, I) is the set of all formal concepts in ⟨X, Y, I⟩. Introduce a partial order ≤ on B(X, Y, I) by:

⟨A₁, B₁⟩ ≤ ⟨A₂, B₂⟩ iff A₁ ⊆ A₂ (iff B₂ ⊆ B₁).     (5)

The set B(X, Y, I) equipped with ≤ is called a (fuzzy) concept lattice of ⟨X, Y, I⟩.
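The operators of Equation 1 translate directly into code; the sketch below (Python, our own toy example, with the Łukasiewicz residuum as the sample choice of →) computes A⇑ and B⇓ for fuzzy sets represented as dictionaries of degrees:

```python
res = lambda a, b: min(1.0 - a + b, 1.0)  # Łukasiewicz residuum as sample →

def arrow_up(A, I, X, Y):
    """A⇑(y) = ⋀_{x∈X} (A(x) → I(x, y))."""
    return {y: min(res(A[x], I[(x, y)]) for x in X) for y in Y}

def arrow_down(B, I, X, Y):
    """B⇓(x) = ⋀_{y∈Y} (B(y) → I(x, y))."""
    return {x: min(res(B[y], I[(x, y)]) for y in Y) for x in X}
```

A pair ⟨A, B⟩ with `arrow_up(A, ...) == B` and `arrow_down(B, ...) == A` is then a formal fuzzy concept in the sense of Equation 2.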
Remark 1. (1) Using basic principles of fuzzy logic, one can see that A⇑ ( y ) is a truth degree stating that for each object x, if x belongs to A, then x has attribute y. Therefore, A⇑ is a fuzzy set of all attributes shared by all objects from A. Analogously, B ⇓ is a fuzzy set of all objects sharing all attributes from B. (2) Therefore, A, B is a formal concept iff A is the fuzzy set of all objects sharing all attributes from B, and B is the fuzzy set of all attributes shared by all objects from A. Elements of Ext( X , Y , I ) are called extents; elements of Int( X , Y , I ) are called intents. (3) An intuitive interpretation and terminology comes from the Port-Royal approach to concepts (Arnauld & Nicole, 1662). Under Port-Royal, a concept is understood as consisting of a collection A of objects to which it applies and a collection B of attributes to which it applies. For example, the extent of the concept dog consists of all dogs, and the intent of dog consists of all attributes common to dogs (they bark, have tails, etc.). Note that from the point of view of the fuzzy approach, it is quite natural that extents and intents of concepts are fuzzy sets. Namely, this allows us to capture vaguely delineated concepts like large dogs.
(4) The partial order ≤ is interpreted as a subconcept-superconcept hierarchy. Namely, ⟨A₁, B₁⟩ ≤ ⟨A₂, B₂⟩ means that ⟨A₂, B₂⟩ is more general than ⟨A₁, B₁⟩ since it applies to a larger collection of objects (or, alternatively, applies to a smaller collection of attributes). The structure of concept lattices will be investigated later. Among others, we will see that B(X, Y, I) equipped with ≤ is indeed a complete lattice. (5) Later on, we will study modifications of ⇑ and ⇓. Nevertheless, we start with ⇑ and ⇓ since, as we will see later, they play the role of basic arrow operators. (6) One can see that for L = 2 (two-element Boolean algebra), the above notions coincide with the corresponding notions from ordinary FCA (provided we identify crisp fuzzy sets and relations with ordinary sets and relations).

Alternatively, formal concepts can be defined as maximal rectangles contained in ⟨X, Y, I⟩. Call a rectangle any pair ⟨A, B⟩ ∈ L^X × L^Y. Put ⟨A₁, B₁⟩ ⊑ ⟨A₂, B₂⟩ iff for each x ∈ X and y ∈ Y we have A₁(x) ≤ A₂(x) and B₁(y) ≤ B₂(y) (⟨A₁, B₁⟩ is a subrectangle of ⟨A₂, B₂⟩). We say that ⟨A, B⟩ is contained in I iff for each x ∈ X and y ∈ Y we have A(x) ⊗ B(y) ≤ I(x, y). Then we have Theorem 2.
Theorem 2. (Belohlavek, 2002c) ⟨A, B⟩ is a formal concept of ⟨X, Y, I⟩ iff ⟨A, B⟩ is a maximal (with respect to ⊑) rectangle contained in I.

Remark 2. Theorem 2 provides a useful way of looking at formal concepts. In the crisp case (the table contains ×s and blanks), Theorem 2 says that formal concepts are maximal rectangles in the table that are full of ×s.

Fuzzy Galois Connections and Closure Operators

We now turn to selected results on Galois connections and closure operators in a fuzzy setting
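The rectangle view of Theorem 2 is straightforward to check computationally; a sketch (Python, with the Łukasiewicz ⊗ as a sample choice and toy data of our own):

```python
t_norm = lambda a, b: max(a + b - 1.0, 0.0)  # Łukasiewicz ⊗ as sample choice

def contained(A, B, I):
    """Rectangle ⟨A, B⟩ is contained in I iff A(x) ⊗ B(y) ≤ I(x, y) for all x, y."""
    return all(t_norm(A[x], B[y]) <= I[(x, y)] for x in A for y in B)
```

A formal concept is then a rectangle that is contained in I and cannot be enlarged (with respect to ⊑) without violating containment.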
that are the basic structures related to the arrow operators ⇑ and ⇓ . These results are taken from Belohlavek (1999, 2001a, 2002b, 2003), to which we refer for details (further results, comments, examples, etc.).
Fuzzy Galois Connections

Throughout this section, K denotes a ≤-filter in L; that is, K ⊆ L satisfies: if a ∈ K and a ≤ b, then b ∈ K. Sometimes, K is assumed to be a filter in L, that is, a ≤-filter satisfying a ⊗ b ∈ K whenever a, b ∈ K. An L_K-Galois connection between nonempty sets X and Y is a pair ⟨⇑, ⇓⟩ of mappings ⇑: L^X → L^Y, ⇓: L^Y → L^X satisfying:

S(A₁, A₂) ≤ S(A₂⇑, A₁⇑) whenever S(A₁, A₂) ∈ K,     (6)
S(B₁, B₂) ≤ S(B₂⇓, B₁⇓) whenever S(B₁, B₂) ∈ K,     (7)
A ⊆ A⇑⇓,     (8)
B ⊆ B⇓⇑,     (9)

for every A, A₁, A₂ ∈ L^X and B, B₁, B₂ ∈ L^Y.
Remark 3. (1) We usually omit the phrase “between X and Y” and say just “L_K-Galois connection.” For L = 2 (ordinary case), we obtain the usual notion of a Galois connection between sets. (2) K controls the meaning of the antitony conditions (Equations 6 and 7). Two important cases are K = L and K = {1}. For instance, Equation 6 becomes S(A₁, A₂) ≤ S(A₂⇑, A₁⇑) for K = L, and it becomes “if A₁ ⊆ A₂, then A₂⇑ ⊆ A₁⇑” for K = {1}. Clearly, for K₁ ⊆ K₂, each L_K₂-Galois connection is also an L_K₁-Galois connection. (3) Equations 6 and 7 can be simplified (Belohlavek, 2001a): ⟨⇑, ⇓⟩ is an L_K-Galois connection iff S(A, B) ∈ K or S(B, A) ∈ K implies S(A, B⇓) = S(B, A⇑). (4) L_K-Galois connections obey several useful properties that we omit here due to lack of space.

Axiomatic characterization of arrow operators. The arrow operators defined by Equation 1 can be characterized axiomatically. Namely, they turn out to be just the L_L-Galois connections.
Theorem 3. (Belohlavek, 1999) For a binary L-relation I between X and Y, denote by ⟨⇑_I, ⇓_I⟩ the pair of mappings defined by Equation 1. For an L_L-Galois connection ⟨⇑, ⇓⟩ between X and Y, denote by I_⟨⇑,⇓⟩ the binary L-relation between X and Y defined by:

I_⟨⇑,⇓⟩(x, y) = {1/x}⇑(y) = {1/y}⇓(x).

Then ⟨⇑_I, ⇓_I⟩ is an L_L-Galois connection, and I ↦ ⟨⇑_I, ⇓_I⟩ and ⟨⇑, ⇓⟩ ↦ I_⟨⇑,⇓⟩ define a bijective correspondence between binary L-relations and L_L-Galois connections between X and Y.
Remark 4. Theorem 3 generalizes a classical result by Ore (1944).
Representation by ordinary Galois connections: Case 1. A natural question regarding the relationship of ordinary and fuzzy concept lattices is the following: Is there not some simple relationship between the arrow operators ⇑_I and ⇓_I induced by a fuzzy relation I on the one hand, and the ordinary arrow operators ⇑_ᵃI and ⇓_ᵃI induced by the a-cuts ᵃI of I on the other hand? For instance, is it not the case that ᵃ(A⇑_I) = (ᵃA)⇑_ᵃI, that is, that A⇑_I can be computed cut by cut using the operators ⇑_ᵃI? If yes, this would imply some simple relationships between B(X, Y, I) and B(X, Y, ᵃI). It turns out that the answer to the above question is negative. Nevertheless, there is a relationship between fuzzy Galois connections and ordinary Galois connections, which we present here. It consists of establishing a bijective correspondence between L_L-Galois connections and particular systems of ordinary Galois connections. A system {⟨⇑_a, ⇓_a⟩ | a ∈ L} of ordinary Galois connections between X and Y is called L-nested if (a) for each a, b ∈ L with a ≤ b, A ⊆ X, and B ⊆ Y we have A⇑_a ⊇ A⇑_b and B⇓_a ⊇ B⇓_b, and (b) for each x ∈ X and y ∈ Y, the set {a ∈ L | y ∈ {x}⇑_a} has a greatest element. Then we have Theorem 4.
Theorem 4. (Belohlavek, 1999, 2002c) For an L_L-Galois connection ⟨⇑, ⇓⟩, denote C_⟨⇑,⇓⟩ = {⟨⇑_a, ⇓_a⟩ | a ∈ L}, where ⇑_a: 2^X → 2^Y and ⇓_a: 2^Y → 2^X are defined by A⇑_a = ᵃ(A⇑) and B⇓_a = ᵃ(B⇓) for A ∈ 2^X and B ∈ 2^Y. For an L-nested system C = {⟨⇑_a, ⇓_a⟩ | a ∈ L} of ordinary Galois connections, denote by ⟨⇑_C, ⇓_C⟩ the pair of mappings ⇑_C: L^X → L^Y and ⇓_C: L^Y → L^X defined for A ∈ L^X and B ∈ L^Y by:

A⇑_C(y) = ⋁{a | y ∈ ⋂_{b∈L} (ᵇA)⇑_{a⊗b}},
B⇓_C(x) = ⋁{a | x ∈ ⋂_{b∈L} (ᵇB)⇓_{a⊗b}}.

Then:

1. C_⟨⇑,⇓⟩ is an L-nested system of ordinary Galois connections,
2. ⟨⇑_C, ⇓_C⟩ is an L_L-Galois connection, and
3. ⟨⇑, ⇓⟩ ↦ C_⟨⇑,⇓⟩ and C ↦ ⟨⇑_C, ⇓_C⟩ define a bijective correspondence between L_L-Galois connections and L-nested systems of ordinary Galois connections.
Remark 5. (1) Note that Theorem 4 can be obtained as a consequence of results on cut-like semantics for fuzzy logic as presented in Belohlavek (2002c). A particular (and trivial) case of the cut-like semantics is the representation of fuzzy sets by their a-cuts. (2) Theorem 4 can be used to gain insight into some approaches to FCA in a fuzzy setting that are based on decomposing ⟨X, Y, I⟩ into the cuts ⟨X, Y, ᵃI⟩ (see Belohlavek & Vychodil, 2005f).
Representation by ordinary Galois connections: Case 2. We now present another representation of fuzzy Galois connections by ordinary Galois connections. It consists of establishing a bijective correspondence between L_{1}-Galois connections between X and Y and particular ordinary Galois connections between X × L and Y × L. This representation is useful for establishing a relationship between fuzzy and ordinary concept lattices. For A ∈ L^U, let ⌊A⌋ ⊆ U × L be defined by ⌊A⌋ = {⟨u, a⟩ | a ≤ A(u)}. Thus, ⌊A⌋ is the area below the membership function A in U × L. For A ⊆ U × L, let ⌈A⌉ ∈ L^U be defined by ⌈A⌉(u) = ⋁{a | ⟨u, a⟩ ∈ A}. Thus, ⌈A⌉ is a fuzzy set in U resulting as an upper envelope of A. An ordinary Galois connection ⟨∧, ∨⟩ between X × L and Y × L is called commutative with respect to ⌊⌈·⌉⌋ if for each A ⊆ X × L and B ⊆ Y × L we have:

⌊⌈A⌉⌋∧ = A∧ and ⌊⌈B⌉⌋∨ = B∨.     (10)

For a pair ⟨∧, ∨⟩ of mappings ∧: 2^(X×L) → 2^(Y×L) and ∨: 2^(Y×L) → 2^(X×L), introduce a pair ⟨⇑_⟨∧,∨⟩, ⇓_⟨∧,∨⟩⟩ of mappings ⇑_⟨∧,∨⟩: L^X → L^Y and ⇓_⟨∧,∨⟩: L^Y → L^X by:

A⇑_⟨∧,∨⟩ = ⌈⌊A⌋∧⌉ and B⇓_⟨∧,∨⟩ = ⌈⌊B⌋∨⌉     (11)

for A ∈ L^X and B ∈ L^Y. For a pair ⟨⇑, ⇓⟩ of mappings ⇑: L^X → L^Y and ⇓: L^Y → L^X, define a pair ⟨∧_⟨⇑,⇓⟩, ∨_⟨⇑,⇓⟩⟩ of mappings ∧_⟨⇑,⇓⟩: 2^(X×L) → 2^(Y×L) and ∨_⟨⇑,⇓⟩: 2^(Y×L) → 2^(X×L) by:

A∧_⟨⇑,⇓⟩ = ⌊⌈A⌉⇑⌋ and B∨_⟨⇑,⇓⟩ = ⌊⌈B⌉⇓⌋     (12)

for A ⊆ X × L and B ⊆ Y × L. Then we have Theorem 5.
Theorem 5. (Belohlavek, 2001b) Let ⟨⇑, ⇓⟩ be an L_{1}-Galois connection between X and Y, and let ⟨∧, ∨⟩ be an ordinary Galois connection between X × L and Y × L that is commutative with respect to ⌊⌈·⌉⌋. Then:

1. ⟨∧_⟨⇑,⇓⟩, ∨_⟨⇑,⇓⟩⟩ is an ordinary Galois connection between X × L and Y × L that is commutative with respect to ⌊⌈·⌉⌋,
2. ⟨⇑_⟨∧,∨⟩, ⇓_⟨∧,∨⟩⟩ is an L_{1}-Galois connection between X and Y, and
3. sending ⟨⇑, ⇓⟩ to ⟨∧_⟨⇑,⇓⟩, ∨_⟨⇑,⇓⟩⟩ and ⟨∧, ∨⟩ to ⟨⇑_⟨∧,∨⟩, ⇓_⟨∧,∨⟩⟩ defines a bijective correspondence between L_{1}-Galois connections between X and Y and commutative ordinary Galois connections between X × L and Y × L.
This observation has some important consequences for the relationship between fuzzy concept lattices and ordinary concept lattices. We now present selected results. Under the above notation, denote:

B(X, Y, ⇑, ⇓) = {⟨A, B⟩ ∈ L^X × L^Y | A⇑ = B, B⇓ = A},
B(X × L, Y × L, ⟨∧, ∨⟩) = {⟨A, B⟩ ∈ 2^(X×L) × 2^(Y×L) | A∧ = B, B∨ = A},

that is, the sets of fixed points of the respective Galois connections. Note that if ⟨⇑, ⇓⟩ are the arrow operators induced by ⟨X, Y, I⟩, then B(X, Y, ⇑, ⇓) is just the L-concept lattice B(X, Y, I). Then, using Lemma 1, one can prove Theorem 6.

Lemma 1. (Belohlavek, 2001b) For any L_K-Galois connection ⟨⇑, ⇓⟩, if ⟨∧, ∨⟩ = ⟨∧_⟨⇑,⇓⟩, ∨_⟨⇑,⇓⟩⟩ as in Theorem 5, then:

1. B(X, Y, ⇑, ⇓) and B(X × L, Y × L, ⟨∧, ∨⟩) are isomorphic lattices. Moreover:
2. B(X × L, Y × L, ⟨∧, ∨⟩) = B(X × L, Y × L, I×), where I× ⊆ (X × L) × (Y × L) is defined by ⟨⟨x, a⟩, ⟨y, b⟩⟩ ∈ I× iff b ≤ {a/x}⇑(y).

Theorem 6. (Belohlavek, 2001b) Any L-concept lattice B(X, Y, I) is isomorphic to the ordinary concept lattice B(X × L, Y × L, I×), where ⟨⟨x, a⟩, ⟨y, b⟩⟩ ∈ I× iff a ⊗ b ≤ I(x, y). An isomorphism is given by sending ⟨A, B⟩ ∈ B(X, Y, I) to ⟨⌊A⌋, ⌊B⌋⟩ ∈ B(X × L, Y × L, I×).

As an almost direct consequence of Lemma 1 and Theorem 2, we get a theorem characterizing the lattice of fixed points of L_{1}-Galois connections (Belohlavek, 2001b, Theorem 3.4), a particular case of which is the following theorem.

Theorem 7. (Belohlavek, 2001b) Let ⟨X, Y, I⟩ be a data table with fuzzy attributes.

1. Then B(X, Y, I) is a complete lattice with respect to ≤, where the suprema and infima are given by:

⋀_{j∈J} ⟨A_j, B_j⟩ = ⟨⋂_{j∈J} A_j, (⋃_{j∈J} B_j)⇓⇑⟩,
⋁_{j∈J} ⟨A_j, B_j⟩ = ⟨(⋃_{j∈J} A_j)⇑⇓, ⋂_{j∈J} B_j⟩.

2. Moreover, an arbitrary complete lattice V = ⟨V, ≤⟩ is isomorphic to B(X, Y, I) iff there are mappings γ: X × L → V and μ: Y × L → V such that:
• γ(X, L) is supremally dense in V and μ(Y, L) is infimally dense in V, and
• γ(x, a) ≤ μ(y, b) iff a ⊗ b ≤ I(x, y).
Note that Theorem 6 is a reduction theorem, which, in principle, enables us to reduce several problems concerning fuzzy concept lattices (e.g., computing a fuzzy concept lattice) to the corresponding problems of ordinary concept lattices. We will come back to this issue later on. Theorem 7 plays the role of a main theorem for concept lattices in a fuzzy setting. Note that Theorem 1, that is, the main theorem for ordinary concept lattices, is a particular case of Theorem 7. As we will see in the section “Main Theorem on Concept Lattices,” Theorem 7 is a version of the main theorem for concept lattices that concerns crisp order on B(X, Y, I). The other version, concerning fuzzy order on B(X, Y, I), is presented in “Main Theorem on Concept Lattices,” where we will also see an alternative way to prove Theorem 7 (directly, not via reduction to the ordinary case).
Fuzzy Closure Operators

Fuzzy closure operators are important structures widely studied in fuzzy set theory (see, e.g., Belohlavek, 2002c; Gerla, 2001). They are closely related to FCA in a fuzzy setting, but play a role in other areas as well, as in the case of ordinary closure operators. Let K be a filter in L (in some cases, a ≤-filter suffices). An L_K-closure operator in a nonempty set X is a mapping C: L^X → L^X satisfying:

A ⊆ C(A),     (13)
S(A₁, A₂) ≤ S(C(A₁), C(A₂)) whenever S(A₁, A₂) ∈ K,     (14)
C(A) = C(C(A))     (15)

for every A, A₁, A₂ ∈ L^X.
Remark 6. As in the case of L_K-Galois connections, K influences the meaning of the monotony condition (Equation 14). Two important cases are K = L and K = {1}, for which Equation 14 becomes S(A₁, A₂) ≤ S(C(A₁), C(A₂)) and “if A₁ ⊆ A₂, then C(A₁) ⊆ C(A₂),” respectively. Note that most of the literature on fuzzy closure operators deals with K = {1} only. Results related to fuzzy closure operators we present here are contained mainly in Belohlavek (2001a, 2002b). In what follows, we present selected results of these papers. The first result concerns a characterization of systems of fixed points of L_K-closure operators. Recall that it is well known from the ordinary case that a system S of subsets of X is a system of fixed points of some closure operator on X iff it is closed under arbitrary intersections. In our setting we have Theorem 8.
Theorem 8. (Belohlavek, 2001a) A system S ⊆ L^X is a system of fixed points of some L_K-closure operator C in X, that is, S = {A ∈ L^X | A = C(A)}, iff for each a ∈ K and A ∈ S we have a → A ∈ S, and for any Aᵢ ∈ S (i ∈ I) we have ⋂_{i∈I} Aᵢ ∈ S.
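The two closure properties of Theorem 8 — closedness under a-shifts and under pointwise intersections — correspond to two one-line operations on fuzzy sets; a sketch (Python, with the Gödel residuum as a sample choice of →):

```python
res = lambda a, b: 1.0 if a <= b else b  # Gödel residuum as sample →

def shift(a, A):
    """a-shift of a fuzzy set: (a → A)(x) = a → A(x)."""
    return {x: res(a, v) for x, v in A.items()}

def intersect(*sets):
    """Pointwise infimum ⋂ A_i of fuzzy sets over the same universe."""
    keys = sets[0].keys()
    return {x: min(A[x] for A in sets) for x in keys}
```

A candidate system S can thus be tested for being a fixed-point system by checking that `shift(a, A)` and `intersect(...)` never leave S.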
Remark 7. (1) Note that a → A (a shift of A) is defined by (a → A)(x) = a → A(x). That is, systems of fixed points are just systems closed under a-shifts for a ∈ K and closed under arbitrary intersections. (2) Belohlavek (2001a) contains further characterizations of systems of fixed points of fuzzy closure operators and describes explicitly the bijective mappings between L_K-closure operators and systems of their fixed points. (3) Belohlavek (2002b) contains further results on L_K-closure operators, namely, fuzzy closure operators induced by binary fuzzy relations, the representation of L_{1}-closure operators in X by ordinary closure operators in X × L, operators of consequence, and some further results.

Fuzzy closure operators and Galois connections. In this section, we present selected results on relationships between fuzzy closure operators and fuzzy Galois connections. We have seen that the arrow operators ⇑ and ⇓ induced by a table with fuzzy attributes form an L_L-Galois connection. The following result is an excerpt of results from Belohlavek (2001a) that describe a bijective correspondence between L_K-Galois connections and pairs of L_K-closure operators with dually isomorphic systems of fixed points.
Theorem 9. (Belohlavek, 2001a) Let ⟨⇑, ⇓⟩ be an L_L-Galois connection between X and Y, and let C be an L_L-closure operator on X. Then:

1. C_⟨⇑,⇓⟩: L^X → L^X defined by C_⟨⇑,⇓⟩(A) = A⇑⇓ is an L_L-closure operator on X,
2. for Y = {A ∈ L^X | A = C(A)}, the operators ⇑_C and ⇓_C defined by A⇑_C(A′) = S(A, A′) and B⇓_C(x) = ⋀_{A∈Y} (B(A) → A(x)) form an L_L-Galois connection between X and Y, and
3. C = C_⟨⇑_C,⇓_C⟩.

Therefore, given ⟨X, Y, I⟩, both ⇑⇓ and ⇓⇑ are L_L-closure operators.
Computing a concept lattice. Since:

B(X, Y, I) = {⟨A, A⇑⟩ | A ∈ Ext(X, Y, I)} and Ext(X, Y, I) = fix(⇑⇓),

where fix(⇑⇓) is the set of all fixed points of ⇑⇓, in order to compute B(X, Y, I), it is sufficient if we are able to compute fix(C) for a given fuzzy closure operator C. Computing systems of fixed points of fuzzy closure operators appears several times in FCA (we will see some cases later). For this purpose, we now briefly present an algorithm that is an extension of Ganter’s NextClosure algorithm (Ganter & Wille, 1999) to our setting (for details, see Belohlavek, 2002a). The algorithm outputs all fixed points of C in a lexicographic order defined below. Suppose X = {1, 2, …, n} and L = {0 = a₁ < a₂ < ⋯ < aₖ = 1} (the assumption that L is linearly ordered is in fact not essential). For i, r ∈ {1, …, n} and j, s ∈ {1, …, k} we have: (i, j) ≤ (r, s) iff
A smooth membership function with λ > 0 illustrates this definition. The degree of proximity decreases as (x, y) moves further away from (a, b). It reaches 1 if (x, y) = (a, b). Unfortunately, this membership function with unbounded support is difficult to represent. Alternatively, we can employ the following restricted but more practical function that defines a circle around (a, b) with radius r ∈ ℝ⁺ (see Exhibit B). Next, we define three geometric primitives on fuzzy points that are valid for both definitions of fuzzy points. Let p̃(a, b), q̃(c, d) ∈ P_f with a, b, c, d ∈ ℝ. See Exhibit C. In contrast to crisp points, for fuzzy points we also have a predicate for disjointedness. We are now able to define an object of the fuzzy spatial data type fpoint as a set of disjoint fuzzy points:
Exhibit B.

μ_{p̃(a,b)}(x, y) = 1 − √((x − a)² + (y − b)²) / r   if (x − a)² + (y − b)² ≤ r²,
μ_{p̃(a,b)}(x, y) = 0   otherwise.
Exhibit C.

(i) p̃(a, b) = q̃(c, d) :⇔ a = c ∧ b = d ∧ μ_{p̃(a,b)} = μ_{q̃(c,d)}
(ii) p̃(a, b) ≠ q̃(c, d) :⇔ ¬(p̃(a, b) = q̃(c, d))
(iii) p̃(a, b) and q̃(c, d) are disjoint :⇔ supp(p̃(a, b)) ∩ supp(q̃(c, d)) = ∅
Fuzzy Spatial Data Types
fpoint = {Q ⊆ P_f | ∀ p̃(a, b), q̃(c, d) ∈ Q: p̃(a, b) and q̃(c, d) are disjoint ∧ Q is finite}

Disjointedness of the single fuzzy points of a fuzzy point object is required since the membership degree of each single fuzzy point should be unique.
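Exhibit B's cone-shaped membership function and the disjointedness predicate of Exhibit C are easy to prototype; the sketch below (Python, our own illustration) represents a fuzzy point by its center and support radius:

```python
import math

def fuzzy_point(a, b, r):
    """Conical fuzzy point p̃(a, b) with support radius r (Exhibit B)."""
    def mu(x, y):
        d2 = (x - a) ** 2 + (y - b) ** 2
        return 1.0 - math.sqrt(d2) / r if d2 <= r * r else 0.0
    return mu

def disjoint(p1, p2):
    """Supports (closed discs) of two conical fuzzy points are disjoint
    iff their centers are farther apart than the sum of the radii."""
    (a1, b1, r1), (a2, b2, r2) = p1, p2
    return math.hypot(a2 - a1, b2 - b1) > r1 + r2
```

An fpoint value is then a finite collection of such (center, radius) triples that are pairwise `disjoint`.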
FUZZY LINES This subsection provides a concept of a fuzzy line object and introduces a corresponding fuzzy spatial data type fline. First, we informally discuss some features of fuzzy lines and compare them to crisp lines. Then, we give their formal definition.
What are Fuzzy Lines?

Lines are the one-dimensional geometric abstraction for linear features like rivers, boundaries, and transportation routes. Each crisp line is a subset of the Euclidean plane ℝ² with particular properties. Each element of a single crisp line is a crisp point that definitely and totally belongs to the line. A crisp line object (see, for example, Schneider, 1997; Schneider & Behr, 2006) includes a finite set of blocks. Each block consists of a finite set of simple lines (curves) such that each pair of simple lines is either disjoint or meets in a common end point. Fuzzy lines are supposed to adopt the fundamental structure of crisp lines. That is, a fuzzy line X has the same linear geometry as a crisp line and is hence a subset of ℝ²; however, each point of X may be a member of X only to some degree. For example, the pollution of a river can be represented by the line geometry of the river where each point represents the degree or concentration of pollution at that location. The concentration is larger than 0 for all points of the fuzzy line but usually differs at different locations.
Formal Definition of Fuzzy Lines We now specify the fuzzy spatial data type fline for fuzzy lines. For that, we first introduce a simple fuzzy line as a continuous curve with smooth transitions of membership grades between neighboring points of the line (Figure 1a). We assume a total order on ℝ² that is given by the lexicographic order < on the coordinates (first x, then y) of the points of ℝ². The membership function of a simple fuzzy line l̃ is then defined by μ_l̃ : f_l([0, 1]) → [0, 1] with f_l : [0, 1] → ℝ², such that the following apply:
(i) μ_l̃ is continuous
(ii) f_l is continuous
(iii) ∀ a, b ∈ ]0, 1[ : a ≠ b ⇒ f_l(a) ≠ f_l(b)
(iv) ∀ a ∈ {0, 1} ∀ b ∈ ]0, 1[ : f_l(a) ≠ f_l(b)
(v) f_l(0) < f_l(1) ∨ (f_l(0) = f_l(1) ∧ ∀ a ∈ ]0, 1[ : f_l(0) < f_l(a))
Function f_l on its own models a continuous, simple crisp line (a curve). The points f_l(0) and f_l(1) are called the end points of f_l. The definition allows loops (f_l(0) = f_l(1)) but prohibits the equality of interior points and thus self-intersections (Condition iii). The reason is that self-intersections do not occur in spatial reality; hence, our model excludes them. Condition iv disallows the equality of an interior point with an end point. Condition v requires that in a closed simple line, f_l(0) must be the leftmost point, that is, the smallest point with respect to the lexicographic order on ℝ². The set of all fuzzy simple lines is denoted by SL_f.

Figure 1. Example of a simple fuzzy line (a) and a (complex) fuzzy line (b). Fuzziness is indicated by shading. The complex fuzzy line consists of two fuzzy blocks that are made up of seven fuzzy simple lines.

A fuzzy block b = {l̃_1, ..., l̃_n} is given by a finite set of fuzzy simple lines subject to six conditions; the last one, which relates the membership values of elements meeting in a common end point, reads:

(vi) ∀ 1 ≤ i ≤ n ∀ a ∈ {0, 1} ∀ (j, k) ∈ V_{l_i}^a : μ_l̃_i(f_l_i(a)) = μ_l̃_j(f_l_j(k))

Intuitively, a fuzzy block is a maximal, connected fuzzy line component (Figure 1b). Condition i states that a fuzzy block consists of a finite set of fuzzy simple lines. Condition ii requires that the elements of a fuzzy block do not intersect or overlap within their interior. Moreover, they may not be touched within their interior by an end point of another element (Condition iii). The main reason for both conditions is, again, the uniqueness of representation. Condition iv ensures the property of connectivity of a fuzzy block; isolated fuzzy simple lines are disallowed. Condition v expresses that each end point of an element of b must belong to exactly one or more than two incident elements of b (note that always (i, a) ∈ V_{l_i}^a). This condition supports the requirement of maximal elements and hence achieves uniqueness of representation. Condition vi requires that more than two elements of b with a common end point must have the same membership value there; otherwise, we get a contradiction saying that a point of a fuzzy block has more than one membership value. The set of all fuzzy blocks over SL_f is denoted by B_f. The disjointedness of any two fuzzy blocks b1, b2 ∈ B_f is defined as follows:
b1 and b2 are disjoint :⇔ supp(b1) ∩ supp(b2) = ∅.
A fuzzy spatial data type for fuzzy lines called fline can now be defined in two equivalent ways. The structured view or component view is based on fuzzy blocks:

fline = { ∪_{i=1}^{n} b_i | n ∈ ℕ ∧ ∀ 1 ≤ i ≤ n : b_i ∈ B_f ∧ ∀ 1 ≤ i < j ≤ n : b_i and b_j are disjoint}

The unstructured view is based on fuzzy simple lines:

fline = { ∪_{i=1}^{m} l̃_i | m ∈ ℕ ∧ ∀ 1 ≤ i ≤ m : l̃_i ∈ SL_f ∧ ∀ 1 ≤ i < j ≤ m : (l̃_i and l̃_j are disjoint ∨ l̃_i and l̃_j meet) ∧ ∀ 1 ≤ i ≤ m ∀ a ∈ {0, 1} : (|V_{l_i}^a| = 1) ∨ (|V_{l_i}^a| > 2)}
An example of a fuzzy line object is given in Figure 1b.
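A simple fuzzy line can be sketched numerically as a parametric curve together with a continuous membership function along it. The concrete curve (a quarter circle arc) and the membership profile below are invented for illustration and are not part of the formal model; the self-intersection check only samples Condition iii at finitely many parameters.

```python
import math

def f_l(t):
    """Parametric curve f_l : [0, 1] -> R^2, here a quarter circle arc
    (injective on [0, 1], so no loop and no self-intersection)."""
    return (math.cos(t * math.pi / 2), math.sin(t * math.pi / 2))

def mu(t):
    """Continuous membership along the curve, e.g. pollution concentration:
    highest in the middle of the stretch, lower toward both end points."""
    return 0.2 + 0.8 * math.sin(math.pi * t) ** 2

def no_self_intersection(samples=100):
    """Sample-based check of Condition iii: distinct parameters must map
    to distinct points of the plane."""
    pts = [f_l(i / samples) for i in range(samples + 1)]
    return len(set(pts)) == len(pts)
```

Every point of the curve carries a membership value in (0, 1], matching the requirement that the concentration is larger than 0 for all points of the fuzzy line.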
Figure 2. Example of a fuzzy region consisting of two fuzzy faces as components
FUZZY REGIONS The aim of this subsection is to develop and formalize the concept of a fuzzy region and to introduce a corresponding fuzzy spatial data type fregion for them. First, we informally discuss some intrinsic features of fuzzy regions and compare them to classical crisp regions. Then, we provide their formal definition. Finally, we give some examples of possible membership functions for them.
What are Fuzzy Regions? Research on spatial data modeling has so far focused on crisp or determinate spatial objects. The properties of crisp regions have been described in many publications. A very general definition defines a crisp region as a set of disjoint, connected areal components, called faces, possibly with disjoint holes in the Euclidean space 2 (Clementini & Di Felice, 1996b; Schneider, 1997; Schneider & Behr, 2006). This model has the nice property that it is closed under (appropriately defined) geometric union, intersection, and difference operations. For example, if we intersect two crisp regions, the result is always a crisp region. The model also allows crisp regions to contain holes and islands within holes to any finite level. By analogy with the generalization of crisp sets to fuzzy sets, we strive for a generalization of crisp regions to fuzzy regions on the basis of the point set paradigm and fuzzy concepts. At the same time we would like to transfer the structural definition of crisp regions (that is, the component view) to fuzzy regions. Thus, the structure of a fuzzy region is supposed to be the same as for a crisp region but with the exception and generalization that amounts to a relaxation and hence greater
flexibility of the strict belonging or nonbelonging principle of a point in space to a specific region, and which enables a partial membership of a point in a region. This is just what the term fuzzy means here. There are at least three possible related interpretations for a point in a fuzzy region. First, this situation may be interpreted as the degree of belonging to which that point is inside or part of some areal feature. Consider the transition between a mountain and a valley and the problem to decide which points have to be assigned to the valley and which points to the mountain. Obviously, there is no strict boundary between them, and it seems to be more appropriate to model the transition by partial and multiple membership. Second, this situation may indicate the degree of compatibility of the individual point with the attribute or concept represented by the fuzzy region. An example is a “warm” area where we must decide for each point to which grade it corresponds to the concept warm. Third, this situation may be viewed as the degree of concentration of some attribute associated with the fuzzy region at the particular point. An example is air pollution where we can assume the highest concentration at power stations, for instance, and lower concentrations with increasing distance from them. All these related interpretations give evidence of fuzziness. When dealing with crisp regions, the user usually does not employ point sets as a method to conceptualize space. The user rather thinks in terms of sharply determined boundaries enclosing and grouping areas with equal properties or attributes and separating different regions with different properties from each other; he or she has purely qualitative concepts in mind. This view changes when fuzzy regions come into play. Besides the qualitative aspect, in particular the quantitative aspect becomes important, and boundaries in most cases disappear (between a valley and a mountain there is no strict boundary). 
The distribution of attribute values within a region and transitions between different regions may be smooth or continuous. This feature just characterizes fuzzy regions.
Figure 3. Example of a fuzzy set that is not a fuzzy region due to lower dimensional, geometric anomalies like cuts, punctures, and dangling lines
There are a lot of spatial phenomena showing a smooth behavior. Application examples are air pollution, temperature zones, magnetic fields, storm intensity, and sun insolation. Figure 2 demonstrates a possible visualization of a fuzzy region object that could model the expansion of air pollution caused by two nearby power stations. The left image shows a radial expansion of the first power station where the degree of pollution concentrates in the center (darker locations) and decreases with increasing distance from the power station (brighter locations). The right image shows the distribution of air pollution of the second power station that is surrounded by high mountains to the north, the south, and the west. Hence, the pollution cannot escape in these directions and finds its way out of the valley in an eastern direction. In both cases we can recognize the smooth transitions to the exterior. We call each connected component a fuzzy face.
Formal Definition of Fuzzy Regions Since our objective is to model two-dimensional fuzzy areal objects for spatial applications, we consider a fuzzy topology T on the Euclidean plane ℝ². In this spatial context, we denote the elements of T as fuzzy point sets. The membership function for a fuzzy point set Ã in the plane is then described by μ_Ã : ℝ² → [0, 1]. From an application point of view, there are two observations that prevent a definition of a fuzzy region simply as a fuzzy point set. We will discuss them now in more detail and at the same time elaborate properties of fuzzy regions.
Avoiding Geometric Anomalies Regularization. The first observation refers to a necessary regularization of fuzzy point sets. The first reason for this measure is that fuzzy (as well as crisp) regions that actually appear in spatial applications in most cases cannot be modeled as arbitrary point sets but have to be represented as point sets that do not have geometric anomalies and that are in a certain sense regular. Geometric anomalies relate to isolated or dangling line or point features and missing lines and points in the form of cuts and punctures. Spatial phenomena with such degeneracies never appear as entities in reality. The second reason is that, from a data-type point of view, we are interested in fuzzy spatial data types that satisfy closure properties for (appropriately defined) geometric union, intersection, and difference. We are, of course, confronted with the same problem in the crisp case where the problem can be avoided by the concept of regularity (Schneider, 1997; Tilove, 1980). It turns out to be useful to appropriately transfer this concept to the fuzzy case. Let Ã be a fuzzy set of a fuzzy topological space (ℝ², T). Then:
Ã is called a regular open fuzzy set if Ã = int_T(cl_T(Ã)). Whereas crisp regions are usually modeled as regular closed crisp sets, we will use regular open fuzzy sets due to their vagueness and their usual lack of boundaries. Regular open fuzzy sets avoid the aforementioned geometric anomalies, too. Since application examples show that fuzzy regions can also be partially bounded, we admit partial boundaries with a crisp or fuzzy character. For that purpose, we define the frontier of a fuzzy set as (the notation := means “is defined as”):

frontier_T(Ã) := {((x, y), μ_Ã(x, y)) | (x, y) ∈ supp(Ã) − supp(int_T(Ã))}
The term supp(Ã) − supp(int_T(Ã)) determines the crisp locations of all fuzzy points of Ã that are not interior points. However, these locations do not necessarily all belong to the boundary of Ã since Ã has not been constrained so far. This is done in the following definition. A fuzzy set Ã is called a spatially regular fuzzy set if, and only if: (i) int_T(Ã) is a regular open fuzzy set, (ii) frontier_T(Ã) ⊆ frontier_T(cl_T(int_T(Ã))), and (iii) frontier_T(Ã) is a partition of n ∈ ℕ connected boundary parts (fuzzy sets). Not every set Ã is a spatially regular fuzzy set. Condition i ensures that the interior of Ã is without any geometric anomalies. The other two conditions arrange for a correct (partial) boundary if it exists. Condition ii works as follows: On the right side of ⊆, the set Ã' = cl_T(int_T(Ã)) is a regular closed fuzzy set; that is, the interior of Ã is complemented by its boundary without any geometric anomalies. Hence, the frontier_T operator applied to Ã' yields the boundary of Ã'. The condition now requires that the frontier of Ã is a subset of the frontier of Ã' and does not contain other fuzzy points. Condition iii states that the frontier of Ã has to consist of a finite number of connected pieces due to the finite component assumption explained before. Infinitely many boundary pieces cannot be represented in an implementation. From the definition of frontier_T, we can conclude that frontier_T(Ã) = ∅ if Ã is regular and open. We will base our definition of fuzzy regions on spatially regular fuzzy sets and define a regularization function reg_f that associates the interior of a fuzzy set Ã with its corresponding regular open fuzzy set and that restricts the partial boundary of Ã (if it exists at all) to a part of the boundary of the corresponding regular closed fuzzy set of Ã:

reg_f(Ã) := int_T(cl_T(Ã)) ∪ (frontier_T(Ã) ∩ frontier_T(cl_T(int_T(Ã))))
The different components of the regularization process work as follows: The interior operator
int_T eliminates dangling point and line features since their interior is empty. The closure operator cl_T removes cuts and punctures by appropriately adding points. Furthermore, the closure operator introduces a fuzzy boundary (similar to a crisp boundary in the ordinary point-set topological sense) separating the points of a closed set from its exterior. The operator frontier_T supports the restriction of the boundary. The following statements about set operations on regular open fuzzy sets are given informally and without proof. The intersection of two regular open fuzzy sets is regular and open. The union, difference, and complement of two regular open fuzzy sets are not necessarily regular and open since they can produce anomalies. Correspondingly, this also holds for spatially regular fuzzy sets. Hence, we introduce regularized set operations on spatially regular fuzzy sets that preserve regularity. Let Ã, B̃ be spatially regular fuzzy sets of a fuzzy topological space (ℝ², T), and let a ∸ b := a − b for a ≥ b, and a ∸ b := 0 otherwise (a, b ∈ ℝ⁺₀). Then the following apply.

(i) Ã ∪_r B̃ := reg_f(Ã ∪ B̃)
(ii) Ã ∩_r B̃ := reg_f(Ã ∩ B̃)
(iii) Ã −_r B̃ := reg_f({((x, y), μ_{Ã −_r B̃}(x, y)) | (x, y) ∈ supp(Ã) ∧ μ_{Ã −_r B̃}(x, y) = μ_Ã(x, y) ∸ μ_B̃(x, y)})
(iv) ¬_r Ã := reg_f(¬Ã)
Note that we have changed the meaning of difference (i.e., Ã −_r B̃ ≠ Ã ∩_r ¬B̃) since the right side of the inequality is not meaningful in the spatial context. Regular open fuzzy sets, spatially regular fuzzy sets, and regularized set operations express a natural formalization of the desired closure properties of fuzzy geometric set operations. In the crisp case, this is taken for granted but mostly not fulfilled by spatial type systems, geometric algorithms, spatial database systems, and GIS. Whereas the subspace RCCS of regular closed crisp sets together with the crisp regular set operations ⊕ (geometric union) and ⊗ (geometric
intersection) and the set-theoretic order relation ⊆ form a Boolean lattice, this is not the case for SRFS denoting the subspace of spatially regular fuzzy sets. Here, we obtain the (unproven but obvious) statement that SRFS together with the regularized set operations ∪_r and ∩_r and the fuzzy set-theoretic order relation ⊆ form a pseudo-complemented distributive lattice. This implies that (a) (SRFS, ⊆) is a partially ordered set (reflexivity, antisymmetry, transitivity), (b) every pair Ã, B̃ of elements of SRFS has a least upper bound Ã ∪_r B̃ and a greatest lower bound Ã ∩_r B̃, (c) (SRFS, ⊆) has a maximal element 1̃ := {((x, y), μ(x, y)) | (x, y) ∈ ℝ² ∧ μ(x, y) = 1} (identity of ∩_r) and a minimal element 0̃ := {((x, y), μ(x, y)) | (x, y) ∈ ℝ² ∧ μ(x, y) = 0} (identity of ∪_r), and (d) algebraic laws like idempotence, commutativity, associativity, absorption, and distributivity hold for ∪_r and ∩_r. (SRFS, ⊆) is not a complementary lattice. Although the algebraic laws of involution and dualization hold, this is not true for the laws of complementarity. If we take the standard fuzzy set operations presented in the section “Fuzzy Sets and Fuzzy Topology” as a basis, the law of excluded middle Ã ∪_r ¬_r Ã = 1̃ and the law of contradiction Ã ∩_r ¬_r Ã = 0̃ do not hold in general. This fact explains the term pseudo-complemented from above and is not a weakness of the model but only an indication of fuzziness.
Modeling Smooth Attribute Changes Continuous Membership Functions. The second observation is that, according to the application cases shown before, the mapping μ_Ã itself may not be arbitrary but must take into account the intrinsic smoothness of fuzzy regions. This property can be modeled by the well-known mathematical concept of continuity. Here, we employ the concept of a piecewise continuous function for modeling the smooth membership distribution in a single fuzzy face. A function is piecewise continuous if it is made of a finite number of continuous pieces. Hence, it has only a finite number of discontinuities (continuity gaps), and its left and right limits are defined at each discontinuity. The only possible kinds of discontinuities for a piecewise continuous function are removable and step discontinuities. A removable discontinuity represents a hole in the function graph. It can be repaired by filling in a single point. A step discontinuity (also called semicontinuity) is a location in the function graph where the graph steps or jumps from one connected piece of the graph to another. Formally, it is a discontinuity for which the limits from the left and right both exist but are not equal to each other. Defining Fuzzy Regions. We can now give the definition of a fuzzy spatial data type for fuzzy regions called fregion. It supports the structured view and is based on fuzzy faces. See Exhibit D. Since different connected components of a set are disjoint (except for single common boundary points perhaps), the fuzzy faces of a fuzzy region object are disjoint too. Giving an equivalent definition for the unstructured view turns out to be difficult since most properties of the structured view have to be repeated. We therefore omit it here.
Exhibit D.
fregion = {R̃ ∈ SRFS |
(i) R̃ = ∪_{i=1}^{n} R̃_i, n ∈ ℕ,
(ii) ∀ 1 ≤ i ≤ n : R̃_i is a connected component (fuzzy face),
(iii) μ_R̃ = ∪_{i=1}^{n} μ_R̃_i, and
(iv) ∀ 1 ≤ i ≤ n : μ_R̃_i is a piecewise continuous function}
Examples of Membership Functions for Fuzzy Regions In this subsection, we give some simple examples of membership functions that fulfill the properties required in the previous subsection. The determination of suitable membership functions is the difficulty in using the fuzzy set approach. Frequently, expert and empirical knowledge is necessary and used to design appropriate functions. We start with an example for a smooth fuzzy region. By taking a crisp region A with boundary BA as a reference object, we can construct a fuzzy region on the basis of the following distance-based membership function:
μ_Ã(x, y) = 1                           if (x, y) ∈ A
μ_Ã(x, y) = a^(−λ · d((x, y), B_A))     if (x, y) ∉ A

where a ∈ ℝ⁺ and a > 1, λ ∈ ℝ⁺ is a constant, and d((x, y), B_A) computes the distance between point (x, y) and boundary B_A in the following way: d((x, y), B_A) = min{dist((x, y), (x₀, y₀)) | (x₀, y₀) ∈ B_A}, where dist(p, q) is the usual Euclidean distance between two points p, q ∈ ℝ². Unfortunately, this membership function leads to an unbounded spatially regular fuzzy set (regular open fuzzy set), which is impractical for implementation. We can also give a similar definition of a membership function with bounded support:

μ_Ã(x, y) = 1                              if (x, y) ∈ A
μ_Ã(x, y) = 1 − (1/λ) · d((x, y), B_A)     if (x, y) ∉ A and d((x, y), B_A) ≤ λ
μ_Ã(x, y) = 0                              otherwise
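The bounded distance-based membership function can be sketched as follows. As a simplifying assumption made here (not in the text), the crisp reference region A is a disk of radius R around the origin, so the distance to the boundary B_A of an outside point is simply its distance to the origin minus R.

```python
import math

def membership(x, y, R=1.0, lam=2.0):
    """Bounded distance-based membership for a disk-shaped crisp region A.

    Returns 1 inside A, decreases linearly to 0 within distance lam of the
    boundary B_A, and is 0 farther away (bounded support)."""
    dist = math.hypot(x, y)
    if dist <= R:            # (x, y) in A
        return 1.0
    d = dist - R             # distance to the boundary circle B_A
    if d <= lam:             # inside the fuzzy transition zone
        return 1.0 - d / lam
    return 0.0
```

As the distance of an outside point to B_A grows from 0 to lam, its membership degree falls linearly from 1 to 0, matching the behavior described in the text.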
In the same way as the distance from a point outside of A to BA increases to λ, the degree of membership of this point to A decreases to zero. Usery (1996) also presents membership functions for smooth fuzzy regions. The applications considered are air pollution defined as a fuzzy region with membership values based on the distance
from a city center, and a hill with elevation as the controlling value for the membership function. Lagacherie et al. (1996) model the transition of two smooth regions for soil units with symmetric membership functions. Burrough (1996) uses an a priori imposed membership function with which individual spatial objects can be assigned membership grades. This is known as the semantic import approach or model. A method to design a membership function for a finite-valued fuzzy region with n possible membership values (truth values) is to code the n values by rational numbers in the unit interval [0, 1]. For that purpose, the unit interval is evenly divided into n − 1 subintervals, and their end points are taken as membership values. We obtain the set:

T_n = { i/(n − 1) | i ∈ ℕ₀, 0 ≤ i ≤ n − 1 }
of truth values. This is an example of a fuzzy plateau region since we obtain n regions of equal membership each, that is, a plateau. Assuming that we intend to model air pollution caused by a power station located at point p ∈ ℝ², we can define the following (simplified) membership function for n = 5 degrees of truth representing, for instance, areas of extreme, high, average, low, and no pollution (a, b, c, d ∈ ℝ⁺ denote distances):

μ_Ã(x, y) = 1      if dist(p, (x, y)) ≤ a
μ_Ã(x, y) = 3/4    if a ≤ dist(p, (x, y)) ≤ b
μ_Ã(x, y) = 1/2    if b ≤ dist(p, (x, y)) ≤ c
μ_Ã(x, y) = 1/4    if c ≤ dist(p, (x, y)) ≤ d
μ_Ã(x, y) = 0      if d ≤ dist(p, (x, y))
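The plateau membership function for n = 5 is a simple step function of the distance to the power station. The sketch below assumes illustrative default thresholds a < b < c < d; only the truth values T5 = {0, 1/4, 1/2, 3/4, 1} come from the text.

```python
def plateau_membership(dist_to_p, a=1.0, b=2.0, c=3.0, d=4.0):
    """Finite-valued membership for n = 5 truth degrees, modeling zones of
    extreme, high, average, low, and no pollution around a power station p.
    dist_to_p is the Euclidean distance of (x, y) to p; the thresholds
    a < b < c < d are illustrative defaults, not values from the text."""
    if dist_to_p <= a:
        return 1.0
    if dist_to_p <= b:
        return 3 / 4
    if dist_to_p <= c:
        return 1 / 2
    if dist_to_p <= d:
        return 1 / 4
    return 0.0
```

Each of the five distance bands yields one plateau of constant membership, which is exactly the structure of a fuzzy plateau region.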
FUZZY TOPOLOGICAL PREDICATES Topological relationships characterize the relative locations of two spatial objects to each other, for example, whether they overlap, meet, or are disjoint. In spatial databases and GIS, they are important for formulating spatial selections and spatial joins;
Figure 4. The eight topological relationships between two simple regions A and B: disjoint (a), meet (b), overlap (c), equal (d), contains (e), inside (f), covers (g), and coveredBy (h) A
they are usually used in the WHERE clause of an SQL statement. In this section, we present a concept of topological predicates for fuzzy spatial data types on the basis of available topological predicates for crisp spatial data types. The concept is generic and applicable to each pair of fuzzy spatial data types. Furthermore, we assume spatial objects with the most general structure including multiple components and holes in regions; they are also called complex spatial objects in contrast to simple spatial objects including only single points, continuous lines, and simple regions that are topologically equivalent to a disk. First, we form the basis and introduce crisp topological predicates. Then, we show how they can be leveraged for a formal definition of fuzzy topological predicates. Finally, we demonstrate how fuzzy topological predicates can be deployed for querying in a database system.
Topological Predicates on Complex Crisp Spatial Objects We introduce crisp topological predicates with an example. Consider the map of the 50 states of the USA. Each state has, besides its thematic attributes like name and population, also a geometry that describes its territory. It can have holes (like enclaves) and consist of several components (like mainland and islands). Cities can be modeled as points; that is, we are here interested in their location only and not so much in their extent. In a relational database management system (DBMS), we can declare them in the two relation schemas:
cities(cname: string, location: point, . . .)
states(sname: string, population: integer, territory: region, . . .)
Here, point and region are crisp spatial data types. They are used in the same way as standard data types like string and integer. A query could ask for all pairs of city names and state names where a city is located in a state. This can then be formulated as a spatial join query:
select cname, sname
from cities, states
where location inside territory
The term inside is a topological predicate testing whether a point is located inside a region and yielding a Boolean value as a result. All existing topological predicates can be used instead of inside. Interdisciplinary research on crisp topological relationships has led to a large number of publications in spatial databases, GIS, linguistics, cognitive science, and the geosciences. Two main questions are in the focus of interest. The first issue relates to the design of appropriate models for crisp topological relationships such that the relationships are expressive, mutually exclusive, and hence unique, and cover all topological configurations between two spatial objects. The second issue refers to an efficient implementation of the topological predicates, which requires geometric data structures and algorithms from computational geometry (de Berg, van Kreveld, Overmars, & Schwarzkopf, 2000). A detailed discussion of both issues is far beyond the scope of this chapter. Our definitions of fuzzy topological relationships are based on the so-called nine-intersection model (Egenhofer, 1989) from which a complete collection
Table 1. Numbers of topological predicates between two simple spatial objects (a) and between two complex spatial objects (b)

(a)
                 simple point   simple line   simple region
simple point          2              3              3
simple line           3             33             19
simple region         3             19              8

(b)
                 complex point   complex line   complex region
complex point          5              14               7
complex line          14              82              43
complex region         7              43              33
of mutually exclusive topological relationships can be derived for each combination of the crisp spatial types point, line, and region. The model is based on the nine possible intersections of boundary, interior, and exterior (Gaal, 1964) of a spatial object with the corresponding components of another object (3 · 3 = 9 combinations). Each intersection is tested for the topologically invariant criterion of nonemptiness. 2⁹ = 512 different spatial configurations are possible, of which only a limited subset makes sense depending on the combination of spatial objects just considered. For example, for two simple regions (no multiple components, no holes), eight meaningful configurations have been identified that lead to the eight topological predicates disjoint, meet, overlap, equal, inside, contains, covers, and coveredBy illustrated in Figure 4. Egenhofer (1986) presents a derivation of the topological relationships between two simple lines, two simple regions, and a simple line and a simple region. Schneider and Behr (2006) generalize this work to all nine combinations (including three symmetric combinations) of the complex spatial data types point, line, and region, and give a thorough, systematic, and complete specification of topological relationships for all types of combinations together with a prototypical visualization of each predicate. Table 1b shows the increase of topological predicates for complex objects compared to simple objects (Table 1a). The collection of topological predicates is proven to be mutually exclusive and complete for each spatial data type combination. The large number of topological predicates in Table 1b, which can be used individually but are difficult to handle,
has led to the concept of clustered topological predicates (Schneider & Behr, 2006). The idea is to merge topological predicates with similar features to a single clustered predicate. Different predicate clusters are possible. Schneider and Behr (2006) propose a cluster that results in the set Tc = {disjoint, meet, overlap, equal, inside, contains, covers, coveredBy} of clustered topological predicates for all pairs of complex spatial data types.
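The nine-intersection test itself can be illustrated with a toy discretization. In the sketch below, a region's interior, boundary, and exterior are finite sets of grid cells and each of the 3 × 3 intersections is tested for nonemptiness; the grid representation and function names are simplifications invented here, not the model's actual point-set machinery.

```python
def grid_region(x0, y0, x1, y1, size=10):
    """Axis-aligned rectangle on a size x size grid, split into its
    interior, boundary, and exterior cell sets."""
    cells = {(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)}
    boundary = {(x, y) for (x, y) in cells if x in (x0, x1) or y in (y0, y1)}
    interior = cells - boundary
    exterior = {(x, y) for x in range(size) for y in range(size)} - cells
    return interior, boundary, exterior

def nine_intersection_matrix(a, b):
    """3 x 3 Boolean matrix: nonemptiness of the intersections between the
    parts (interior, boundary, exterior) of objects a and b."""
    return [[bool(pa & pb) for pb in b] for pa in a]
```

For two disjoint rectangles, only the intersections involving an exterior are nonempty, which is the characteristic nine-intersection pattern of the disjoint predicate.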
Topological Predicates on Complex Fuzzy Spatial Objects The concept of topological predicates on complex fuzzy spatial objects that we propose now amounts to a counterpart of T_c, that is, to a collection T_f = {disjoint_f, meet_f, overlap_f, equal_f, inside_f, contains_f, covers_f, coveredBy_f} of fuzzy topological predicates. It is generic in the sense that it is applicable to all combinations of the fuzzy spatial data types fpoint, fline, and fregion. It is able to answer queries like the following.
• Do regions A and B overlap a little bit?
• Determine all pairs of regions that nearly completely overlap.
• Does region A somewhat contain region B?
• Which regions lie quite inside B?
In a similar way as we can generalize the characteristic function χ_A : X → {0, 1} to the membership function μ_Ã : X → [0, 1], we can generalize a (binary) predicate p_c : X × Y → {0, 1} to a (binary) fuzzy predicate p_f : X̃ × Ỹ → [0, 1]. Hence, the value of a fuzzy predicate can be interpreted as the degree to which the predicate holds for its operand objects. In our case of topological predicates, X, Y ∈ {point, line, region}, {0, 1} = bool, p_c ∈ T_c, and X̃, Ỹ ∈ {fpoint, fline, fregion} hold. For the set [0, 1], we introduce a new type fbool for fuzzy Booleans. For the definition of fuzzy topological predicates, we describe a fuzzy spatial object Ã ∈ γ, γ ∈ {fpoint, fline, fregion}, in terms of nested α-level sets (α-cuts; see the section “Fuzzy Sets and Fuzzy Topology”). They represent crisp spatial objects A^≥α for an α ∈ [0, 1] and are defined as:

A^≥α = reg_c({(x, y) ∈ ℝ² | μ_Ã(x, y) ≥ α})

Without going into detail, the function reg_c is a regularization function that adjusts geometric anomalies for all three crisp spatial data types. We call A^≥α an α-level spatial object. Clearly, A^≥α is a crisp spatial object that is defined by all points with membership value greater than or equal to α. The core of Ã is then equal to A^≥1.0. A property of the α-level spatial objects of a fuzzy spatial object is that they are nested; that is, if we select membership values 1 = α₁ > α₂ > . . . > α_n > α_{n+1} = 0 for some n ∈ ℕ, then we obtain: A^≥α₁ ⊂ A^≥α₂ ⊂ . . . ⊂ A^≥α_n ⊂ A^≥α_{n+1}.

For showing that the proper inclusion relationship holds between the α-level spatial objects, we can distinguish two cases. First, if |Λ_Ã| = n + 1, then for any p ∈ A^≥α_i − A^≥α_{i−1} with i ∈ {2, ..., n + 1}, μ_Ã(p) = α_i holds. We obtain fuzzy plateau objects; that is, each spatial object is annotated with a single membership value. For the case that |Λ_Ã| > n + 1, we get μ_Ã(p) ∈ [α_i, α_{i−1}), which leads to interval-based spatial objects; that is, each spatial object is annotated with an interval of membership values. As a result, we obtain the following. A fuzzy spatial object can be represented as a finite set of n α-level spatial objects; that is:

Ã = {A^≥α_i | 1 ≤ i ≤ n, n ≤ |Λ_Ã| ∈ ℕ} with α_i > α_{i+1} ⇒ A^≥α_i ⊂ A^≥α_{i+1} for 1 ≤ i ≤ n − 1.

From an implementation point of view, one of the advantages of using finite collections of α-level sets to describe fuzzy spatial objects is that available geometric data structures and geometric algorithms known from computational geometry (de Berg et al., 2000) can be applied. The open question now is how to compute the topological relationships of two collections of α-level spatial objects, each collection describing a fuzzy spatial object. We use the concept of basic probability assignment (Dubois & Jaulent, 1987) for this purpose. A basic probability assignment m(A^≥α_i) can be associated with each α-level region A^≥α_i and can be interpreted as the probability that A^≥α_i is the true representative of Ã. It is defined as:
Exhibit E.
π_f(Ã, B̃) = Σ_{i=1}^{n} Σ_{j=1}^{n} (α_i − α_{i+1}) · (α_j − α_{j+1}) · π_c(A^≥α_i, B^≥α_j)
≤ Σ_{i=1}^{n} Σ_{j=1}^{n} (α_i − α_{i+1}) · (α_j − α_{j+1})     (since π_c(A^≥α_i, B^≥α_j) ≤ 1)
= (α₁ − α₂)·((α₁ − α₂) + … + (α_n − α_{n+1})) + … + (α_n − α_{n+1})·((α₁ − α₂) + … + (α_n − α_{n+1}))
= (α₁ − α₂) + … + (α_n − α_{n+1})     (since Σ_{i=1}^{n} (α_i − α_{i+1}) = α₁ − α_{n+1} = 1)
= 1
m(A^≥α_i) = α_i − α_{i+1}, for 1 ≤ i ≤ n for some n ∈ ℕ with α₁ = 1 and α_{n+1} = 0. That is, m is built from the differences of successive α_i's. It is easy to see that the telescoping sum holds:

Σ_{i=1}^{n} m(A^≥α_i) = α₁ − α_{n+1} = 1 − 0 = 1
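The basic probability assignments and their telescoping sum can be sketched directly; the function name and the example level set are illustrative choices, not values from the text.

```python
def basic_probability_assignments(alphas):
    """Masses m(A>=alpha_i) = alpha_i - alpha_{i+1} for a descending level
    list alphas = [alpha_1, ..., alpha_{n+1}] with alpha_1 = 1 and
    alpha_{n+1} = 0; returns the n mass values."""
    assert alphas[0] == 1 and alphas[-1] == 0
    return [alphas[i] - alphas[i + 1] for i in range(len(alphas) - 1)]

# Example level set 1 > 0.7 > 0.4 > 0: three alpha-level objects, whose
# masses telescope to alpha_1 - alpha_{n+1} = 1.
masses = basic_probability_assignments([1.0, 0.7, 0.4, 0.0])
```

However the level set is chosen, the masses always sum to 1, so they behave like a discrete probability distribution over the α-level objects.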
Let πf(Ã, B̃) be the value that represents a (binary) property πf between two fuzzy spatial objects à and B̃ of equal or different data types. For reasons of simplicity, we assume that ΛÃ = ΛB̃ =: Λ. Otherwise, it is not difficult to synchronize ΛÃ and ΛB̃ by forming their union and by reordering and renumbering all levels. Based on the work in Dubois and Jaulent (1987), property πf of à and B̃ can be determined as the summation of weighted predicates by:

$$\pi_f(\tilde{A}, \tilde{B}) = \sum_{i=1}^{n} \sum_{j=1}^{n} m(A_{\geq \alpha_i}) \cdot m(B_{\geq \alpha_j}) \cdot \pi_c(A_{\geq \alpha_i}, B_{\geq \alpha_j})$$
where πc(A≥αi, B≥αj) yields the value of the corresponding property πc for two crisp α-level spatial objects A≥αi and B≥αj. This formula is equivalent to:

$$\pi_f(\tilde{A}, \tilde{B}) = \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_{i+1}) \cdot (\alpha_j - \alpha_{j+1}) \cdot \pi_c(A_{\geq \alpha_i}, B_{\geq \alpha_j})$$
If πf is a topological predicate of Tf = {disjointf, meetf, overlapf, equalf, insidef, containsf, coversf, coveredByf} between two fuzzy spatial objects, we can compute the degree of the corresponding relationship with the aid of the pertaining crisp topological predicate πc ∈ Tc. The value of πc(A≥αi, B≥αj) is either 1 (true) or 0 (false). Once this value has been determined for all combinations of α-level spatial objects from à and B̃, the aggregated value of the topological predicate πf(Ã, B̃) can be computed as shown above. The more fine-grained the level set Λ for the fuzzy spatial objects à and B̃ is, the more precisely the fuzziness of topological predicates can be determined.
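A minimal Python sketch of this aggregation follows. It assumes point sets as stand-ins for the crisp α-level objects and nonempty intersection as a stand-in for the crisp overlap predicate; the chapter's actual predicates are defined on full spatial types, so this only illustrates the weighting scheme.

```python
# Sketch of the weighted aggregation of a crisp topological predicate over
# all combinations of alpha-level objects. The crisp predicate pi_c is a
# stand-in here: it takes two crisp point sets and returns True/False.

def fuzzy_predicate(levels, a_cuts, b_cuts, crisp_pred):
    """levels: common descending level set Lambda with levels[0] == 1.0.
    a_cuts[i], b_cuts[j]: the crisp alpha-level objects of A and B."""
    alphas = list(levels) + [0.0]
    n = len(levels)
    total = 0.0
    for i in range(n):
        for j in range(n):
            m_a = alphas[i] - alphas[i + 1]     # m(A_{>=alpha_i})
            m_b = alphas[j] - alphas[j + 1]     # m(B_{>=alpha_j})
            total += m_a * m_b * (1 if crisp_pred(a_cuts[i], b_cuts[j]) else 0)
    return total

# Toy example with 'overlap' modeled as nonempty intersection of point sets.
overlaps = lambda x, y: bool(x & y)
levels = [1.0, 0.5]
A = [{1, 2}, {1, 2, 3}]        # A_{>=1.0} inside A_{>=0.5}
B = [{4}, {3, 4}]              # B_{>=1.0} inside B_{>=0.5}

# Only the two 0.5-level objects intersect, contributing 0.5 * 0.5 = 0.25.
print(fuzzy_predicate(levels, A, B, overlaps))   # 0.25
```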
It remains to show that 0 ≤ πf(Ã, B̃) ≤ 1 holds; that is, πf is really a fuzzy predicate. Since αi − αi+1 > 0 for all 1 ≤ i ≤ n, and since πc(A≥αi, B≥αj) ≥ 0 for all 1 ≤ i, j ≤ n, πf(Ã, B̃) ≥ 0 holds. We can show the other inequality by determining an upper bound for πf(Ã, B̃); see Exhibit E. Hence, πf(Ã, B̃) ≤ 1 holds. This generic predicate definition reveals its quantitative character. If the predicate πc(A≥αi, B≥αj) is never fulfilled, the predicate πf(Ã, B̃) yields false. The more α-level spatial objects of à and B̃ fulfill the predicate πc(A≥αi, B≥αj), the more the validity of the predicate πf increases. The optimum is reached if all topological predicates are satisfied.
DATABASE INTEGRATION AND QUERYING WITH FUZZY TOPOLOGICAL PREDICATES

In this section, we demonstrate how fuzzy spatial data types can be used in a relational database system and how fuzzy topological predicates can be integrated into an SQL-like spatial query language. We have shown before the integration of crisp spatial data types into a relational database schema when we discussed the relation schemas states and cities. The integration of fuzzy spatial data types takes place in the same way. For example, assume that we have a relation pollution that stores, among other things, the blurred geometry of polluted zones as fuzzy regions, and a relation landuse that keeps information about the use of land areas and stores their vague spatial extent as fuzzy regions. Finally, we assume that we are given the living spaces of different animal species in a relation animals and that their vague extent is also represented as a fuzzy region. We obtain relation schemas like the following:

pollution(pollid: integer, pollzone: fregion, . . .)
landuse(lid: integer, name: string, use: string, area: fregion, . . .)
animals(aid: integer, name: string, territory: fregion, . . .)
Figure 5. Membership functions for the fuzzy modifiers not, a little bit, somewhat, slightly, quite, mostly, nearly completely, and completely over membership values 0.0 to 1.0

We can make the following observations. Complex data types like point, line, region, fpoint,
fline, and fregion are used in the same manner as attribute data types and standard data types like integer, bool, and date. The main difference is that the former data types have an internal complexity that is hidden from the user and only accessible by operations (methods) and predicates. The data type representation is not scattered over a collection of relation tables, but concentrated in the attribute value representing the complex object. This has the main advantage that the implementation of a complex data type can be exchanged and improved without any consequences for the query language and application programs. This approach7 amounts to the concept of abstract data types in databases (Stonebraker, Rubenstein, & Guttman, 1983).

In the particular fuzzy context, we can make an additional observation. The aspect of fuzziness is neither explicitly modeled at the tuple level nor at the attribute level. It is represented and hidden inside the representation of fuzzy spatial objects; that is, only the fuzzy spatial data types know about and are able to handle the fuzzy aspects. The advantages of this concept are that object-relational database management systems can be used, which integrate fuzzy spatial data types by the well-known UDT (user-defined type) mechanism, and that the standard relational database theory is still valid and not subject to changes like the corresponding theory for fuzzy databases that deviates from the standard theory.

The fact that the membership degree yielded by a fuzzy topological predicate is a computationally determined quantification between 0 and 1, that is, a fuzzy Boolean, impedes a direct integration of fuzzy predicates into SQL, which is the standard query language of relational databases. First, it is
not very comfortable and user-friendly to use such a numeric value in a query. Second, spatial selections and spatial joins expect crisp predicates with Boolean values as filter conditions and are not able to cope with fuzzy predicates. As a solution, we propose to embed adequate qualitative linguistic descriptions of nuances of topological relationships as appropriate interpretations of the membership values into a spatial query language. For instance, depending on the membership value yielded by the predicate insidef, we could distinguish between not inside, a little bit inside, somewhat inside, slightly inside, quite inside, mostly inside, nearly completely inside, and completely inside. These fuzzy linguistic terms can then be incorporated into spatial queries together with the fuzzy predicates they modify. We call these terms fuzzy modifiers since their meaning is that of intensifying or relaxing the constraint expressed by the primary term to which they are applied. For instance, somewhat inside is a relaxation of the constraint inside since we can expect that it is better satisfied than inside even if some portions are outside. It is conceivable that a fuzzy modifier is either predefined and anchored in the query language, or user defined.

We know that a fuzzy topological predicate πf is defined as πf : X × Y → [0, 1], where X and Y are fuzzy spatial data types. The idea now is to represent each fuzzy modifier γ ∈ Γ = {not, a little bit, somewhat, slightly, quite, mostly, nearly completely, completely} by an appropriate fuzzy set with a membership function µγ : [0, 1] → [0, 1]. Let α, β ∈ {fpoint, fline, fregion}, Ã ∈ α, and B̃ ∈ β. Let γπf be a quantified fuzzy predicate (like somewhat inside with γ = somewhat and πf = insidef). Then we can define the following:
γπf(Ã, B̃) = true :⇔ (µγ ∘ πf)(Ã, B̃) = 1

That is, only for those values of πf(Ã, B̃) for which µγ yields 1, the predicate γπf is true. A membership function that fulfills this quite strict condition is, for instance, the crisp partition of [0, 1] into |Γ| disjoint or adjacent intervals completely covering [0, 1], and the assignment of each interval to a fuzzy modifier. If an interval [a, b] is assigned to a fuzzy modifier γ, the intended meaning is that µγ(πf(Ã, B̃)) = 1 if a ≤ πf(Ã, B̃) ≤ b, and 0 otherwise. For example, we could select the intervals [0.0, 0.02] for not, [0.02, 0.05] for a little bit, [0.05, 0.2] for somewhat, [0.2, 0.5] for slightly, [0.5, 0.8] for quite, [0.8, 0.95] for mostly, [0.95, 0.98] for nearly completely, and [0.98, 1.00] for completely. Alternative membership functions are shown by the fuzzy sets in Figure 5. While we can always find a fitting fuzzy modifier for the partition due to the complete coverage of the interval [0, 1], this is not necessarily the case here. Each fuzzy modifier is associated with a fuzzy number having a trapezoidal-shaped membership function. The transition between two consecutive fuzzy modifiers is smooth and here modeled by linear functions. Within a fuzzy transition area, µγ yields a value less than 1, which makes the predicate γπf false. Examples in Figure 5 can be found at 0.2, 0.5, or 0.8. Each fuzzy number associated with a fuzzy modifier can be represented as a quadruple (a, b, c, d) where the membership function starts at (a, 0),
linearly increases up to (b, 1), remains constant up to (c, 1), and linearly decreases up to (d, 0). Figure 5 assigns (0.0, 0.0, 0.0, 0.02) to not, (0.01, 0.02, 0.03, 0.08) to a little bit, (0.03, 0.08, 0.15, 0.25) to somewhat, (0.15, 0.25, 0.45, 0.55) to slightly, (0.45, 0.55, 0.75, 0.85) to quite, (0.75, 0.85, 0.92, 0.96) to mostly, (0.92, 0.96, 0.97, 0.99) to nearly completely, and (0.97, 1.0, 1.0, 1.0) to completely. So far, the predicate γπf is only true if µγ yields 1. We can relax this strict condition by defining the following:

γπf(Ã, B̃) = true :⇔ (µγ ∘ πf)(Ã, B̃) > 0
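The trapezoidal membership functions can be evaluated directly from these quadruples. The following Python sketch uses the Figure 5 quadruples; the function names and the dictionary representation are illustrative, not part of the chapter's formal model.

```python
# Sketch: trapezoidal membership functions for fuzzy modifiers, built from
# the quadruple (a, b, c, d) representation described in the text.

def trapezoid(a, b, c, d):
    def mu(x):
        if x < a or x > d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:                                       # rising edge
            return (x - a) / (b - a) if b > a else 1.0
        return (d - x) / (d - c) if d > c else 1.0      # falling edge
    return mu

modifiers = {
    "not":               trapezoid(0.0, 0.0, 0.0, 0.02),
    "a little bit":      trapezoid(0.01, 0.02, 0.03, 0.08),
    "somewhat":          trapezoid(0.03, 0.08, 0.15, 0.25),
    "slightly":          trapezoid(0.15, 0.25, 0.45, 0.55),
    "quite":             trapezoid(0.45, 0.55, 0.75, 0.85),
    "mostly":            trapezoid(0.75, 0.85, 0.92, 0.96),
    "nearly completely": trapezoid(0.92, 0.96, 0.97, 0.99),
    "completely":        trapezoid(0.97, 1.0, 1.0, 1.0),
}

# The strict reading: 'gamma pi_f' holds only where mu_gamma yields 1.
print([g for g, mu in modifiers.items() if mu(0.6) == 1.0])   # ['quite']

# In a transition zone (e.g. 0.5) no modifier reaches 1 under this reading.
print([g for g, mu in modifiers.items() if mu(0.5) == 1.0])   # []
```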
In a crisp spatial database system, this gives us the chance also to take the transition zones into account and to let them make the predicate γπf true. When evaluating a fuzzy spatial selection or join in a fuzzy spatial database system, we can even set up a weighted ranking of database objects satisfying the predicate γπf at all and being ordered by descending membership degree 1 ≥ µγ(x) > 0. A special, optional fuzzy modifier, denoted by at all, represents the existential modifier and checks whether a predicate πf can be fulfilled to any extent. An example query is "Do regions A and B overlap at all?" With this modifier we can determine whether µγ(x) > 0 for some value x ∈ [0, 1]. Assuming an available implementation of fuzzy spatial data types and fuzzy topological predicates, the following few example queries demonstrate how fuzzy spatial data types and quantified fuzzy topological predicates can be integrated into an SQL-like spatial query language. It is not our objective to give a full description of a specific language. What we need first is a mechanism to declare user-defined fuzzy modifiers and to activate predefined or user-defined fuzzy modifiers. This mechanism should allow us to specify trapezoidal-shaped and triangular-shaped membership functions as well as crisp partitions. In general, this means defining a classification, which could be expressed as shown in Exhibit F. Such a classification could then be activated by:
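Under the relaxed reading, a weighted ranking of qualifying objects by descending membership degree can be sketched as follows; the modifier quite and the candidate values are hypothetical stand-ins for πf results of a fuzzy spatial join.

```python
# Sketch: evaluating a quantified fuzzy predicate under the relaxed reading
# (true iff mu_gamma(pi_f(A, B)) > 0) and ranking qualifying objects by
# descending membership degree. mu_quite is the trapezoid (0.45, 0.55,
# 0.75, 0.85) from Figure 5; the zone names are made up for illustration.

def mu_quite(x, a=0.45, b=0.55, c=0.75, d=0.85):
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Hypothetical pi_f values produced by a fuzzy spatial join.
candidates = {"zoneA": 0.30, "zoneB": 0.50, "zoneC": 0.70, "zoneD": 0.90}

ranked = sorted(
    ((name, mu_quite(v)) for name, v in candidates.items() if mu_quite(v) > 0),
    key=lambda t: t[1],
    reverse=True,
)
# zoneC lies fully in the 'quite' plateau; zoneB only in a transition zone.
print([name for name, _ in ranked])   # ['zoneC', 'zoneB']
```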
set classification fq.
Assuming our relations pollution, landuse, and animals, we now pose some example queries. A query could be to find out all inhabited areas where people are rather endangered by pollution. This can be formulated in an SQL-like style as follows (we here use infix notation for the predicates).

select landuse.name
from   pollution, landuse
where  landuse.use = "inhabited" and
       pollution.pollzone quite overlaps landuse.area

This query and the following two represent fuzzy spatial joins. Another query could ask for those inhabited areas lying almost entirely in polluted areas.

select landuse.name
from   pollution, landuse
where  landuse.use = "inhabited" and
       landuse.area nearly completely inside pollution.pollzone

For animals, we can search for pairs of species that share a common living space to some degree.

select A.name, B.name
from   animals A, animals B
where  A.territory at all overlaps B.territory

As a last example, we can ask for animals that usually live on land and seldom enter the water, or for species that never leave their land area (the built-in aggregation function sum is here applied to a set of fuzzy regions and aggregates this set by repeated application of fuzzy geometric union).

select name
from   animals
where  (select sum(area) from landuse)
       nearly completely covers or completely covers territory

CONCLUSION AND FUTURE WORK

In this chapter, we have introduced a fuzzy spatial algebra (type system) that provides spatial data types for fuzzy points, fuzzy lines, and fuzzy regions for use in databases, and that includes fuzzy spatial operations and fuzzy topological predicates operating on these data types. Structure and semantics of types, operations, and predicates are formally defined on the basis of fuzzy set theory and fuzzy point set topology in an abstract model. The characteristic feature of the design is the modeling of smoothness and continuity, which is inherent to the objects themselves and to the transitions between different fuzzy objects. Assuming an implementation of the introduced fuzzy approach, we have demonstrated how fuzzy spatial data types can be employed as attribute data types in relation schemas on the basis of the abstract data type concept, and how fuzzy topological predicates can be leveraged in queries based on an extension of SQL.

A first research issue of future work refers to the design of additional fuzzy spatial operations and predicates like directional relationships in order to complete the fuzzy spatial algebra. A second issue relates to the implementation of the whole fuzzy spatial algebra. Appropriate data structures for the fuzzy spatial data types have to be designed, and algorithms for the fuzzy spatial operations and predicates on these data structures have to be devised. The design of fuzzy data structures and algorithms belongs to the development of a discrete model. The abstract model can be seen as a specification of a discrete model. A discrete model aims at finding finite representations for the data types of the abstract model as well as algorithms operating on these finite representations for the operations and predicates of the abstract model. Another interesting research topic refers to the development of fuzzy spatial index structures in the context of databases. While index structures for crisp spatial data have been widely explored, there is not much research on index structures that include the aspect of spatial vagueness. Their design could lead to a more efficient execution of fuzzy spatial joins and selections.
REFERENCES

Altman, D. (1994). Fuzzy set theoretic approaches for handling imprecision in spatial analysis. International Journal of Geographical Information Systems, 8(3), 271-289.

Beaubouef, T., Ladner, R., & Petry, F. (2004). Rough set spatial data modeling for data mining. International Journal of Geographical Information Science, 19, 567-584.

Blakemore, M. (1983). Generalization and error in spatial databases. Cartographica, 21(2/3), 131-139.

Bogàrdi, I., Bárdossy, A., & Duckstein, L. (1990). Risk management for groundwater contamination: Fuzzy set approach. In R. Khanpilvardi & T. Gooch (Eds.), Optimizing the resources for water management (pp. 442-448). ASCE.

Brown, D. G. (1998). Mapping historical forest types in Baraga County Michigan, USA as fuzzy sets. Plant Ecology, 134, 97-111.

Buckley, J. J., & Eslami, E. (2002). Advances in soft computing: An introduction to fuzzy logic and fuzzy sets. Physica-Verlag.

Burrough, P. A. (1996). Natural objects with indeterminate boundaries. In P. A. Burrough & A. U. Frank (Eds.), Geographic objects with indeterminate boundaries (pp. 3-28). Taylor & Francis.

Burrough, P. A., & Frank, A. U. (Eds.). (1996). Geographic objects with indeterminate boundaries (GISDATA Series, Vol. 2). Taylor & Francis.

Burrough, P. A., van Gaans, P. F. M., & Macmillan, R. A. (2000). High-resolution landform classification using fuzzy k-means. Fuzzy Sets and Systems, 113, 37-52.

Chang, C. L. (1968). Fuzzy topological spaces. Journal of Mathematical Analysis and Applications, 24, 182-190.
Cheng, T., Molenaar, M., & Lin, H. (2001). Formalizing fuzzy objects from uncertain classification results. International Journal of Geographical Information Science, 15, 27-42.

Clementini, E., & Di Felice, P. (1996a). An algebraic model for spatial objects with indeterminate boundaries. In P. A. Burrough & A. U. Frank (Eds.), Geographic objects with indeterminate boundaries (pp. 153-169). Taylor & Francis.

Clementini, E., & Di Felice, P. (1996b). A model for representing topological relationships between complex geometric features in spatial databases. Information Systems, 90(1-4), 121-136.

Cohn, A. G., & Gotts, N. M. (1996). The "egg-yolk" representation of regions with indeterminate boundaries. In P. A. Burrough & A. U. Frank (Eds.), Geographic objects with indeterminate boundaries (pp. 171-187). Taylor & Francis.

de Berg, M., van Kreveld, M., Overmars, M., & Schwarzkopf, O. (2000). Computational geometry: Algorithms and applications. Springer-Verlag.

De Gruijter, J., Walvoort, D., & Vangaans, P. (1997). Continuous soil maps: A fuzzy set approach to bridge the gap between aggregation levels of process and distribution models. Geoderma, 77, 169-195.

Dilo, A., de By, R. A., & Stein, A. (2007). A system of types and operators for handling vague spatial objects. International Journal of Geographical Information Science, 21(4), 397-426.

Dubois, D., & Jaulent, M.-C. (1987). A general approach to parameter evaluation in fuzzy digital pictures. Pattern Recognition Letters, 251-259.

Dutta, S. (1989). Qualitative spatial reasoning: A semi-quantitative approach using fuzzy logic. In First International Symposium on the Design and Implementation of Large Spatial Databases (LNCS 409, pp. 345-364). Springer-Verlag.

Dutta, S. (1991). Topological constraints: A representational framework for approximate spatial and temporal reasoning. In Second International Symposium on the Design and Implementation of
Large Spatial Databases (LNCS 525, pp. 161-180). Springer-Verlag.

Edwards, G. (1994). Characterizing and maintaining polygons with fuzzy boundaries in GIS. In Sixth International Symposium on Spatial Data Handling (pp. 223-239).

Egenhofer, M. J. (1989). A formal definition of binary topological relationships. In Third International Conference on Foundations of Data Organization and Algorithms (LNCS 367, pp. 457-472). Springer-Verlag.

Erwig, M., & Schneider, M. (1997). Vague regions. In Fifth International Symposium on Advances in Spatial Databases (pp. 298-320). Springer-Verlag.

Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Idea Group Publishing.

Hendricks Franssen, H., van Eijnsbergen, A., & Stein, A. (1997). Use of spatial prediction techniques and fuzzy classification for mapping soil pollutants. Geoderma, 77, 243-262.

Kollias, V. J., & Voliotis, A. (1991). Fuzzy reasoning in the development of geographical information systems. International Journal of Geographical Information Systems, 5(2), 209-223.

Lagacherie, P., Andrieux, P., & Bouzigues, R. (1996). Fuzziness and uncertainty of soil boundaries: From reality to coding in GIS. In P. A. Burrough & A. U. Frank (Eds.), Geographic objects with indeterminate boundaries (pp. 275-286). Taylor & Francis.

Liu, Y.-M., & Luo, M.-K. (1997). Fuzzy topology (Advances in fuzzy systems: Applications and theory, Vol. 9). World Scientific.

Ma, Z. (2005). Fuzzy database modeling with XML. Springer-Verlag.

Parsons, S. (1996). Current approaches to handling imperfect information in data and knowledge bases. IEEE Transactions on Knowledge and Data Engineering, 8(3), 353-372.
Pauly, A., & Schneider, M. (2004). Vague spatial data types, set operations, and predicates. In Eighth East-European Conference on Advances in Databases and Information Systems (pp. 379-392).

Pauly, A., & Schneider, M. (2005). Topological predicates between vague spatial objects. In Ninth International Symposium on Spatial and Temporal Databases (pp. 418-432).

Pauly, A., & Schneider, M. (2006). Topological reasoning for identifying a complete set of topological predicates between vague spatial objects. In 19th International FLAIRS Conference (pp. 731-736).

Pawlak, Z. (1982). Rough sets: Basic notions. International Journal of Computer and Information Science, 11, 341-356.

Petry, F. E. (1996). Fuzzy databases: Principles and applications. Kluwer Academic Publishers.

Petry, F. E., Cobb, M., Ali, D., Angryk, R., Paprzycki, M., Rahimi, S., et al. (2002). Fuzzy spatial relationships and mobile agent technology in geospatial information systems. In P. Matsakis & L. M. Sztandera (Eds.), Soft computing in defining spatial relations (Soft Computing, pp. 123-155). Physica-Verlag.

Schneider, M. (1996). Modelling spatial objects with undetermined boundaries using the realm/ROSE approach. In P. A. Burrough & A. U. Frank (Eds.), Geographic objects with indeterminate boundaries (pp. 141-152). Taylor & Francis.

Schneider, M. (1997). Spatial data types for database systems: Finite resolution geometry for geographic information systems (LNCS 1288). Springer-Verlag.

Schneider, M. (1999). Uncertainty management for spatial data in databases: Fuzzy spatial data types. In Sixth International Symposium on Advances in Spatial Databases (LNCS 1651, pp. 330-351). Springer-Verlag.

Schneider, M. (2000). Metric operations on fuzzy spatial objects in databases. In Eighth ACM Symposium on Geographic Information Systems (pp. 21-26). ACM Press.
Schneider, M. (2001a). A design of topological predicates for complex crisp and fuzzy regions. In 20th International Conference on Conceptual Modeling (pp. 103-116).

Schneider, M. (2001b). Fuzzy topological predicates, their properties, and their integration into query languages. In Ninth ACM Symposium on Geographic Information Systems (pp. 9-14). ACM Press.

Schneider, M. (2003). Design and implementation of finite resolution crisp and fuzzy spatial objects. Data & Knowledge Engineering, 44(1), 81-108.

Schneider, M., & Behr, T. (2006). Topological relationships between complex spatial objects. ACM Transactions on Database Systems, 31(1), 39-81.

Shi, W., & Guo, W. (1999). Modeling topological relationships of spatial objects with uncertainties. In International Symposium on Spatial Data Quality (pp. 487-495).

Stonebraker, M., Rubenstein, B., & Guttman, A. (1983). Application of abstract data types and abstract indices to CAD databases. In ACM/IEEE Conference on Engineering Design Applications (pp. 107-113).

Tang, X., & Kainz, W. (2002). Analysis of topological relations between fuzzy regions in a general fuzzy topological space. In Joint International Symposium on Geospatial Theory, Processing and Application.

Tilove, R. B. (1980). Set membership classification: A unified approach to geometric intersection problems. IEEE Transactions on Computers, C-29, 874-883.

Usery, E. L. (1996). A conceptual framework and fuzzy set implementation for geographic features. In P. A. Burrough & A. U. Frank (Eds.), Geographic objects with indeterminate boundaries (pp. 71-85). Taylor & Francis.

Wang, F. (1994). Towards a natural language user interface: An approach of fuzzy query. International Journal of Geographical Information Systems, 8(2), 143-162.

Wang, F., & Hall, G. B. (1996). Fuzzy representation of geographical boundaries in GIS. International Journal of Geographical Information Systems, 10(5), 573-590.

Wang, F., Hall, G. B., & Subaryono. (1990). Fuzzy information representation and processing in conventional GIS software: Database design and application. International Journal of Geographical Information Systems, 4(3), 261-283.

Yazici, A., & George, R. (1999). Fuzzy database modeling. Physica-Verlag.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.

Zhan, B. F. (1997). Topological relations between fuzzy regions. In ACM Symposium on Applied Computing (pp. 192-196). ACM Press.

Zhan, B. F. (1998). Approximate analysis of topological relations between geographic regions with indeterminate boundaries. Soft Computing, 2, 28-34.

KEY TERMS

Fuzzy Spatial Algebra: It is a system of fuzzy spatial data types including a comprehensive set of fuzzy spatial operations and fuzzy spatial predicates and satisfying closure properties.

Fuzzy Spatial Data Type: It is a data type for representing a fuzzy point, fuzzy line, or fuzzy region object that can be employed as an attribute data type in a database system.

Fuzzy Spatial Query Language: This is a full-fledged query language that integrates fuzzy spatial data types, operations, predicates, modifiers, and other fuzzy concepts.

Fuzzy Topological Predicate: A fuzzy topological predicate characterizes the relative position of two fuzzy spatial objects to each other.

Geometric Anomaly: This occurs when the results of geometric set operations on fuzzy regions are, from an application standpoint, considered degeneracies like isolated or dangling point or line features and missing points and lines in the form of cuts and punctures in the interior of regions.

Regularization: It is a formal concept based on fuzzy topology that removes geometric anomalies on fuzzy regions.

Spatial Database System: It is a full-fledged database system that, in addition to the functionality of standard database systems for alphanumeric data, provides special support for the storage, retrieval, management, and querying of spatial data, that is, objects in space.

Spatial Fuzziness, Spatial Vagueness: Inherent property of many spatial objects in reality that do not have sharp boundaries or whose boundaries cannot be precisely determined.

Topological Space: It is a set X together with a collection T of subsets of X satisfying the following axioms. (a) The empty set and X are in T. (b) The union of any collection of sets in T is also in T. (c) The intersection of any pair of sets in T is also in T. The collection T is called a topology on X, and the elements of X are called points. Under this definition, the sets in T are the open sets, and their complements in X are the closed sets. The requirement that the union of any collection of open sets be
open is more stringent than simply requiring that all pairwise unions be open as the former includes unions of infinite collections of sets.
Endnotes

1. Intuitively speaking, a (crisp or fuzzy) set U is open if you can "wiggle" or change any point x in U by a small amount in any direction and still be inside U.
2. A (crisp or fuzzy) closed set contains its own boundary. Intuitively speaking, if you are outside a closed set and you wiggle a little bit, you will stay outside the set. The reason is that the complement of a closed set is open.
3. A function f : X → ℝ is upper semicontinuous :⇔ ∀ r ∈ ℝ : {x | f(x) < r} is open. (The notation :⇔ means that items are defined as being equivalent.)
4. A set X ⊆ ℝ² is called convex :⇔ ∀ p, q ∈ X ∀ λ ∈ ℝ with 0 < λ < 1 : r = λp + (1 − λ)q ∈ X (p, q, and r are here regarded as vectors).
5. The application of a function f to a set X of values is defined as f(X) = {f(x) | x ∈ X}.
6. Note that χA is a unary crisp predicate and that µÃ is a unary fuzzy predicate.
7. It also relieves us of the necessity to describe the implementation of fuzzy spatial data types. Such a description requires sophisticated concepts from computational geometry and is beyond the scope of this chapter.
Chapter XX
Fuzzy Classification in Shipwreck Scatter Analysis Yauheni Veryha ABB Corporate Research Center, Germany Jean-Yves Blot Portugal Institute of Archaeology, Portugal Joao Coelho Portugal Institute of Archaeology, Portugal
Abstract

There are many well-known applications of fuzzy sets theory in various fields of science and technology. However, we think that the area of maritime archaeology has not attracted enough attention from researchers of fuzzy sets theory in the last decades. In this chapter, we present examples of problems arising in shipwreck scatter analysis where fuzzy classification may be very useful. Using a real-world example of fragments of ceramics from an ancient shipwreck, we present an exemplary application of the fuzzy classification framework with SQL querying for data mining in archaeological information systems. Our framework can be used as a data mining tool. It can be relatively easily integrated with conventional relational databases, which are widely used in existing archaeological information systems. The main benefits of using our fuzzy classification approach include flexible and precise data analysis with user-friendly information presentation at the report generation phase.
Introduction

At the stage of the typical data classification in information systems, there appear some types
of uncertainty, for instance, when the boundaries of a class of objects are not sharply defined (Bordogna, Leporati, Lucarella, & Pasi, 2000; Kacprzyk & Zadrozny, 2000). In this case, the most
common, useful, and widely accepted approach is the introduction of data fuzzification (Bellma & Vojdani, 2000; Schindler, 1998). Fuzzy sets provide mathematical meanings to natural-language statements and become an effective solution for dealing with uncertainty. There is no simple procedure for combining various attributes that define a particular data set into one general performance measure because attributes may be measured with different scales, the relative significance of different criteria differs, and for some criteria the objective is maximization while for others it is minimization or another specific target. The approach of fuzzy sets theory with its membership functions is widely used to form a realistic aggregated description of the data set. In fuzzy sets theory, various attributes with separate scales and optimization objectives can be combined into a joint response measure: the aggregated value of membership (Zimmermann, 1992; Zimmermann and Zysno, 1980). The precise analysis of shipwreck data in maritime archaeology leads us to distinguish between ship structure and equipment, ship cargo, and personal items present on board at the time of the accident. However, the data appear strongly blurred in shallow underwater sites marked by long-term sea dynamics (Blot, 1998; Muckelroy, 1978). In some cases, the only remains of the original shipwreck are composed of fragments of durable materials associated with the cargo, ceramics containers (amphorae for instance), or coins. In such an extreme environment where no remains of the ship itself have been found yet, and neither should it be expected due to the sea environment made of protruding rocks and shallow sandy patches, no clear pattern emerges from the immediate reading of the spatial distribution of the fragments. In the meantime, some physical parameters attached to the fragments themselves may be explored. 
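The idea of merging attributes with separate scales and optimization directions into one aggregated membership value can be illustrated with a small sketch. The attribute names, scales, weights, and the weighted-mean aggregation below are illustrative assumptions only, not the operators discussed in the cited literature.

```python
# Sketch: combining heterogeneous attribute scores into one aggregated
# membership value. Normalization direction differs per attribute (for some
# criteria larger raw values are better, for others smaller). The weighted
# mean used here is just one simple aggregation operator.

def normalize(value, lo, hi, maximize=True):
    """Map a raw attribute value onto [0, 1], clamping out-of-range values."""
    x = (value - lo) / (hi - lo)
    x = min(max(x, 0.0), 1.0)
    return x if maximize else 1.0 - x

def aggregate(memberships, weights):
    """Weighted mean of per-attribute membership values (weights sum to 1)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(m * w for m, w in zip(memberships, weights))

# Hypothetical fragment scores: abrasion rated 0-10 (less is better here),
# cleavage and neat fracture rated 0-5 (more pronounced is better here).
mu = [
    normalize(3.0, 0.0, 10.0, maximize=False),  # abrasion      -> 0.7
    normalize(4.0, 0.0, 5.0),                   # cleavage      -> 0.8
    normalize(2.0, 0.0, 5.0),                   # neat fracture -> 0.4
]
score = aggregate(mu, [0.5, 0.3, 0.2])
print(round(score, 3))   # 0.67
```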
The matter has been previously discussed within a purely quantitative approach related to coins in a shallow oceanic environment (Vargiolu, Zahouani, & Blot, 2005). The case examined here deals with a strictly qualitative approach applied this time to broken
ceramics collected from a shipwreck in Peniche, Portugal: the wreck, some two millennia ago, of a small vessel coming from southwest Spain (Roman province of Baetica) with a cargo of wine amphorae (Haltern 70) and some fine ceramics (Samian ware) from Italy. Three of the parameters isolated in the original examination of the ceramics materials (abrasion, cleavage, and neat fracture) have been selected and implemented using a purely qualitative scaling incorporating linguistic terms (e.g., visible, invisible, slightly visible, etc.) and experts' opinions within a specially defined scale of points, which are later assigned to a particular shipwreck fragment.

Those parameters find an echo in sedimentary petrology (Dobkins & Folk, 1970) or in hydrology. With prehistoric tools, Shackley long ago illustrated with flint implements "how measurements of abrasion can indicate different environments, i.e. edge rounding by sand and smaller particles and percussion craters formed by impacts with larger pebbles" (Brown, 1997; Shackley, 1974). Unlike hydrology, where the search is directed toward the measurements of transport and fluxes, archaeology tends to look for clues related to "nonmovement" and physical parameters observable on the artifacts that may be relevant to the spatial analysis of a site.

In the case commented on in this chapter, our interest was directed toward ceramic items brought from the underwater site that might be associated with the initial moments of the nautical accident that occurred some two millennia ago. The experiment was thus directed toward testing the three descriptive parameters referred to above within a group of several hundred ceramic sherds observed and described by three experts or judges acting independently. Among the several biases subjacent to the experiment was the learning curve related to the personal adjustment of each expert in using unconventional, previously untested, descriptive parameters.
Fuzzy Classification in Shipwreck Scatter Analysis

Even classical statistical techniques, like principal component analysis or factor analysis, which are very robust in mathematical terms, often do not allow this sort of fuzzy information about ceramic items to be easily integrated into their models (Zimmermann, 1992). In addition, these methods assume very restrictive initial hypotheses that may not be fulfilled by the given data. As a result, one cannot obtain a global overview of the whole data set. In our research, we used fuzzy sets theory in the analysis of the properties of the found ceramic items. In fuzzy sets theory, one can combine both numerical and linguistic information, which allows obtaining a final data map over the whole data set that considers all available data in an efficient way and allows observing the relative distance among data points (Golubski, 2003; Veryha, 2001). Fuzzy classification allows extracting knowledge from large, multirelational, high-dimensional, and imperfect information sources, and closing the semantic gap between structured data and human notions and concepts (Kruse & Klose, 2002). To summarize the main benefits of fuzzy sets theory compared to classical approaches, one should mention that
•	it can be used for data sets that cannot be easily modeled mathematically,
•	it can be used to efficiently represent experts' knowledge about a problem, and
•	continuous variables may be represented by linguistic constructs that are easier to understand, making it more user friendly for end users; for example, instead of using numeric values, temperature may be characterized as cold, cool, warm, or hot.
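The last point can be made concrete with a small sketch of linguistic terms as fuzzy sets; the breakpoint values below are purely illustrative assumptions, not taken from any of the cited works.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Hypothetical linguistic terms for temperature in degrees Celsius.
terms = {
    "cold": lambda t: trapezoid(t, -40, -39, 5, 12),
    "cool": lambda t: trapezoid(t, 5, 12, 16, 20),
    "warm": lambda t: trapezoid(t, 16, 20, 27, 32),
    "hot":  lambda t: trapezoid(t, 27, 32, 50, 51),
}

t = 18.0
memberships = {name: round(mu(t), 2) for name, mu in terms.items()}
print(memberships)  # 18 °C is partly "cool" and partly "warm"
```

Note how 18 °C belongs to both cool and warm to some degree, instead of being forced across a crisp boundary.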
In a database context, there are several proposals to develop models that support fuzziness, uncertainty, and impreciseness in the real world (Carrasco, Vila, & Galindo, 2003; Kacprzyk & Zadrozny, 2002; Veryha, 2005). A number of different schemes and tools for the implementation of fuzzy sets in database management systems were proposed in recent years, such as fuzzy querying, fuzzy extension of SQL (structured query language), fuzzy object-oriented database schemes,
Figure 1. Overview of fuzzy sets implementations in the database context
and so forth (Bellma & Vojdani, 2000; Bosc & Pivert, 2000; Kacprzyk & Zadrozny, 2000). We roughly grouped those methods into three major categories, as shown in Figure 1.

•	Conventional: Methods that do not require changing existing relational database concepts
•	Hybrid: Methods that require minor changes in existing relational database concepts; for example, relational databases can still be used, but special add-ons are introduced to the database core
•	Unconventional: Methods that require significant changes in existing relational database concepts and, in many cases, require the introduction of special-purpose fuzzy object-oriented databases
Another chapter of this book, written by Kacprzyk, Zadrożny, de Tré, and de Caluwe, includes a review of flexible querying. In this book, you can also find a chapter by Urrutia, Tineo, and González studying both the SQLf and fSQL languages. In practice, despite the high importance of data mining, fuzzy methods have not been widely used in relational database systems. We think that the main reason is that the majority of proposed fuzzy methods for data mining required changing the conventional relational database structure or adding special nonrelational features to the database management tools, for example, by modifying the functionality of conventional SQL (Bosc & Pivert, 2000; Kacprzyk & Zadrozny, 2000). Most database system owners and users do not want to switch to fuzzy database structures and, as a result, they often avoid applying fuzzy concepts in their information systems. Thus, we directed our efforts toward the development of tools that allow for imprecise querying in the relational database context without SQL modifications (Veryha, 2005).
In this chapter, we implemented a framework based on fuzzy classification (Schindler, 1998; Veryha, 2005; Zimmermann & Zysno, 1980) and SQL queries for data mining and data warehouse management. The main benefit of this framework is that there is no need to modify the functionality of conventional relational databases. All manipulations can be done as an extension of the database schema by applying fuzzy data classification together with common SQL querying. Therefore, all the benefits of using fuzzy sets and fuzzy classification in data mining, such as user-friendly data presentation, the precision of the data classification, the use of linguistic variables instead of numeric values, easy-to-use facilities for querying an extended database schema, and so forth, become available to users of conventional relational databases.
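The schema-extension idea can be illustrated with a minimal sketch; this uses SQLite and invented table and column names, not the authors' actual schema, and the fuzzy values are assumed for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Initial (conventional) table: expert scores per fragment, left untouched.
cur.execute("CREATE TABLE Assessments (ItemID INTEGER, Abrasion REAL)")
cur.executemany("INSERT INTO Assessments VALUES (?, ?)",
                [(2591, 3.0), (2607, 1.0), (2613, 0.0)])

# Schema extension: a lookup table mapping atomic scale values to fuzzy
# membership in the linguistic term "visible" (illustrative values).
cur.execute("CREATE TABLE fAbrasion (AtomicValue REAL, FuzzyVisible REAL)")
cur.executemany("INSERT INTO fAbrasion VALUES (?, ?)",
                [(3.0, 1.0), (2.0, 0.67), (1.0, 0.33), (0.0, 0.0)])

# An ordinary SQL join yields the fuzzy-classified view; the database
# engine itself needs no modification.
rows = cur.execute("""
    SELECT a.ItemID, f.FuzzyVisible
    FROM Assessments a JOIN fAbrasion f ON a.Abrasion = f.AtomicValue
    ORDER BY a.ItemID
""").fetchall()
print(rows)  # [(2591, 1.0), (2607, 0.33), (2613, 0.0)]
```

The design point is that the fuzzification lives entirely in supplementary tables reached with plain SQL, which is what allows conventional databases to stay unchanged.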
shipwreck scatter analysis

One of the goals of shipwreck scatter analysis within highly disturbed and dynamic marine contexts is to explore inferences regarding the original distribution of the underwater finds (Blot, 1998; Muckelroy, 1978). If the remains of ships are preserved, they provide sound and straightforward support for the spatial analysis of artifacts lying in their vicinity (Gregory, 1995; Oxley, 1990). If not, a more intricate approach is required in order to probe potential distribution patterns among the scattered remains. Muckelroy led the way two decades ago, proving that some evidence may indeed be found underwater, even in highly scattered, nonhomogeneous shipwreck sites. In the case analyzed here, on the southern coast of the Peniche peninsula, open to the oceanic swell, scattered ceramic fragments were found in shallow water with no remains whatsoever of the original vessel. The presence of the ship itself and of its original cargo is therefore inferred from the homogeneity and strict isochrony of the ceramic materials: All fragments are dated from the same period; dating is provided by the presence of fragments of Samian
ware from Italy ranging from 15 B.C. to 15 A.D. The original cargo is further reflected in the reduced typological spectrum of the ceramic remains, the overwhelming majority of which is composed of Haltern 70 amphorae from southwest Spain, a type containing wine and wine derivatives whose chronological spectrum ranges from the mid first century B.C. to the second half of the first century A.D. In the case analyzed here, the scattered ceramics provide the single available support for the analysis of potential distribution patterns of artifacts within a noisy oceanic environment. The fuzzy analysis approach aims at distinguishing ceramic fragments potentially associated with initial short-term processes (initial fracture of the ceramic) from the sherds related to later, postdepositional processes (abrasion resulting from moving mineral particles, or cleavage and other marks resulting from impacts with larger scale elements: stones or other larger ceramic fragments). The diagram in Figure 2 will help us to understand how those parameters describing the physical
state of ceramic fragments fit into the long-term chronology of the underwater site. It illustrates the dynamic processes affecting archaeological materials after the destruction of the ship. The diagram (Figure 2) relates velocity (v) (generally less than 10 knots, or 18.5 km/h, and often much less for most commercial sailing boats) to time (t). The diagram starts with the last moments of a sailing vessel cruising at sea. Velocity as a function of time remains fairly stable, with a corresponding horizontal feature of the curve, and drops abruptly at shipwreck time (t0), until it stabilizes close to zero values. The bumpy feature of the curve at that stage (t1) corresponds to the last moments of the accident, with the sudden drop in speed (negative values of the derivative of the curve) and short-duration events corresponding to floating debris progressively stabilizing on the sea bottom (end of Phase 1 in the diagram in Figure 2). This is where clean and neat fractures are investigated among the broken ceramics from our shipwreck example in Portugal. Later, Events 2, 3, 4, 5, and so forth
Figure 2. Diagram of short- and long-term dynamics affecting ship materials before and after a shipwreck
[Figure 2 axes: velocity (≤ 10 knots) versus time t; Phase 1 covers the last moments under sail through the shipwreck at t0–t1 (hours/days); Events 2–5 at t2–t5 (months/years/centuries) correspond to later marine processes, including stormy episodes]
in the diagram in Figure 2 are related to physical modifications of the archaeological materials underwater at moments t2, t3, …, tn; they are spread at random along the later moments of the curve and correspond no longer to historical events but to sea dynamics and high-energy episodes, when sea motion interacts with the materials. The abrasion parameter on the ceramic sherds is related to the motion of mineral particles around and on the ceramic texture. Due to swell action, for instance, artifacts may be subjected to abrasion and erosion without moving themselves. Similar processes may also occur in a steady current, when the artifacts lie exposed above the sediments or at sediment level. The phenomenon is brought
to a halt when the artifacts lie fully buried under the sediment, leaving the way for other physical processes, mostly electrochemical, not taken into account in our approach. Unlike the abrasion features that may occur on a regular basis in a marine environment (daily wave-derived kinetics), the cleavage parameter in the diagram is related to the long-term series of bumps spread at random along time and correlated with high-energy episodes, namely oceanic storms. This is when the artifacts are put into movement again by the sea environment (water, sediments, rocks) or directly interact with it (ceramic fragments hit by rocks or other fragments). All physical episodes occurring after the time of the accident (t0 to t1) tend to blur the
Figure 3. Data sources in shipwreck scatter analysis: (a) fragments from several broken amphorae in one single underwater cluster, (b) largest amphora fragment (Haltern 70 shape) collected by the original finder, a spear fisherman, (c) amphora neck as found underwater (November 2004) on the top layer of the cluster of 146 fragments depicted in the image (see also 3a), (d) recording (by A. Dias Diogo) of the profile of an amphora rim (photos by J.-Y. Blot, R. Venâncio, and J. Russo during the CNANS/IPA mission at Cortiçais, Peniche, 2004)
physical information attached to the neat-fracture parameter of each ceramic fragment and to its original position once settled on the sea bottom. To give some feeling for the kind of data used in shipwreck scatter analysis, we present a few photos in Figure 3, showing some exemplary geographical, geometrical, visual, and other data that became available in the shipwreck scatter analysis from Peniche, Portugal. Shipwreck scatter analysis typically includes the following main phases, which are also common in knowledge data discovery (Han & Kamber, 2001).

•	Data Preparation and Preprocessing: This phase generally consists of cleaning imperfect data, integration of data from multiple sources, selection of relevant data, and transformation of data.
•	Data Mining: This phase consists of the application of intelligent methods to extract knowledge from data. Data mining relies on several essential tools, such as clustering, classification, and association rules.
•	Pattern Evaluation and Presentation: This phase involves the identification and evaluation of interesting and useful knowledge from the mined information and its presentation to the user in a suitable format.
Another chapter of this book includes an introduction to fuzzy data mining methods by Feil and Abonyi. The preprocessing phase is crucial because without adequate preparation
Figure 4. Application of fuzzy classification in knowledge discovery process
[Figure 4 elements: shipwreck scatter data (documents, photos, films, World Wide Web sources such as Google Earth) → data transfer and SQL querying → database → fuzzy classification → knowledge portal → expert conclusions on the reasons for ship sinking and the conditions under which a ship sank]
of data, the outcome of any data mining algorithm can be disappointing. To adequately prepare the shipwreck scatter data for further analysis, we use fuzzy sets theory in the data preparation and preprocessing phase together with fuzzy classification at the data mining level. Figure 4 demonstrates our proposed visionary application scheme for knowledge extraction (Veryha, 2003) with the help of fuzzy classification for shipwreck scatter data. First, the shipwreck-scatter-related data are collected by archaeologists and stored as documents, photos, films, and so forth, or they can even be stored directly in the relational database. Second, the collected data are preprocessed, preanalyzed, and transferred to the relational database. Third, using the fuzzy data classification framework, the shipwreck-scatter-related data are queried with conventional SQL. The extracted knowledge about data dependencies and patterns is stored in the so-called knowledge portal, which can be a common relational database with a special predefined data structure (Veryha, 2003). Fourth, the archaeological experts can access knowledge portal reports and, based on them, make their decisions about the most probable reasons and conditions under which a ship or boat sank. Fuzzy sets theory directly supports decision making, with the fuzzy membership values representing probabilities.
fuzzy classification and fSQL

Conventional nonfuzzy data classification is easy to implement using native SQL queries. The main problem of conventional data classification is that the class information is often not sufficient: Some of the classified items belong to several classes at once, and because of the precise boundary definition of class atomic values it is hard to tell to which class a given item belongs more and to which less. Thus, generated reports based on conventional data classification are often not precise enough (Schindler, 1998; Veryha, 2005).
In real-world applications, it is often very difficult to assign unique values to object attributes. Fuzzy sets theory is an effective solution for dealing with uncertainty. An important feature of fuzzy sets is that they provide a formalism for incorporating ambiguity in a typical classification scheme that combines various data. To be more specific, one can define a fuzzy set P_{X_i} for any probabilistic event X_i (i = 1…n, where n is the number of probabilistic events) in triangular form (see Figure 5) as

P_{X_i} = (p_i, q_i), \quad (1)
where p_i is the left-limit probability value of the given event X_i, at which the membership value is still 1, and q_i is the right-limit probability value of the given event X_i, at which the membership value reaches 0 (see Figure 5). The membership function P_{X_i}(p) is shown in Figure 5. Such a fuzzy set P_{X_i} can be presented, based on the graph in Figure 5, as

P_{X_i}(p) = \begin{cases} 1 & \text{if } 0 < p \le p_i \\ 1 - \dfrac{p - p_i}{\Delta p_i} & \text{if } p_i < p \le q_i \\ 0 & \text{if } q_i < p \le 1 \end{cases} \quad (2)
where p is the unique value of the probability and \Delta p_i = q_i - p_i is the difference between the nodes of the fuzzy probability, as shown in Figure 5. To present some exemplary operations on fuzzy sets, one can refer to the multiplication of fuzzy sets P_{X_i} P_{X_j} (j = 1…n), which gives the following result for P_{X_iX_j}(p) if one simply multiplies each of the components for events X_i and X_j from Equation 2:

P_{X_iX_j}(p) = \begin{cases} 1 & \text{if } 0 < p \le p_i p_j \\ 1 + \dfrac{h_{ij}}{2} - \sqrt{\left(\dfrac{h_{ij}}{2}\right)^2 + \dfrac{p - p_i p_j}{\Delta p_i \Delta p_j}} & \text{if } p_i p_j < p \le q_i q_j \\ 0 & \text{if } q_i q_j < p \le 1 \end{cases} \quad (3)
where

h_{ij} = \dfrac{\Delta p_i\, p_j + \Delta p_j\, p_i}{\Delta p_i\, \Delta p_j}

is a help function used for a compact presentation of P_{X_iX_j}(p).

Figure 5. The membership function P_{X_i}(p) using fuzzy sets theory

Similarly, one can use some parameter value on the x-axis instead of the probability p to define fuzzy sets for a particular application case. To improve the precision and user-friendliness of data classification, various fuzzy data classification methods can be used (Schindler, 1998; Zimmermann, 1992). The classification accuracy is largely dependent on the original language scale and atomic values used by experts during their assessment of pieces. The more values on the scale experts use during their assessments, the higher the subsequent classification accuracy. The membership function P_{X_i}(p) (see Figure 5) can be used to represent the highest classification accuracy possible under the given values on the expert assessment scale, avoiding a drawback of typical classification methods where mean or average values are used and, thus, a classification accuracy
decrease can be expected due to rounded values. The selection of the most suitable fuzzy data classification method is usually dependent on the given application. We selected the fuzzy classification approach described in Schindler (1998) and Veryha (2005) for shipwreck scatter data analysis because it was the most suitable for its direct implementation in relational databases. The key aspect in fuzzy classification is the introduction of linguistic variables. The area of definition of the linguistic variable is the predefined domain D with verbal terms {Term1, Term2, …, TermN}, for example, visible, invisible, and so forth, which define appropriate classes. The most important feature of linguistic variables is that every term of a linguistic variable represents a fuzzy set. The membership function of the fuzzy set is defined over the domain of the corresponding attributes. The membership of an object to a specific class can be calculated by an aggregation over all terms of the linguistic variables that define the class. The terms visible and slow, for instance, describe class C1. The membership to
class C1 is then a conjunction of the corresponding values of the membership functions µ_visible and µ_slow. There exist a number of operators that can be used to calculate conjunctions of membership function values (Zimmermann & Zysno, 1980). For example, one can apply the γ-operator, which is used as a compensatory AND and was empirically tested in Zimmermann (1992). The membership \mu_{\tilde{A},\mathrm{comp}}(x) of object x with m linguistic variables to the given classes can be calculated based on the following equation (Zimmermann & Zysno, 1980):

\mu_{\tilde{A},\mathrm{comp}}(x) = \left( \prod_{i=1}^{m} \mu_i(x) \right)^{1-\gamma} \left( 1 - \prod_{i=1}^{m} \left(1 - \mu_i(x)\right) \right)^{\gamma}, \quad x \in X,\ 0 \le \gamma \le 1, \quad (4)
where γ is a control parameter with default value 0.5 (Zimmermann & Zysno), µ_i(x) is the membership value of object x to a particular linguistic variable, and m is the number of linguistic variables. We will use this compensatory AND operator in our exemplary application for shipwreck scatter data analysis. To attract more attention to fuzzy applications, we used only conventional SQL functionality for querying data with fuzzy classification, as described in Veryha (2005). In this case, database users do not need to add any new clauses to the SQL syntax and can continue using conventional SQL syntax as if there were no fuzzy classification behind the data. We developed an interpreter as a stored procedure that translates conventional SQL commands into native SQL queries of a particular database. Users can formulate SQL queries with well-defined and familiar terms and do not need to know the definitions of equivalence classes in detail or the fuzzy classification details behind the data. In addition, all mathematical formulas for calculating membership functions are hidden from users; the interpreter takes care of them. In the developed prototype, which is similar to the one described in Veryha (2005) but more complex because it can accommodate three parameters (one parameter more than in Veryha), the following
basic functionality of SQL (or fSQL in this case) for fuzzy classified data was implemented: SELECT [INTO] FROM [WHERE].
We do not support a full SQL language syntax for fuzzy classification at the moment, but our current implementation can be easily extended to support further SQL operators like WHERE, ORDER BY, and so forth. The scheme of the developed fuzzy classification implementation in the relational database is shown in Figure 6. The implementation scheme of fSQL for fuzzy classified data querying includes the following steps.

1.	Design of database extensions (additional tables that contain linguistic variables, membership values, and descriptions of atomic values); database extensions can be generated automatically (additional programming is required in this case)
2.	Design and implementation of an interpreter for SQL transformation into native SQL for the given relational database management system using syntactical analysis of queries and execution of native SQL subqueries (grouping of objects into classes, calculation of compensatory AND membership of objects to the classes, and calculation of normalized membership of objects to classes) with parameters from fSQL clauses for fuzzy classified data querying
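The membership computations named in the second step can be sketched as follows; this is a simplified Python illustration of Equation 4 and of one plausible normalization over classes, not the authors' Transact-SQL implementation, and the example membership values are invented.

```python
from math import prod

def compensatory_and(memberships, gamma=0.5):
    """γ-operator (Zimmermann & Zysno): compensatory AND over membership values."""
    product = prod(memberships)
    bounded_sum = 1 - prod(1 - mu for mu in memberships)
    return (product ** (1 - gamma)) * (bounded_sum ** gamma)

def normalized_class_memberships(class_memberships):
    """Scale raw class memberships so that they sum to 1 across all classes."""
    total = sum(class_memberships.values())
    if total == 0:
        return class_memberships
    return {c: mu / total for c, mu in class_memberships.items()}

# Example: membership of one object to two classes, each defined by
# three linguistic-variable membership values (illustrative numbers).
raw = {
    "C1": compensatory_and([0.9, 0.8, 1.0]),
    "C2": compensatory_and([0.1, 0.2, 0.0]),
}
print(normalized_class_memberships(raw))  # → {'C1': 1.0, 'C2': 0.0}
```

With γ = 0.5, the operator compensates between the strict product (logical AND) and the algebraic sum (logical OR), which is exactly what makes the class memberships tolerant of one weak attribute.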
To illustrate the implementation of fuzzy classification in shipwreck scatter analysis, we present in the next section a real-world example for fragments of ceramics from an ancient shipwreck.
exemplary implementation for wear parameters

An important area of fuzzy classification application in shipwreck scatter analysis is the accurate classification of pieces of ceramics into classes and, thus, the identification of the most valuable pieces with the help of membership values. The main benefit is that one can easily combine assessments from various experts into one criterion using user-friendly linguistic variables. Later on, based on the fuzzy classified data, one can draw a final conclusion incorporating viewpoints from different experts to formulate the most probable reasons and conditions under which a ship sank. Large volumes of data about ceramic fragments found in the sea are currently stored in conventional relational database management systems, like Microsoft SQL Server, Microsoft Access, or Oracle. Database administrators are usually very reluctant to make any significant changes (change of data structures, merging of internal data with external data, etc.) to such databases, to avoid data loss. Thus, to implement fuzzy classification, we
needed the fSQL approach with conventional SQL querying, as described above, where there is no need to modify the functionality of conventional relational databases. All data manipulations had to be done as an extension of the database schema by applying fuzzy data classification together with common SQL querying. The sample data used for the exemplary fuzzy classification implementation consist of broken fragments of ceramics collected from an underwater site located on the southern coast of the Peniche peninsula in Portugal. The underwater site lies among sharp protruding rocks where the ceramic fragments are found in crevices under a shallow but variable layer of coarse sand. A clear isochrony (same period) for all ceramics identified leads us to interpret the site as the wreck of a small ship with a cargo of amphorae from southwest Spain (Baetica). The shape is defined as Haltern 70, an amphora type used for the transport of wine and wine derivatives. The dating (Haltern 70 amphorae cover a range from the mid first century B.C. to the second half of the first century A.D.) is refined by the presence of several fragments of Italian fine work
Figure 6. Scheme of fuzzy classification framework implementation in the relational database (elements: SQL queries for fuzzy classified data → interpreter as stored procedure → native SQL queries → database tables, comprising initial database tables and extensions for fuzzy data classification → reports and views with fuzzy data classification)
(sigillata) dating back to 15 B.C. to 15 A.D. The underwater site is exposed to the southwest wind and swell associated with major low-pressure events in this area of the southwest Iberian Atlantic facade. The resulting effects of local dynamics are visible on the ceramic fragments, which have been weighed and observed from a wear-type point of view. In those circumstances, fragments have been described according to three factors.

•	Abrasion (A): Abrasion is characterized by rounded corners and a coarse texture of the clay. It is visible as a rounding effect along corners of the ceramics and on the surface texture (see Figure 7a). The coarse elements included in the clay are particularly visible after the fine clay external layer of the amphora fragments has been eroded.
•	Cleavage (C): Cleavage describes horizontal or tilted fractures or marks of impacts (see Figure 7b). It appears as the result of localized impacts on the ceramics. The coarse nature of the clay, with frequent discontinuities in the original clay used by Baetican potters 2,000 years ago, facilitates the cleavage of the amphora fragments, leading to items with a substantially reduced thickness.
•	Neat Fracture (F): A neat fracture is one that is not abraded (see Figure 7c); it is expected in recently broken ceramics.
All three factors A, C, and F are described along a natural-language scale corresponding to four degrees resulting from the visual estimates.

•	Very visible: 3
•	Barely visible: 2
•	Almost invisible: 1
•	Invisible: 0
Figure 7. Definition of wear type factors: (a) abrasion, (b) cleavage, and (c) neat fracture (adapted from Blot et al., 2006)
Practice revealed that subvalues, that is, 0.5, 1.5, and 2.5, were in some cases used by experts. The usage of those intermediate values was left to the personal decision of each expert. The detailed observation of 2,908 ceramic fragments brought from the site after the first intensive survey campaign (May 2005) led us to conclude that in most cases the three wear-type factors commented on above are combined at different levels within one single fragment. In terms of shipwreck-pattern potential inferences, the F factor appears as the most promising since a maximum value of this factor may be attached to a fragment that has not suffered from transportation or exposure to the sea elements since the very moment of the nautical accident 2,000 years ago. It can therefore be highly correlated with the spot where the ceramic vase was broken during or shortly after the accident. As expected, the wear-type attributes are highly dependent on subjective factors attached to the expert and his or her experience in the field or with the specific collection of fragments. These aspects are reflected in the parameter estimate variations from one person (expert) to another. The extract from the results of this test analysis (an assessment of the visual state of ceramic fragments as part of shipwreck scatter analysis) is presented in Table 1, where appropriate values (as described above) for the parameters abrasion, cleavage, and neat fracture
Table 1. Extract from results of test analysis of ceramic fragments found near the Peniche peninsula in Portugal
were provided for all found ceramic fragments, identified with indices 2591, 2613, 2607, and so forth, by three independent experts. Table 1 contains raw data about the visual state of ceramic fragments, which require further processing (classification) to identify any possible correlations between the found items. To classify ceramic fragments, the following classes were introduced with the help of the linguistic variables visible and invisible.
•	C1 (A, invisible; C, invisible; F, visible): It was broken during the shipwreck and was not affected by further kinetics from the sea environment.
•	C2 (A, visible; C, invisible; F, invisible): Due to long-term actions of the sea, the original fracture was invisible under the wear due to abrasion.
•	C3 (A, invisible; C, visible; F, invisible): It was affected by a sudden action of the sea. The clay was cleaved under some related shock, not at the time of the original shipwreck. It might have been due to storm action.
•	C4 (A, invisible; C, invisible; F, invisible): This is a supplementary class for fuzzy implementation.
•	C5 (A, invisible; C, visible; F, visible): This is a supplementary class for fuzzy implementation.
•	C6 (A, visible; C, invisible; F, visible): This is a supplementary class for fuzzy implementation.
•	C7 (A, visible; C, visible; F, invisible): This is a supplementary class for fuzzy implementation.
•	C8 (A, visible; C, visible; F, visible): This is a supplementary class for fuzzy implementation.
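One possible reading of this classification step can be sketched in a few lines; this is only an illustration, assuming a simple linear fuzzification of the 0–3 expert scale to µ_visible (with µ_invisible = 1 − µ_visible) and the γ-operator of Equation 4, and the sample fragment scores are invented.

```python
from math import prod

# The eight classes: each pattern states whether abrasion (A), cleavage (C),
# and neat fracture (F) should be "visible" (True) or "invisible" (False).
CLASSES = {f"C{i + 1}": pattern for i, pattern in enumerate([
    (False, False, True),   # C1: only the neat fracture is visible
    (True, False, False),   # C2
    (False, True, False),   # C3
    (False, False, False),  # C4
    (False, True, True),    # C5
    (True, False, True),    # C6
    (True, True, False),    # C7
    (True, True, True),     # C8
])}

def mu_visible(score, scale_max=3.0):
    """Assumed linear fuzzification of the 0-3 expert scale to [0, 1]."""
    return score / scale_max

def compensatory_and(memberships, gamma=0.5):
    """γ-operator of Equation 4 (compensatory AND)."""
    p = prod(memberships)
    s = 1 - prod(1 - mu for mu in memberships)
    return (p ** (1 - gamma)) * (s ** gamma)

def classify(a, c, f):
    """Membership of one fragment (A, C, F scores on the 0-3 scale) to C1-C8."""
    visible = [mu_visible(x) for x in (a, c, f)]
    result = {}
    for name, pattern in CLASSES.items():
        mus = [v if wanted else 1 - v for v, wanted in zip(visible, pattern)]
        result[name] = round(compensatory_and(mus), 3)
    return result

# Invented sample fragment: abrasion invisible, cleavage invisible, fracture very visible.
print(classify(0.0, 0.0, 3.0))  # full membership in C1, zero elsewhere
```

A fragment with intermediate scores would instead receive partial memberships in several classes at once, which is the behavior that distinguishes this approach from crisp classification.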
Figure 8. Exemplary database schema for fuzzy classification (initial tables plus supplementary tables for fuzzy classification)
The Microsoft SQL Server Express database management system was chosen for the exemplary fuzzy classification implementation because of its powerful Transact-SQL language features, used to write the fuzzy SQL interpreter, and because of its free licensing. The syntactical analysis (typical extraction of clauses from the query string) of SQL queries for fuzzy classified data was implemented using Transact-SQL string functions (SUBSTRING, LTRIM, RTRIM, LEFT, and PATINDEX) and the embedded Microsoft SQL Server Express stored procedure (sp_executesql). The linguistic variables visible and invisible were mapped to the original atomic values very visible, barely visible, almost invisible, and invisible with the fuzzy values {1, 0.83, 0.67, 0.5, 0.33, 0.17, 0} for our original point scale of {3, 2.5, 2, 1.5, 1, 0.5, 0}. First, initial data about ceramic fragments and experts were imported into tables Assessments, Experts, and Items with primary keys ExpertID and ItemID (see Figure 8).
Second, supplementary tables with the definitions of classes (tables fClasses and fClassesDefinition) and linguistic variables for wear-type parameters (tables fFracture, fAbrasion, fCleavage, and fLinguisticVariables) were added (see Figure 8). The AtomicValue column in tables fFracture, fAbrasion, and fCleavage is the actual numeric value for the given parameter, and FuzzyValue is the respective fuzzy value based on data fuzzification (see Figure 5). Third, the stored procedure fSQL (see Figure 9) was implemented as described previously: we first performed a syntactical analysis of the input string parameter (the SQL string for fuzzy classification), then grouped parts of the input string into the expected clauses (SELECT, INTO, FROM, and WHERE), calculated the compensatory AND membership to predefined classes, and, finally, calculated the normalized membership to classes, which was output as a result as shown in Figure 9. The full source code of the fSQL stored procedure can be obtained from the authors for
Figure 9. fSQL stored procedure for fuzzy classification and exemplary query results (showing part of the fSQL stored procedure source code and an fSQL stored procedure call with a conventional SQL statement as string parameter)
free. Now, using the following simple SQL look-and-feel queries, we generated fuzzy classification reports with the help of the stored procedure fSQL, as presented in Tables 2 and 3:

•	execute fSQL "SELECT Item FROM Assessments" (see Table 2 for partial results), and
•	execute fSQL "SELECT Item INTO MyView FROM Assessments."
Table 2. Partial results of fSQL statement (Execute fSQL “SELECT Item FROM Assessments”)
Calculations of membership functions to various classes were performed using the compensatory AND operator (see Equation 4 for more details). We think that users of the presented fuzzy data analysis will benefit from the given example due to the following.
It classifies data precisely (a single view incorporates aggregated fuzzy membership as the assessment criterion): if an item belongs to a given class with fuzzy membership 1, one can be almost certain the classification is correct, whereas items with memberships below 0.8 are suspicious and may require additional investigation. A key feature of the fuzzy classification approach is that the same item can be classified into multiple classes; the membership values to the various classes provide the required classification measures for each item.
The generated data can be put directly into database views for use in other reports, for example, to print extended reports with detailed class names. The number of attributes needed to describe items is significantly reduced; one can substitute them with only the two linguistic variables visible and invisible. At the same time, no data precision is lost; data uncertainty is simply delegated to the fuzzy membership values.
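The normalization of class memberships mentioned earlier (each item's memberships are scaled to sum to 1 across classes) can be sketched as follows; the function name and dictionary layout are illustrative only:

```python
def normalize_memberships(class_memberships):
    """Scale an item's per-class membership degrees so that they sum
    to 1, turning raw compensatory-AND outputs into a normalized
    class-membership profile. Input: {class_name: membership}."""
    total = sum(class_memberships.values())
    if total == 0:
        return {c: 0.0 for c in class_memberships}
    return {c: v / total for c, v in class_memberships.items()}
```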
Table 4. Results of our exemplary analysis of 267 ceramic fragments

Class   Abrasion    Cleavage    Fracture    Fragments with 100% membership
1       invisible   invisible   visible       5
2       visible     invisible   invisible     3
6       visible     invisible   visible       1
7       visible     visible     invisible    37
8       visible     visible     visible       3
                                Total:       49

In the experience described above, which dealt with 267 ceramic fragments, eight classes (C1 to C8) were isolated, although cases of 100% membership were verified in only five classes (C1, C2, C6, C7, and C8). Contrasted against the whole population of 267 broken ceramics, the 49 fragments with 100% membership within a class represent only a small portion of the whole lot. Among the eight classes isolated through the fuzzy logic procedures, class C7 was the largest, reflecting the overwhelming presence of sea-dynamics (abrasion and impact) phenomena on the underwater site investigated (see Table 4). In contrast to the obvious nature of class C7 (abrasion and impact) for fragments deeply marked by sea dynamics, or of class C1 (no trace of abrasion or impact, only a neat fracture), the exact nature of the scarcely represented classes (e.g., classes C8 and C6) requires further interpretation. Unlike fragments from class C1, which reflects the existence of ceramic debris buried shortly after their initial fragmentation, the items in the elusive class C8 may reflect a secondary neat fracture occurring on ceramic materials already intensely eroded by sea action. In this case, class C8 may simply be considered a subclass of class C7 within the general category
of ceramic fragments subjected to long-term shallow-water sea dynamics. From an archaeological point of view, only class C1 therefore appears to have the heuristic potential to explore the original scatter pattern of the shipwreck and the first moments of the fragmentation of the ceramic containers held in the hold of a small ship. Fuzzy set theory becomes particularly important for identifying all other possible members of class C1; for example, all pieces of ceramics with a fuzzy membership to class C1 (see Table 2) of more than 0.4 (an empirical value) become interesting for a more detailed investigation of their possible belonging to this very important class. In our approach, one can obtain such valuable classification results very easily with a single fuzzy SQL query. In the future, we plan to integrate fuzzy SQL querying into the complex knowledge discovery environment shown in Figure 4 and extend it for use in fuzzy clustering (Galindo, Urrutia, & Piattini, 2006).
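The candidate selection described here (flagging every item whose membership to class C1 exceeds the empirical 0.4 threshold) amounts to a simple filter over the fuzzy classification result. A sketch with hypothetical item IDs and an invented function name:

```python
def candidate_members(memberships, cls, threshold=0.4):
    """Return item IDs whose fuzzy membership to the given class
    exceeds the (empirical) threshold -- the post-query filtering
    step described in the text. `memberships` maps each item ID to
    a {class_name: membership} dictionary."""
    return sorted(item for item, m in memberships.items()
                  if m.get(cls, 0.0) > threshold)
```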
Conclusion

Based on a real-world example of fragments of ceramics from an ancient shipwreck, we have presented an exemplary application of fuzzy classification (a more detailed discussion of the archaeological materials, related features, and marine context may be consulted online in Blot et al., 2006). We think that fuzzy classification can be useful for discovering interesting and useful knowledge, such as the patterns, trends, and associations met in shipwreck scatter analysis. First, using our fuzzy classification approach, we have shown that it is quite easy to obtain a user-friendly representation of the wear-type parameters of fragments of ceramics from an ancient shipwreck. Second, the fuzzy classification approach helped us to classify fragments of ceramics by detecting intrinsic classes and neighborhood relations, keeping high precision of data classification in comparison to classical methods, which usually perform badly on data sets coming from uncertain environments where often only linguistic variables and experts' opinions are available. The main benefits of our fuzzy classification framework for relational databases with SQL querying are the following.

1. It provides user-friendly data presentation with descriptive SQL-based queries, linguistic variables, and aggregated fuzzy membership values, which reduces the time spent extracting the required information from a given database.
2. It integrates well with conventional relational databases; for example, the database administrator does not have to change the underlying databases, which are, in practice, very large. With the developed approach, one operates mainly at the database schema level by simply adding fuzzy classification relations with linguistic variables and later querying those data using SQL-based queries.
The main drawback of our fuzzy classification framework for relational databases with SQL querying is that it may not be generic enough to support the easy addition of new aggregation operations. As a result, reencoding of the stored procedure would be necessary under different circumstances. To implement the fuzzy classification framework for shipwreck scatter analysis, a prototype based on Microsoft SQL Server Express was developed. The interpreter for SQL querying of fuzzy classified data was written in Transact-SQL, with the benefits of storing the interpreter as a stored procedure on the server side and providing easy access for all database users. Future work will include testing the developed fuzzy classification framework on larger data sets from shipwreck scatter analysis.
Acknowledgment

The authors would like to thank the archaeologists Antonio Dias Diogo, Alessia Amato, and Sonia Bombico for their valuable help in the collection and assessment of ceramic fragments from the Peniche peninsula in Portugal.
References

Bellma, M., & Vojdani, N. (2000). Fuzzy prototypes for fuzzy data mining. In Studies in fuzziness and soft computing (Vol. 39, pp. 175-286). New York: Springer.
Blot, J.-Y. (1998). From Peru to Europe, 1784-1786: First steps in the analysis of a ship overload. Bulletin of the Australian Museum of Maritime Archaeology, 22, 21-34.
Blot, J.-Y., Diogo, A. D., Almeida, M. J., Venâncio, R., Veryha, Y., Maricato, C., et al. (2006). O sítio submarino dos cortiçais (Costa Meridional da Antiga Ilha de Peniche). In Actas das primeiras jornadas de património de Peniche. Peniche, Portugal.
Bordogna, G., Leporati, A., Lucarella, D., & Pasi, G. (2000). The fuzzy object-oriented database
management system. In Studies in fuzziness and soft computing (Vol. 53, pp. 209-236). New York: Springer.
Bosc, P., & Pivert, O. (2000). SQLf query functionality on top of a regular relational database management system. In Studies in fuzziness and soft computing (Vol. 39, pp. 171-191). New York: Springer.
Brown, A. G. (1997). Alluvial geo-archaeology: Floodplain archaeology and environmental change. Cambridge, United Kingdom: Cambridge University Press.
Carrasco, R. A., Vila, M. A., & Galindo, J. (2003). FSQL: A flexible query language for data mining. In M. Piattini, J. Filipe, & J. Braz (Eds.), Enterprise information systems IV (pp. 68-74). Dordrecht, The Netherlands: Kluwer Academic Publishers.
Dobkins, J. E., & Folk, R. L. (1970). Shape development on Tahiti-Nui. Journal of Sedimentary Petrology, 40(4), 1167-1203.
Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing.
Golubski, W. (2003). A software tool for fuzzy regression analysis. In Proceedings of International Conference on Computational Intelligence for Modeling, Control & Automation (pp. 567-574). Vienna.
Gregory, D. (1995). Experiments into the deterioration characteristics of materials on the Duart Point wreck site: An interim report. International Journal of Nautical Archaeology, 24(1), 61-65.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. New York: Morgan Kaufmann Publishers.
Kacprzyk, J., & Zadrozny, S. (2000). On combining intelligent querying and data mining using fuzzy logic concepts. In Studies in fuzziness and soft computing (Vol. 53, pp. 67-84). New York: Springer.
Kacprzyk, J., & Zadrozny, S. (2002). Linguistic data summaries: Towards an increased role of natural language in data mining. In Proceedings of Eighth IEEE International Conference on Methods and Models in Automation and Robotics (pp. 121-126). Szczecin, Poland.
Kruse, R., & Klose, A. (2002). Information mining with fuzzy methods: Trends and current challenges. In Proceedings of Eighth IEEE International Conference on Methods and Models in Automation and Robotics (pp. 117-120). Szczecin, Poland.
Meier, A., Savary, C., Schindler, G., & Veryha, Y. (2001). Database schema with fuzzy classification and classification query language. In Proceedings of International Congress on Computational Intelligence: Methods and Applications (pp. 243-248). Bangor, United Kingdom.
Muckelroy, K. (1978). Maritime archaeology. Cambridge, United Kingdom: Cambridge University Press.
Oxley, I. (1990). Factors affecting the preservation of underwater archaeological sites. International Journal of Nautical Archaeology, 19(4), 340-341.
Schindler, G. (1998). Fuzzy datenanalyse durch kontextbasierte datenbankanfragen. Wiesbaden, Germany: DUV/Gabler.
Shackley, M. R. (1974). Stream abrasion of flint implements. Nature, 248, 501-502.
Vargiolu, R., Zahouani, H., & Blot, J.-Y. (2005). Analyse de la topographie des pièces de monnaie du San Pedro de Alcantara. Revista Portuguesa de Arqueologia, 8(1), 433-457.
Veryha, Y. (2001). Neural network error accommodation with fuzzy logic elements in robot time optimal path tracking. Proceedings of the National Science Council of Taiwan, ROC(A), 25(6), 367-376.
Veryha, Y. (2003). Enterprise knowledge discovery and management using semantic knowledge processing technology. Journal of Information & Knowledge Management, 2(1), 33-41.
Veryha, Y. (2005). Implementation of fuzzy classification in relational databases using conventional SQL querying. Information and Software Technology, 47, 357-364.
Zimmermann, H. (1992). Fuzzy set theory: And its applications. London: Kluwer.
Zimmermann, H., & Zysno, P. (1980). Latent connectives in human decision making. Fuzzy Sets and Systems, 4(1), 37-51.
Key Terms

Data Mining: Data mining, also called knowledge discovery in databases or knowledge discovery and data mining, is the process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining, clustering, and so forth. Data mining is a complex topic, has links with multiple core fields such as computer science, and adds value to rich seminal computational techniques from statistics, information retrieval, machine learning, and pattern recognition.

Fuzzy Set Operations: A fuzzy set operation is an operation on fuzzy sets. These operations are generalizations of crisp set operations. There is more than one possible generalization. The most widely used operations are called standard fuzzy set operations. There are three operations: fuzzy complements, fuzzy intersections, and fuzzy unions.

Fuzzy Sets: Fuzzy sets are an extension of classical set theory and are used in fuzzy logic. In classical set theory, the membership of elements in relation to a set is assessed in binary terms according to a crisp condition: An element either belongs or does not belong to the set. By contrast, fuzzy set theory permits the gradual assessment of the membership of elements in relation to a set; this is described with the aid of a membership function.

Maritime Archaeology: Maritime archaeology (also known as marine archaeology) is a discipline that studies human interaction with the sea, lakes, and rivers through the study of vessels, shoreside facilities, cargoes, human remains, and submerged landscapes. One specialty is underwater archaeology, which studies the past through any submerged remains. Another specialty within maritime archaeology is nautical archaeology, which studies vessel construction and use.

Membership Function: The membership function of a fuzzy set is a generalization of the indicator function in classical sets. In fuzzy logic, it represents the degree of truth as an extension of valuation. Degrees of truth are often confused with probabilities; however, they are conceptually distinct because fuzzy truth represents membership in vaguely defined sets, not the likelihood of some event or condition.

Petrology: Petrology is a field of geology that focuses on the study of rocks and the conditions by which they form. There are three branches of petrology corresponding to the three types of rocks: igneous, metamorphic, and sedimentary. The word petrology itself comes from the Greek word petra, meaning rock. The word lithology once was approximately synonymous with petrography, but today lithology is essentially a subdivision of petrology focusing on macroscopic hand-sample or outcrop-scale descriptions of rocks.

Structured Query Language: Structured query language (SQL) is the most popular computer language used to create, modify, retrieve, and manipulate data from relational database management systems. The language has evolved beyond its original purpose to support object-relational database management systems. It is an ANSI/ISO standard.
Chapter XXI
Fabric Database and Fuzzy Logic Models for Evaluating Fabric Performance

Yan Chen, Louisiana State University Agricultural Center, USA
Graham H. Rong, Massachusetts Institute of Technology, USA
Jianhua Chen, Louisiana State University, USA
ABSTRACT

A Web-based fabric database is introduced in terms of its physical structure, software system architecture, basic and intelligent search engines, and various display methods for search results. A fuzzy linear clustering method is used to predict the fabric drape coefficient from fabric mechanical and structural properties. Experimental data indicate that fuzzy linear clustering is quite effective for this purpose. A hybrid method combining fuzzy linear clustering with K-nearest neighbor is also applied to the prediction of the fabric drape coefficient, with improved prediction accuracy. The study also reveals that the fuzzy linear clustering method can be used for predicting fabric tailorability with good prediction accuracy. Mathematical principles of fuzzy comprehensive evaluation are summarized, and a typical application for assessing fabric comfort is exhibited. Through the fuzzy calculation, a single numerical value is produced to express female preferences for six fabric types for use in blouses, slacks, and underpants with respect to fabric property changes in an incremental-wear trial. Finally, a neuro-fuzzy computing technique for evaluating nonwoven fabric softness is presented. The combined use of the fuzzy logic models (CANFIS) and the neural network method makes a significant step toward launching a fabric database application for neural network computing as a routine laboratory evaluation.
INTRODUCTION

Fabric end-use performance is determined by fabric mechanical and physical properties. Many aspects of performance, such as softness, tactility, and comfort, are assessed only by physiological responses and subjective judgment because of their physical complexity and users' preferences. Instrumental approaches for directly measuring these fabric quality features are still limited to date. However, with the availability of high-performance computers and advanced computing techniques, opportunities to solve these problems become more realistic. Internet technology is making a notable impact on the traditional textile industry. This revolutionary information technology is helping textile manufacturers to enhance their competitiveness in production management and marketing. As Internet users communicate with each other across networked computers, diverse business applications, ranging from the design of textile products to clothing retailing, are popping up through the Internet. For instance, to meet fashion designers' increasing desire for online fashion tracking, a Web site called Worth Global Style Network (WGSN, http://www.wgsn.com) was established in London. This Web site provides the fashion and style industries with trend-watching news and services, including resources for yarn, fabrics, and garment accessories, and graphics of updated design styles and fashion trends. The information, which covers more than 250,000 pages, comes from a team of 150 designers, trend analysts, journalists, and photographers all over the world. Another Web site, TextileWeb (http://www.textileweb.com), was developed as a community for professionals of the textile industry. It provides product information (buyers' guide and marketplace) and professional services (job search and training). Online shopping is the ultimate desire for both manufacturers and consumers and is driving Internet technology toward e-commerce.
More and more clothing retailers favor a strong Internet presence to promote online shopping. A recent example of this can be seen with the retailer Neiman Marcus
launching a $24 million Web site investment with new multimedia applications (Kemp & Lewis, 2000). The company hopes that the new investment will extend its merchandising strategy and promise to make the online shopping experience more realistic. Today, apparel retailing holds the second place for online sales, next to long-term e-business leader online travel. It is reported that the online sales of apparel, footwear, and accessories have risen to $18.3 billion in 2006, and are expected to reach $22.1 billion in 2007 (Dilworth, 2007). All these figures indicate that the textile and clothing industries will further stimulate the IT industry to develop new technologies for accelerating e-commerce capabilities. Although the IT achievements are significant, online fabric sourcing and shopping still has many obstacles to overcome. Technology, customer service, and distribution management are all challenging apparel manufacturers and retailers. From a technical point of view, apparel design and manufacturing is still more a kind of art than science. For example, fabric quality is mainly assessed by experts’ subjective impression by hand. This traditional skill is still commonly used by fabric finishers, bespoke tailors, and even mass-production garment makers. Thus, few apparel designers care about the importance of fabric physical properties and about how to determine these properties and incorporate them into their designs. Garment making is on a trial-and-error basis. Garment quality relies largely on technicians’ experience and operators’ skill. However, with the severe shortage of experienced textile engineers and hand evaluation experts, the traditional approach is now not practical. Moreover, as automation increases in garment manufacturing, the determination of fabric properties becomes more and more necessary for the control of interaction between making-up machines and fabric materials. 
Instrumental measurement of fabric properties is therefore an important technology (Hearle, 1993a). Online evaluation of fabric performance in terms of hand, comfort, tailorability, and drapability mainly depends on two technical aspects.
One is the development of a fabric database that includes instrumental data for describing fabric mechanical and physical properties. The other is the identification of mathematical approaches capable of establishing physical models for the prediction or grading of fabric performance. Instrumental methods for measuring fabric’s different properties are mostly available today. The computing techniques for Web-based fabric databases and intelligent search engines can also be obtained inexpensively to meet various end-use needs in the textile and apparel industries. A key issue is the employment of the right algorithms for the right evaluation models. In reality, the search criteria for fashion trends and suitable fabric materials are often fuzzy. For example, the answer to whether or not a particular garment design is fashionable depends on many factors and cannot be easily quantified. Similarly, the judgment of whether or not (or how much) a fabric material is suitable for a specific garment, or whether a garment made from a fabric is comfortable is a fuzzy one. There is also quite some degree of fuzziness in judging whether a fabric is soft or not. Moreover, various fabric properties are related to each other in a fuzzy way. The fabrics stored in the online database are characterized by more than a dozen mechanical properties (such as tensile, shear, etc.) and fabric contents, structure, and end-use information. In addition, a fabric also has other properties such as appearance and tailorability. Intuitively, these more intangible properties are related to mechanical properties of a fabric in a subtle way. A systematic prediction method is desired so that fuzzy queries for fabrics with desired appearance and performance such as tailorability can be answered. In spite of some works on fabric evaluation and classification (Y. Chen, Collier, & Collier, 1999; Y. 
Chen, Collier, Hu, & Quebedeaux, 2000), little is known in the textile literature about predicting fabric appearance and tailorability, as well as comfort and softness, from mechanical properties. This chapter introduces a fabric database and specific fuzzy logic methods for assessing the softness, comfort, drapability,
and tailorability of apparel and consumer fabrics according to fabric mechanical and physical properties. The physical structure and software system architecture of the Web-based database are described. Basic and intelligent search engines provided by this database are demonstrated. The search results can be displayed on the Web page in different formats, such as tables, charts, and dynamic images. Mathematical principles of fuzzy comprehensive evaluation are summarized. The algorithms of fuzzy C-means clustering and fuzzy linear clustering (FLC) are illustrated in the application of establishing the intelligent search engines. A hybrid neuro-fuzzy model called coactive neuro-fuzzy inference systems (CANFIS) is also introduced for use as a preprocessor to perform fuzzy transformation in neural network computing. As application examples, three case studies are presented. The first case is the use of a hybrid method combining the K-nearest neighbor method with fuzzy linear clustering to improve the prediction accuracy for fabric drapability. The second case is the fuzzy comprehensive evaluation of apparel fabric comfort using the fuzzy model M(•, ⊕). The last case is the application of the neuro-fuzzy model for grading fabric softness. The combined use of the fuzzy logic models and the neural network method makes a significant step toward launching a fabric database application on an office PC. This helps meet the industries' requirements for routine implementation, dynamic updates, and cost-effectiveness in evaluating textile material quality.
PROGRESS OVERVIEW

Fuzzy sets and neural networks have been used extensively to solve problems in engineering and science, health care, and financial applications. Applications of fuzzy and neural computing techniques in the fabric and apparel manufacturing industry are gaining more interest from researchers in the IT and textile fields. A fuzzy logic based approach was reported that could detect fabric defects in real time (during the weaving process) and
control the weaving process (Dorrity, Vachtsevanos, & Jasper, 1996). A back-propagation neural network was used to detect fabric defects (Kuo & Lee, 2003). Researchers have also used fuzzy C-means clustering for automatic fabric print pattern recognition (Kuo, Shih, Kao, & Lee, 2005; Kuo, Shih, & Lee, 2004). We have presented our work in using fuzzy linear clustering for fabric drape prediction (J. Chen, Chen, Zhang, & Gider, 2002). Fuzzy clustering has been used widely in various applications. The fuzzy linear clustering method was developed independently by two different research groups (Hathaway & Bezdek, 1993; Kundu & Chen, 1994). The study showed that fuzzy linear clusters possess the nice property of being invariant under linear transformations. The applications of fuzzy linear clustering in fuzzy control rule learning have also been investigated (J. Chen & Kundu, 1999; Mikulcic & Chen, 1996; Sabharwal & Chen, 1996). These results indicate that the fuzzy linear clustering method is very useful for capturing linear patterns in data, and that the method has a strong generalization capability for function approximation. We use fuzzy linear clustering for fabric drape prediction and fabric tailorability prediction in the current work. Approaches to evaluating fabric hand can be categorized as subjective assessment and objective assessment. Many textile experts have contributed significantly to the study of subjective assessment and its research literature (Brand, 1964; Ellis & Garnsworthy, 1980; Howorth, 1964). Other textile scientists have made great efforts in the objective evaluation of fabric hand (Kawabata & Niwa, 1989; Pan, Yen, Zhao, & Yang, 1988a, 1988b, 1988c; Postle & Mahar, 1982). Recently, the method of fuzzy comprehensive evaluation was used to grade fabric hand (Raheel & Liu, 1991; Rong & Slater, 1992).
A multivariate statistical method of discriminant analysis was also proposed to establish a nonlinear discriminant function for predicting end uses of apparel fabrics (Y. Chen & Collier, 1997). With the rapid development of computer hardware and software, neural network
techniques have been adopted for modeling complex nonlinear problems of textiles. Many neural network models, such as parallel distributed processing and connectionist and adaptive systems, have been applied to the determination of fiber spinnability (Pynckels, Kiekens, Sette, Langenhove, & Impe, 1995), yarn strength prediction (Cheng & Adams, 1995), fabric classification (Barrett, Clapp, & Titus, 1996), fabric fault identification (P. W. Chen, Liang, Yau, & Lin, 1998), fabric performance evaluation (Gong & Chen, 1999), and seam pucker rating (Chang & Tae, 1997).
FABRIC DATABASE

An initial objective of this research is to establish an online intelligent database server that will help clothing manufacturers and retailers to pinpoint desired fabrics that match fashion trends in color, drape, and style; to narrow fabric selections to fabrics possessing good physical properties that ensure high quality of garment products; to find better-buy fabrics; and to locate fabric manufacturers and determine the earliest shipping dates of roll materials. This database server provides a dynamic fabric databank consisting of structural parameters, mechanical properties, fabric drape images, making-up process information, and contact information of fabric manufacturers. This chapter reports research progress on the establishment of the database server, the acquisition of apparel fabrics and fabric property measurement, Web site design and activation, database construction, and data mining programming.
Physical Structure

The physical structure of this online fabric database is illustrated in Figure 1. The system can be accessed by any client through the Internet. The central piece of the system is a networked PC server running Microsoft Active Server Pages and DB2 database software. The server stores all fabric information (the fabric bank) and the database code. In the present
Figure 1. Database structure
stage, the database includes 185 apparel fabrics from different fabric manufacturers. Mechanical properties of these fabrics have been tested using the instruments of the Kawabata Evaluation System for Fabrics (KES-FB). The KES-FB instruments consist of a tensile and shear tester, a pure bending tester, a compression tester, and a friction and roughness tester, which can be used for sensitive measurements of basic fabric deformations (Kawabata & Niwa, 1989). Fabric drapability has been measured using a Cusick drape tester. On this tester, a fabric circle is draped over a pedestal, and the draped image (shadow) is projected onto a top plate above the pedestal. A paper ring having the same diameter as the fabric circle is used to trace the drape shadow. The traced-shadow part of the paper ring is cut out and weighed. The ratio of the weight of the traced-shadow paper to the weight of the original paper ring is defined as the drape coefficient; the lower the drape coefficient, the better the fabric drapability. Dynamic fabric drape images have been videotaped using a digital video camera. Fabric structural, material, and contact information was also entered into the database.
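The Cusick drape coefficient described above is a simple weight ratio, which can be sketched directly (the function name is ours):

```python
def drape_coefficient(shadow_paper_weight, full_ring_weight):
    """Cusick drape coefficient as described in the text: the weight
    of the paper ring cut to the traced drape shadow divided by the
    weight of the full paper ring. Lower values indicate better
    fabric drapability."""
    if full_ring_weight <= 0:
        raise ValueError("ring weight must be positive")
    return shadow_paper_weight / full_ring_weight
```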
Software System Architecture

The software system for the fabric database is a Web-based application written in Java. As shown
in Figure 2, the system consists of three major components: a graphical user interface, an intelligent search engine, and data mining and learning algorithms.

Figure 2. Database system architecture

A Web page has been designed and used as the graphical user interface; it is located on a school Web server (http://www.textilelab.huec.lsu.edu). The intelligent search engine supports several types of user queries. The simplest type of query is a search request that specifies particular values of fabric mechanical properties. For this type of query, a straightforward search over all measured fabric properties, with a certain tolerance on the matching accuracy, can be performed. Matched fabrics are displayed as the answer to the search, and clients can then pull out information about fabric manufacturers and fabric prices according to their needs. Fabric mechanical properties such as extension, bending, friction, and so forth are closely related to fabric drapability and tailorability. Such a correlation typically takes the mechanical properties as independent variables and models drapability or tailorability as a function of them. Therefore, a key problem is to discover such correlation patterns between fabric properties and draping or processing performance using data classification techniques. The obtained correlation patterns can then be used to predict the drapability or tailorability of new fabric products or customer fabrics, for example, whether a fabric drapes greatly, moderately, or slightly. In this research, a method of fuzzy linear clustering is applied for data mining and pattern recognition in the established database. Algorithmic details for this method can be found in the related literature (Bezdek, 1980; J. Chen et al., 2002).
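Fuzzy linear clustering, as cited here, alternates a fuzzy C-means-style membership update with a membership-weighted least-squares refit of one regression line per cluster. The following is a minimal single-predictor sketch of the idea, not the cited implementation; with one cluster it reduces to ordinary least squares:

```python
import random

def fuzzy_linear_clustering(xs, ys, c=2, m=2.0, iters=20):
    """Sketch of fuzzy c-lines: each cluster i is a line y = a_i + b_i*x.
    Alternates (1) a fuzzy C-means membership update based on the absolute
    residual of each point to each line and (2) a membership-weighted
    least-squares refit of each line. Returns fitted lines and memberships."""
    random.seed(0)
    n = len(xs)
    lines = [(random.random(), random.random()) for _ in range(c)]
    u = [[1.0 / c] * c for _ in range(n)]
    for _ in range(iters):
        # residual distance of each point to each cluster line
        d = [[abs(ys[j] - (a + b * xs[j])) + 1e-9 for (a, b) in lines]
             for j in range(n)]
        # standard fuzzy C-means membership update
        u = [[1.0 / sum((d[j][i] / d[j][k]) ** (2.0 / (m - 1.0))
                        for k in range(c)) for i in range(c)]
             for j in range(n)]
        lines = []
        for i in range(c):
            # membership-weighted least-squares refit of line i
            w = [u[j][i] ** m for j in range(n)]
            sw = sum(w)
            sx = sum(wj * x for wj, x in zip(w, xs))
            sy = sum(wj * y for wj, y in zip(w, ys))
            sxx = sum(wj * x * x for wj, x in zip(w, xs))
            sxy = sum(wj * x * y for wj, x, y in zip(w, xs, ys))
            b = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
            a = (sy - b * sx) / sw
            lines.append((a, b))
    return lines, u
```

For drape prediction, a new fabric's coefficient would then be estimated from the fitted cluster lines weighted by its memberships; the hybrid with K-nearest neighbor mentioned in the abstract refines this estimate further.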
Search Options

The present database provides three types of search options: accurate search, criteria search, and intelligent search. The accurate search helps users find a fabric with the specific mechanical properties they require. The search result can be a perfect match to the user's input, or can allow a 10% to 30% variation from it. This search method is useful when users want to pinpoint a particular fabric whose mechanical properties need to be strictly controlled. For example, assume that a garment maker has completed a batch of men's shirts, and that the mechanical properties of the shirting fabric used have been measured with the Kawabata instruments and stored as a fingerprint in a computer system. The next time the company receives an order for men's shirts, fabric purchasing staff can search for a new fabric supplier using the company's filed fabric fingerprint. Figure 3 illustrates a sample of the accurate search. The criteria search allows users to search for fabrics by different categories, such as end uses, fiber types, fabric structures, or manufacturers. In the present
Figure 3. User’s interface for accurate search
state, the database provides a criteria search for end uses only. The intelligent search is based on search engines that run special code for data mining and pattern learning in response to clients' needs for evaluating fabric quality features, such as making-up processability (tailorability), hand and drape, and durability. In the present work, an intelligent search engine for predicting the fabric drape coefficient was developed. Each time a search is executed, the search engine runs five times and outputs the mean value of the predictions. Overall accuracy for the drape prediction is about 88% on test data.
Fabric Data Display

The database has a Web page that allows users to browse the fabric bank by selecting fabric IDs and fabric properties. Once a fabric search is completed, matched fabrics are listed by fabric ID. Fabric data can be displayed by clicking any individual
Figure 4. Fabric data display
fabric ID (Figure 4). A dynamic drape image of a selected fabric can be viewed by clicking Click to View Video. The video runs 5 seconds of dynamic drape images for each fabric. This function of the database provides a visual tool for apparel designers and fabric producers to assess real fabric drape appearance.
Fuzzy Clustering for Predicting Fabric Drape/Tailorability

The Problem and Fuzzy Linear Clustering Method

The problem addressed in this section is the following. Suppose we know the major mechanical properties (such as shear, tensile, etc.) of a fabric. Can we predict from these properties whether this fabric will drape heavily or not? Similarly, can we predict whether a fabric is easy to sew
(tailor) or not, given knowledge of the fabric's mechanical properties? Thus, the problem to be addressed is the estimation or prediction of one fabric property (draping and/or tailorability) from other related properties. It is quite natural to desire a method for systematically estimating fabric tailorability. Fabric tailorability depends on many properties of the fabric, and there is no fixed model for automatically estimating tailorability from fabric mechanical properties. Typically, tailorability is determined by laboratory test sewing, a time- and labor-consuming process. The capability to automatically and reliably estimate fabric tailorability would save garment manufacturers time and money in selecting suitable fabrics. What about the usefulness of estimating the draping property of a fabric? Although drape coefficient data can be obtained for each fabric with the Cusick drape tester, this is a tedious process. It would therefore be desirable to predict the drape property of a fabric from its other physical properties, so that we would probably not have to physically measure the drape coefficient of every fabric. That is why we conducted the prediction experiments with fuzzy linear clustering in this research. The main principle in addressing the prediction problem is to find a reliable prediction method that is tolerant of noisy data yet efficient in computation time. Moreover, we would prefer a prediction method that is easily interpreted for human understanding. The approach taken in our work is to use fuzzy linear clustering (FLC) combined with the K-nearest neighbor method for fabric drape and tailorability prediction. The FLC method of Hathaway and Bezdek (1993) is a generalization of the fuzzy C-means of Bezdek (1980). FLC finds fuzzy clusters by iteratively optimizing an objective function.
Given a set of sample data points D = {p_i = &lt;x_i, y_i&gt;: 1 ≤ i ≤ n} and the desired number of clusters C (≥ 2), the algorithm produces C fuzzy clusters A_k, 1 ≤ k ≤ C, and the membership values μ_ki = μ_k(p_i) for each point p_i and cluster A_k. Here, each given data point is of the form p_i = &lt;x_i, y_i&gt;, where x_i = &lt;x_i1, x_i2, ..., x_is&gt; is a real-valued vector of dimension s ≥ 1, and y_i is a real number. Each fuzzy cluster A_k (1 ≤ k ≤ C) is characterized by a linear function g_k(x) = a_k0 + a_k1 x_1 + ... + a_ks x_s. The algorithm finds the membership values μ_ki and the coefficients of the linear functions g_k(x) such that the objective function J_m is minimized:

J_m = ∑_{k=1}^{C} ∑_{i=1}^{n} (μ_ki)^m [y_i − g_k(x_i)]².    (1)
The constraints ∑_{k=1}^{C} μ_ki = 1 (for each i) and μ_ki ≥ 0 apply here as well. We call the resulting clusters fuzzy linear clusters because of the linearity of the functions g_k(x). The computation of the fuzzy linear clusters proceeds from initial randomly generated membership values. The algorithm iteratively computes the linear coefficients by linear regression from the current membership values, and then recomputes the membership values from the current linear coefficients according to the following equation:

μ_ki = [(y_i − g_k(x_i))²]^{−1/(m−1)} / ∑_{j=1}^{C} [(y_i − g_j(x_i))²]^{−1/(m−1)},    (2)

where 1 ≤ k ≤ C and 1 ≤ i ≤ n. The algorithm terminates when the maximum change of the membership values between consecutive iterations falls below a given threshold.
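The training loop just described (alternate a weighted linear regression per cluster with the membership update of Equation 2) can be sketched as follows. This is an illustrative Python/NumPy sketch, not the authors' implementation; all function and variable names are ours.

```python
import numpy as np

def fuzzy_linear_clustering(X, y, C=3, m=2.0, tol=1e-4, max_iter=100, seed=0):
    """Sketch of FLC training: alternate weighted least squares (one linear
    model g_k per cluster, weights u_ki^m) with the membership update of
    Equation 2, starting from random memberships."""
    rng = np.random.default_rng(seed)
    n, s = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])          # bias column gives a_k0
    U = rng.random((C, n))
    U /= U.sum(axis=0)                            # memberships sum to 1 per point
    A = np.zeros((C, s + 1))                      # linear coefficients a_k
    for _ in range(max_iter):
        for k in range(C):                        # regression step: minimize
            sw = np.sqrt(U[k] ** m)               #   sum_i u_ki^m (y_i - g_k(x_i))^2
            A[k], *_ = np.linalg.lstsq(Xb * sw[:, None], y * sw, rcond=None)
        E = np.maximum((y - A @ Xb.T) ** 2, 1e-12)  # squared residuals, shape (C, n)
        D = E ** (-1.0 / (m - 1.0))               # Equation 2, before normalizing
        U_new = D / D.sum(axis=0)
        if np.abs(U_new - U).max() < tol:         # stop when memberships settle
            return A, U_new
        U = U_new
    return A, U
```

The returned rows of A are the coefficient vectors of the cluster functions g_k, and U holds the final memberships μ_ki.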
FLC Plus K-Nearest Neighbor Approach

Application of fuzzy linear clustering to data prediction typically involves two steps: the training step and the prediction step. In the training step, we apply the fuzzy linear clustering algorithm (with the number of clusters C specified) to a training data set with n data points. This produces the cluster centers g_k (in the form of linear equations) for 1 ≤ k ≤ C and the fuzzy membership values
μ_ki for 1 ≤ k ≤ C and 1 ≤ i ≤ n. The training step will also generate a predicted y value for each training data point &lt;x_i, y_i&gt;. In the prediction step, given only the x part of a data point (x = &lt;x_1, ..., x_s&gt;), the prediction algorithm produces an estimated y part. The y value is computed by the following steps.

1. First, for the given data x_0, find the nearest neighbor &lt;x_i, y_i&gt; of x_0 in the training data, and use μ_ki, the membership value of &lt;x_i, y_i&gt; in cluster k, as the membership value μ_k0 of x_0 for each 1 ≤ k ≤ C.
2. Next, compute g_k(x_0) for each linear equation g_k.
3. Finally, combine the results of all clusters by the following equation:

y = ∑_{k=1}^{C} μ_k0 g_k(x_0).    (3)
Our initial experiments have confirmed the viability of FLC for fabric drape coefficient prediction. Subsequently, we develop a hybrid method combining the K-nearest neighbor approach with fuzzy linear clustering for the drape prediction task. The hybrid method differs from the initial method described above only in the prediction step. Instead of using only one nearest neighbor to obtain the membership values of a new data point x0 to the fuzzy clusters, multiple nearest neighbors are used for this purpose. Namely, to estimate the membership value µk0 for x0 in cluster Ak, we find the K-nearest neighbors of x0 from the training data set, and then just set µk0 to be the average of membership values of these K-nearest neighbors to cluster Ak. The rest of the prediction step proceeds in the same way.
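The hybrid prediction step can be sketched as follows, assuming the coefficient matrix A (bias term first) and training memberships U come from a prior FLC fit; this is an illustrative sketch with our own names, and K = 1 recovers the initial single-nearest-neighbor method.

```python
import numpy as np

def predict_drape(x0, X_train, U, A, K=3):
    """Hybrid prediction step: average the fuzzy memberships of the K
    nearest training points to estimate mu_k0, then blend the cluster
    regressions as in Equation 3."""
    d = np.linalg.norm(X_train - x0, axis=1)
    nn = np.argsort(d)[:K]                    # indices of K nearest neighbors
    mu0 = U[:, nn].mean(axis=1)               # averaged membership per cluster
    g = A @ np.concatenate(([1.0], x0))       # g_k(x0), bias coefficient first
    return float(mu0 @ g)                     # y = sum_k mu_k0 * g_k(x0)
```

Averaging over several neighbors smooths the membership estimate for x_0, which is what lifted accuracy from about 88% to about 92% in the experiments below.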
Experiments and Results

A preliminary result on using FLC for drape prediction from fabric mechanical properties with a relatively small (about 100 points) sample data set was reported in J. Chen et al. (2002). The work presented here is based on a larger fabric data set (300+ data points). Here, each data point is of the form &lt;x, y&gt;, where x = &lt;x_1, ..., x_16&gt; is a vector of dimension 16 corresponding to 16 mechanical properties of a fabric, and y is the fabric's drape coefficient. The objective is to predict a fabric's drape coefficient from its mechanical property values. We also experimented with tailorability prediction by FLC on a smaller (about 100 points) data set. After training on fabric drape data, we observed the prediction accuracies of the learned clusters on both the training data and a separate testing (drape) data set. As expected, the prediction accuracy on training data is quite high (nearly 95% when we use three clusters). Increasing the number of clusters is observed to increase the prediction accuracy on the training data. Experiments using the learned fuzzy model to predict drape values for unseen test data produced encouraging results: without any further fine-tuning, we achieved a prediction accuracy of about 88% on test data. Subsequently, the hybrid method combining the K-nearest neighbor method with FLC was investigated, which raised the prediction accuracy to about 92%. In the initial prediction study (with a single nearest neighbor used to estimate fuzzy membership), two experiments were conducted. In one experiment, we observed the prediction accuracies on testing data (disjoint from the training data) for various numbers of clusters used in training. We observed that using too many clusters in training does not help to reduce prediction error on unseen test data. As can be seen in Table 1, prediction accuracy on testing data initially increases with the number of clusters, but then starts getting worse. This suggests that using too many clusters for training is likely to overfit the training data: the clusters obtained lose generalization power, which causes the drop in prediction accuracy.
In the other experiment, we tried different sizes for the training and testing data split, and observed the resulting prediction accuracy on testing data. Table 2 shows the prediction accuracy in connection with training data size (we have a total of 300 data points). It is obvious that
the prediction accuracy initially improves with the growth of the training data size, but after the training data size reaches 150 (half of the total data), the prediction accuracy deteriorates. Again, here we observe the effect of overfitting. The lesson is that training with too many data points is not necessarily a blessing: The models discovered may not generalize well. An intuitive explanation is that when the training data set is too big, the fuzzy clusters may overfit the data by memorizing
the peculiarities of data, and thus they do not give good prediction results on unseen data. We have implemented the hybrid method combining the K-nearest neighbor method with FLC for fabric drape prediction. We find that the hybrid approach improves prediction accuracy further to about 92%. Tables 3 and 4 show the results of two experiments with different values for K and C. In these two experiments, the total data set size is 183 (rather than 300). Again, one can in some sense
Table 1. Prediction accuracy vs. number of clusters (training data size is 200, testing data size is 100)

number of clusters     3      4      5      10     20     30
prediction accuracy    84.3   86.6   83.1   83.3   81.7   82.2
Table 2. Prediction accuracy vs. size of training data (number of clusters C = 4)

training size          50     100    150    200    250
prediction accuracy    79.0   82.1   87.3   82.7   83.1
Table 3. Results using hybrid method with K-nearest neighbors (K = 3 and C = 6)

training size          90     100    110    120    130    140
prediction accuracy    91.3   91.7   92.2   92.6   92.5   91.9
Table 4. Results using hybrid method with K-nearest neighbors (K = 5 and C = 4)

training size          90     100    110    120    130    140
prediction accuracy    90.1   90.7   91.3   91.9   91.8   91.3
observe that a bigger training data size may not always produce better performance, and overfitting must be avoided. For tailorability prediction, we performed experiments using a data set of about 100 points. The data came from the tailorability test laboratory at the University of Louisiana at Lafayette, where the fabric tailorability scores were obtained through actual sewing tests of the fabrics. The FLC method is able to produce a prediction accuracy of 92% with a 75-25 split between training data and testing data.
Summary for Fuzzy Clustering for Predicting Fabric Drape and Tailorability

Information technology is bringing tremendous opportunities to the textile and garment industry. Our online database and its intelligent search engine provide cloth designers, manufacturers, and retailers with a useful and convenient tool for quickly finding suitable fabric materials that best fit their needs. The flexible queries supported by our system enhance its usability. Data mining methods such as fuzzy clustering are applied effectively to discover patterns relating fabric properties. The system can be seen as a first step toward a comprehensive online business exchange system for cloth designers, manufacturers, retailers, and fabric manufacturers. Fuzzy linear clustering appears to be quite effective for predicting fabric appearance from fabric physical properties. The experiments indicate a promising application of the fuzzy approach to the discovery of patterns relating fabric properties. Moreover, the experiments show that we need to guard against overfitting in applying fuzzy linear clustering: trying to fit the training data with too many clusters, or training with too many data points, may cause a loss of generalization power. Our study also indicates that the hybrid method combining the K-nearest neighbor method with fuzzy linear clustering produces superior prediction accuracy.
Besides further experiments and validation of the fuzzy linear clustering method, we see several ways to extend our work of applying the fuzzy approach to the search engine on the fabric database. These include the application of fuzzy linear clustering to discover new patterns among fabric properties, the use of the fuzzy C-means algorithm for fabric classification and query answering, and the development of hybrid approaches combining fuzzy methods with decision-tree learning to predict fabric appearance and tailorability. In a recent book on fuzzy databases (Galindo, Urrutia, & Piattini, 2006), the fuzzy clustering method has been combined with the fuzzy query language fSQL for an interesting application. We would like to compare our fuzzy clustering and query-answering method with the fSQL approach. We are investigating hybrid methods that combine fuzzy clustering with decision-tree learning for predicting fabric appearance and tailorability. The idea is to first apply the fuzzy C-means algorithm to discretize the numerical fabric property values, and then construct a decision tree from the fabric property data. The decision tree can be used for the prediction of fabric appearance or tailorability, and will be incorporated into the search engine.
Fuzzy Comprehensive Method for Evaluating Apparel Comfort

Textile comfort performance had long been considered impossible to describe quantitatively until a new approach, the fuzzy comprehensive evaluation technique, was proposed (Rong & Slater, 1992). It has since become a popular and widely used fuzzy mathematical technique applied in a variety of areas. For example, it has been applied as a new approach to the evaluation of fabric hand and performance in textile engineering (Raheel & Liu, 1991), and to the objective evaluation of fabric softness (Y. Chen et al., 2000). Statistical methods of ANOVA (analysis
of variance) and factor analysis were suggested to determine the fuzzy factor subset A~ from the fabric mechanical property data measured with the KES-FB instruments. Although it is difficult to define fabric comfort performance precisely, the fuzzy comprehensive evaluation technique introduced here can be considered a comprehensive approach to evaluating textile comfort performance: it takes into account the measured values of the relevant mechanical or physical attributes of the material, the durability (the changes in these measured values) over its serviceable lifetime, and consumers' preferences for specific end uses. This approach provides an alternative to purely instrumental measurement and makes it possible to assess fabric comfort behavior objectively. Below, we first introduce the mathematical principles of fuzzy comprehensive evaluation by presenting four evaluation models and discussing how to select among them. We then discuss the analytical procedure and the evaluation results for three types of female ready-to-wear casual summer clothing (blouses, slacks, and underpants), based on the variations of their physical properties under various abrasion times and on the relative importance of physical-comfort-related factors (weight, thickness, bending length, and air permeability) obtained from questionnaires answered by female university students.
Mathematical Principles

Fuzzy comprehensive evaluation is a type of conversion operation among fuzzy sets. A general form, B~ = A~ * R~, expresses a process that converts the fuzzy factor subset (vector) A~ into the fuzzy grade subset (vector) B~ through a fuzzy relation R~. Given a particular computational model for A~ * R~, or a specific conversion relation R~ (i.e., membership
function), one can get different mathematical evaluation models, each having a different essence and therefore dramatically different grading results. Hence, both the essence of each fuzzy comprehensive evaluation model and the determination of the fuzzy membership function have to be understood thoroughly so as to accurately apply these models.
Essence of Fuzzy Comprehensive Evaluation Models

For the fuzzy comprehensive evaluation operator B~ = A~ * R~, the current models are as follows (Rong & Slater, 1992; Wang, 1984):

I.   Model (∧, ∨):   b_j = ∨_{i=1}^{m} (a_i ∧ r_ij)
II.  Model (·, ∨):   b_j = ∨_{i=1}^{m} a_i r_ij
III. Model (·, ⊕):   b_j = min{1, ∑_{i=1}^{m} a_i r_ij}
IV.  Model (∧, ⊕):   b_j = min{1, ∑_{i=1}^{m} min(a_i, r_ij)}    (4)

in which a_i is a member of A~ with i = 1, ..., m.

For Model I, Model (∧, ∨), the operator is

b_j = ∨_{i=1}^{m} (a_i ∧ r_ij),    (5)

where ∧ and ∨ denote the minimum (min) and maximum (max) operations respectively; namely,

b_j = max[min(a_1, r_1j), min(a_2, r_2j), ..., min(a_m, r_mj)].    (6)

In this model, the grade of membership r_ij of a single element v_i to a corresponding evaluation grade u_j is amended as

r*_ij = a_i * r_ij = a_i ∧ r_ij = min(a_i, r_ij).    (7)

This clearly indicates that a_i is the upper limit of r_ij (j = 1, 2, ..., n) in considering multielement
evaluation; namely, the grade of membership of element v_i to any evaluation grade u_j (j = 1, 2, ..., n) is restricted to be no greater than a_i. Obviously, the ∨ in the operator accounts only for the most important element (that of maximum r*_ij) with regard to every evaluation grade u_j, while neglecting the contributions of all other elements. This is a kind of major-element-dominating evaluation. Hence, this model is only suitable when there are not many elements (m is small). It is noteworthy that A~ should not be manipulated as a weight distribution among the elements, and we should not force ∑a_i = 1 (i = 1, 2, ..., m), since in that case the small weight values a_i would mask the contributions of the various elements.

For Model II, Model (·, ∨), the operator is

b_j = ∨_{i=1}^{m} a_i r_ij.

The difference between Model (·, ∨) and Model (∧, ∨) is that r*_ij = a_i ∧ r_ij is replaced by r*_ij = a_i r_ij: r_ij is amended by being multiplied by a less-than-1 factor a_i instead of being capped by an upper limit. Although the contributions of every element are taken into account, A~ is by no means a weighting-factor vector, and it is not necessary that ∑a_i = 1 (i = 1, 2, ..., m).

For Model III, Model (·, ⊕), the operator is

b_j = (a_1 r_1j) ⊕ (a_2 r_2j) ⊕ ... ⊕ (a_m r_mj),    (10)

where the symbol ⊕ denotes the bounded sum, defined as a ⊕ b = min(1, a + b). In this model, b_j is calculated as the bounded sum of all r*_ij = a_i r_ij instead of their maximum as in Model (·, ∨). In this way, the contributions of all elements v_i (i = 1, 2, ..., m) are taken into account. The entries of the factor set A~ = (a_1, a_2, ..., a_m) are weighting factors whose sum obeys ∑a_i = 1 (i = 1, 2, ..., m). Since then ∑_{i=1}^{m} a_i r_ij ≤ 1, the operator becomes

b_j = (a_1 r_1j) + (a_2 r_2j) + ... + (a_m r_mj),    (11)

where ∑a_i = 1 (i = 1, 2, ..., m). In this case, the model can also be denoted as Model (·, +).

For Model IV, Model (∧, ⊕), the operator is

b_j = (a_1 ∧ r_1j) ⊕ (a_2 ∧ r_2j) ⊕ ... ⊕ (a_m ∧ r_mj).    (12)

Similar to Model (∧, ∨), this model restricts the upper limit of r_ij to a_i, namely r*_ij = a_i ∧ r_ij; however, it calculates b_j as the bounded sum of all r*_ij. This model accounts for the contributions of all elements, and does not require ∑a_i = 1 (i = 1, 2, ..., m).
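The four composition models of Equation 4 can be written compactly; the following Python/NumPy sketch is ours and is only illustrative of the operators, not of any particular published implementation.

```python
import numpy as np

def fuzzy_evaluate(a, R, model="III"):
    """Compute B~ = A~ * R~ under the four composition models of Equation 4.
    a is the (m,) factor vector, R the (m, n) fuzzy relation; returns b (n,)."""
    a = np.asarray(a, dtype=float)[:, None]   # column vector for broadcasting
    R = np.asarray(R, dtype=float)
    if model == "I":                          # (min, max)
        return np.minimum(a, R).max(axis=0)
    if model == "II":                         # (product, max)
        return (a * R).max(axis=0)
    if model == "III":                        # (product, bounded sum)
        return np.minimum(1.0, (a * R).sum(axis=0))
    if model == "IV":                         # (min, bounded sum)
        return np.minimum(1.0, np.minimum(a, R).sum(axis=0))
    raise ValueError("model must be one of I, II, III, IV")
```

Evaluating the same a and R under all four models also lets one check the ordering theorem of Equation 13 numerically, elementwise on the resulting b vectors.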
Selection of Fuzzy Comprehensive Evaluation Models

Different people in different situations may hold different views on evaluating a set of objects. Sometimes only the major element is important (major-element dominant), whereas sometimes the condition of all elements matters regardless of the value of the major element. The above models can reflect these different views of object evaluation. For one object under the same A~ and R~, the evaluation grade sets from the above models obey the following theorem (Y. Y. Chen, Liu, & Wang, 1983):

B(∧, ⊕) ≥ B(·, ⊕) ≥ B(·, ∨),
B(∧, ⊕) ≥ B(∧, ∨) ≥ B(·, ∨).    (13)

This can be proven by comparing their operators, that is, for all a, b ∈ [0, 1], ab ≤ a ∧ b ≤ a ∨ b ≤ a ⊕ b. Therefore, Model (·, ∨) and Model (∧, ⊕) account more for the contributions of nonmajor elements than Model (∧, ∨), while still allowing for the dominant effect of the major element. Hence, Model (·, ∨) and Model (∧, ⊕) are advisable when Model (∧, ∨) fails or when more account of the nonmajor elements needs to be taken. Model (·, ⊕) is essentially a weighted-average model, accounting for the contributions of all elements according to their weights, and is therefore more suitable to cases where the whole set of elements is of interest.
In practical use, the relative comparison of the entry values of an evaluation grade subset B~ is more meaningful than their absolute values. To compare or evaluate a set of objects, one can first calculate B~ using Model (∧, ∨) and Model (·, ⊕), and then calculate B~ using either Model (·, ∨) or Model (∧, ⊕). According to the above theorem (Equation 13), if the values of B(∧, ∨) and B(·, ⊕) are rather small, Model (·, ⊕) is more advisable; otherwise, Model (·, ∨). For the same set of objects, different evaluation results may be obtained using different models. This is comparable to the fact that different conclusions may be drawn about the quality order of a set of objects when they are viewed from different angles. To reconcile the results from different models, a second-stage comprehensive evaluation can be carried out using Model (·, ∨). Multistage evaluation also allows one to analyze complex systems with different facets of objective measures (Rong & Slater, 1992).
Analytical Procedure and Results

The performance of textiles depends not only on their initial properties but also, more importantly, on how they behave in service. Here, we illustrate the analytical procedure for, and results of, applying the fuzzy comprehensive evaluation models to the objective evaluation of textile performance based upon initial properties in conjunction with serviceability and durability, as well as end users' preferences. The analyses are based on two sets of experimental data: the changes occurring during different stages of incremental abrasion testing, and a survey of consumer comfort preferences. The first data set contains the variations in thickness, weight per unit area, softness and stiffness (reflected in bending length), and air permeability of plain and twill cotton, plain and twill wool, plain silk, and plain polyester/cotton (65/35) blended fabrics, which were subjected to abrasion times of 0, 2, 3, 4, 5, 6, 7, and 8 minutes in an accelerator. The second data set contains the relative importance of physical-comfort-related
factors (weight, thickness, bending length, and air permeability) of three types of female ready-to-wear casual summer clothing (blouses, slacks, and underpants), recorded in questionnaires answered by female university students. In addition, multistage evaluation can be used to analyze a situation in which there are different facets of an objective set of measures, while avoiding the difficulty of determining the grades of membership of the fuzzy subset A~, a step that would be overly complicated by the presence of many different components or effects in each element when evaluating an article (a fabric, say). The grades of membership m(x) of thickness, weight, and bending length are calculated according to the formula below:

m(x) = (x_{i+1} − x) / d,   x_i &lt; x ≤ x_{i+1}   (i = 1, 2, ..., 6),    (14)

where x represents the measured value of the property under consideration. Conversely, since an increase in air permeability is considered desirable whereas a decrease in the other properties is advantageous, the formula for the membership function of air permeability is

m(x) = (x − x_{i+1}) / d,   x_{i+1} &lt; x ≤ x_i   (i = 1, 2, ..., 6).    (15)
In all cases, the property is divided into six grades, and d, the difference between successive grade intervals, is equal to (x_max − x_min)/5, where x_max and x_min are the maximum and minimum values of each property item, respectively. The grades of membership of each property at the different abrasion stages can then be calculated for each fabric. Using Model III, that is, Model (·, ⊕),

b_j = min{1, ∑_{i=1}^{m} a_i r_ij},
the evaluation result for the initial property of a plain cotton fabric used for a blouse is B~ = (0.10, 0.15, 0.22, 0.04, 0.15, 0.35).
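One plausible reading of the six-grade membership scheme of Equations 14-15 can be sketched as follows; the function name, the ascending grade grid, and the boundary handling are our own assumptions, made only to illustrate the piecewise-linear grading.

```python
def grade_membership(x, xmin, xmax, decrease_good=True):
    """Sketch of the six-grade membership of Equations 14-15.
    d = (xmax - xmin)/5 and the grade boundaries are x_1 .. x_6.
    For properties where a decrease is desirable, membership falls
    linearly within each interval (Equation 14); for air permeability,
    where an increase is desirable, the mirrored form applies (Equation 15)."""
    d = (xmax - xmin) / 5.0
    grid = [xmin + i * d for i in range(6)]       # boundaries x_1 .. x_6
    x = min(max(x, xmin), xmax)                   # clamp to the measured range
    for i in range(5):
        if x <= grid[i + 1]:                      # interval (x_i, x_{i+1}]
            if decrease_good:
                return (grid[i + 1] - x) / d      # Equation 14
            return (x - grid[i]) / d              # Equation 15 (mirrored)
    return 0.0
```

Computing these memberships for each property and abrasion stage fills the relation R~, which is then composed with the preference weights A~ (e.g., under Model III) to give B~.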
Similarly, results can be obtained for this fabric at the other levels of abrasion and for the other end uses, as well as the corresponding results for the other fabrics. If an evaluation result expressed as a single number is desired, it can usually be obtained by means of a weighted-average calculation, for example:

a = ∑_{j=1}^{m} a_j b_j^k / ∑_{j=1}^{m} b_j^k,    (16)
where aj is the individual value of a (just as mj is the value of m for each term in Equation 14 above) ranging from 1 to 6. The power component k can be determined for each specific case, though it is normally suggested that k should be set at 2. Assuming the importance of all properties at each of the eight abrasion times is equal and Model III is used, the second-stage evaluation can be simply carried out. The final single-value results of fuzzy comprehensive evaluation for all fabrics in three end uses are derived, as listed in Table 5 (Rong & Slater, 1992). By comparing the final results, we can see that plain wool is preferred to twill wool fabric for blouses, slacks, and underpants since the former has a higher comprehensive-evaluation value than the latter; the same is true for plain and twill cotton fabrics. Silk is the best among these six results for the three end uses since it has the highest evaluation.
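The weighted-average defuzzification of Equation 16 is short enough to sketch directly; this is our own illustrative code, with grade values a_j = 1..6 and k = 2 as normally suggested.

```python
def single_score(b, k=2):
    """Single-number evaluation per Equation 16: grade values a_j = 1..6
    weighted by b_j^k, normalized by the sum of the b_j^k."""
    num = sum((j + 1) * bj ** k for j, bj in enumerate(b))
    den = sum(bj ** k for bj in b)
    return num / den
```

Applied to a grade vector B~ of length six, the score always lands between 1 and 6; applying it to the B~ vectors obtained at each abrasion stage, followed by the second-stage evaluation, yields single values comparable to those in Table 5.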
Summary for Fuzzy Comprehensive Method for Evaluating Apparel Comfort

The fuzzy comprehensive evaluation technique is demonstrated as a new way of dealing with textile comfort and performance, two widely recognized phenomena that, being vague and affected by many attributes, are usually assessed subjectively. The procedure and results show that it is possible to establish a procedure for the assessment and selection of a fabric for a specific end use that takes account of the consumer's preference together with considerations of durability. Although the results in this example may not be universally
representative, of course, because of the limited availability of experimental data, they are still able to illustrate how the technique can provide useful information to textile and clothing manufacturers. Once the process is established, it should be possible to develop an intelligent database system for comprehensively evaluating textile fabric comfort and performance.
Fuzzy Neural Network for Grading Fabric Softness

Application Scenario

In the manufacture of fabric materials, a soft hand is always a critical priority for satisfying customers. Even denim manufacturers are pursuing a soft touch for jeans by improving weaving and finishing, so as to create a denim lifestyle: casual and comfortable (Rudie, 1996). Nonwoven fabrics, particularly spunbond and air-blown nonwovens, are more like papery sheeting materials because of their specific structure: a randomly laid and bonded fiber web that is distinct from the yarn-interlacing structure of woven fabrics. Therefore, nonwoven fabrics usually lack a soft fabric hand. The improvement of softness for spunbond nonwoven fabrics has become strategically important in many end-use applications. Many approaches have been proposed for improving nonwoven softness, including the use of chemical softeners, enzymatic treatment (mainly for natural fiber nonwovens), and molecular modifications. A question raised by nonwoven manufacturers is how to objectively evaluate the improvement of nonwoven softness after the use of new raw materials, adjustment of processing parameters, or application of new finishing methods. This means that we need an instrumental method able to sensitively detect any incremental progress in fabric softness. Previous research with this aim was primarily focused on tissue and paper towel products. An example was the use of a mechanical stylus scanning method to measure the surface
Table 5. Single output of fuzzy comprehensive evaluation for three fabric end uses

Fabric Type              Blouse   Slacks   Underpants
Plain cotton             3.42     3.54     3.42
Twill cotton             1.56     1.46     1.58
Plain wool               4.14     4.14     4.14
Twill wool               2.96     2.80     3.00
Plain silk               5.04     5.19     5.01
Plain polyester/cotton   3.74     3.68     3.77

property of tissues (Rust, Keadle, Allen, & Barker, 1994). This research proposed a frequency index obtained through the fast Fourier transform (FFT) to describe tactile sensitivity. Other researchers (Kim, Shalev, & Barker, 1994) studied the softness of paper towels using the KES-FB instruments. They developed a linear regression model to predict the so-called softness intensity. Both of these studies rely on human subjective input in grading the softness of tissue or paper towel products. Objective hand measurements of nonwoven fabrics were investigated in Japan (Kawabata, Niwa, & Wang, 1994). This research applied the same technique that Professor Kawabata and his coworkers developed for evaluating the hand of wool and wool-rich suiting fabrics. Recently, a neural network technique was used for the online prediction of tissue softness (Sarimveis & Retsina, 2001). An inferential sensor operated in real time together with a distributed control system. An application of the neural network technique for predicting fabric end uses was also reported (Y. Chen, Zhao, & Collier, 2001). In this case study, an objective method of fuzzy neural networks for evaluating nonwoven softness is presented. The purpose of this study is to investigate a practical approach to grading the softness of nonwoven fabrics based on nonwoven mechanical properties instead of human hand judgment. Spunbond polypropylene nonwovens are targeted in this study because of their diverse end uses, ranging from personal care to home interiors. The research method involves two techniques: the KES-FB instruments and neural network
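The single outputs in Table 5 are produced by a fuzzy comprehensive evaluation that combines a weight vector for fabric properties with a fuzzy relation matrix. As a rough sketch of how a model of the (•, ⊕) family works (the weights and matrix values below are invented for illustration and lie on a [0,1] scale, unlike the grades in Table 5):

```python
def bounded_sum(values):
    """Bounded sum: a ⊕ b = min(1, a + b), extended over a sequence."""
    return min(1.0, sum(values))

def evaluate(weights, relation):
    """Model (•, ⊕): b_j = ⊕_i (a_i · r_ij) for each end-use column j,
    where a_i weights property i and r_ij is its membership for end use j."""
    cols = len(relation[0])
    return [bounded_sum(weights[i] * relation[i][j] for i in range(len(weights)))
            for j in range(cols)]

# Three hypothetical fabric properties weighted 0.5/0.3/0.2,
# graded against two hypothetical end uses (columns).
weights = [0.5, 0.3, 0.2]
relation = [[0.9, 0.4],
            [0.6, 0.8],
            [0.7, 0.5]]
print(evaluate(weights, relation))  # one comprehensive value per end use
```

Here • is ordinary multiplication and ⊕ is the bounded sum min(1, a + b), so each output saturates at 1.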
computing. The KES-FB instruments are a de facto standard for measuring fabric mechanical properties in academia and industry. The neural network computing technique is among the most promising approaches to solving real-life fuzzy problems because of its power in pattern recognition, particularly in learning highly nonlinear relationships.
Approach

Fabric mechanical properties involve four basic mechanical deformations: extension, shear, bending, and compression. Fabric softness or hand is believed to be a complex function of these mechanical properties, plus surface friction and roughness (Hearle, 1993a, 1993b). The KES-FB instruments are well suited to measuring fabric basic mechanical properties and can provide 16 instrumental parameters (Table 6). A computerized data acquisition system recently developed at Louisiana State University is used to record and calculate the KES-FB data automatically (Y. Chen, Zhao, & Turner, 2001). According to industry input, two types of polypropylene spunbond nonwovens were targeted. One (Target 2) had the best soft hand, with a desired output value of 1, and the other (Target 1) had the worst softness, with a desired output value of 0. From each of these targeted nonwoven fabrics, 15 specimens were prepared and tested using the KES-FB instruments. To establish a neural network model for predicting nonwoven softness, the obtained KES-FB data were imported to
the commercial software NeuroSolutions (Version 4 for PCs) for training (Principe, Euliano, & Lefebvre, 2000). This software features MS Excel compatibility and can easily run on a desktop PC with Excel-format data input. It provides a fuzzy logic neural network model, CANFIS, to enhance learning performance (Jang, Sun, & Mizutani, 1997). With integrated fuzzy rules (membership functions) as a preprocessor, the neural network can characterize inputs that are not very discrete and establish an efficient model quickly. Figure 5 illustrates the structure of this neuro-fuzzy network. The CANFIS model includes two types of fuzzy membership functions, the bell-shaped curve and the Gaussian-shaped curve. The number of membership functions assigned to each network input can also be selected (usually 2 to 4 for small- or medium-sized data sets). The fuzzy models applied in the CANFIS model are the Tsukamoto and Sugeno fuzzy models; we refer to Jang et al. for details. The configuration of the CANFIS model requires specifying the number of hidden layers, the type of membership function, the number of membership functions per input, and the type of fuzzy model. In this case study, two hidden layers, the bell membership function, and the Tsukamoto model were used.

Kawabata Parameter                                   Input Code
Bending Hysteresis 2HB (gf·cm²/cm)                   X9
Compressive Linearity                                X10
Compressive Energy (gf·cm/cm²)                       X11
Compressive Resilience (%)                           X12
Maximum Compressive Rate (%)                         X13
Mean Frictional Coefficient                          X14
Mean Deviation of Mean Frictional Coefficient        X15
Mean Surface Roughness (micron)                      X16

Six experimental spunbond nonwovens were selected and measured using the KES-FB instruments to form a test data set. These spunbond nonwovens were divided into two groups, Group 1 (four samples) and Group 3 (two samples), each representing a type of modified polypropylene. The softness of these six samples was graded by the established neural network model. These softness grades, with numerical values between 0 and 1, were defined as the softness index.
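As a rough illustration of the membership-function preprocessing used in CANFIS-style models described above, the generalized bell membership function can be sketched as follows (a minimal sketch; the parameter values and function names are our own, not those used in NeuroSolutions):

```python
def bell_mf(x, a, b, c):
    """Generalized bell membership function: 1 / (1 + |(x - c) / a|^(2b)).
    a sets the width, b the steepness of the shoulders, c the center."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

# Two membership functions on one normalized input, as when CANFIS is
# configured with 2 membership functions per input for a small data set:
# one centered on the "worst softness" target (0), one on the "best" (1).
low = lambda x: bell_mf(x, a=0.25, b=2, c=0.0)
high = lambda x: bell_mf(x, a=0.25, b=2, c=1.0)
```

A normalized KES-FB reading is thus mapped to fuzzy degrees of membership in each label before entering the network layers.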
Results and Discussion

The fuzzy logic neural network model was established by training on the training data set (Target 1 and Target 2). Figure 6 shows the learning curve for this model. To assess the performance of the established network model, the mean square error (MSE) was calculated (Principe et al., 2000). MSE is defined as

$$\mathrm{MSE} = \frac{\sum_{j=0}^{P}\sum_{i=0}^{N}\left(d_{ij} - y_{ij}\right)^{2}}{NP}, \qquad (17)$$
where P is the number of output processing elements, N is the number of exemplars (sample points) in the data set; yij is a network output for exemplar i at processing element j, and dij is the desired output for exemplar i at processing element j. In the present study, the final MSE value is 0.0027 for the training and 0.0096 for cross-validation. This means that the trained model is highly accurate.
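Equation 17 is straightforward to compute. The sketch below (our own illustration with made-up outputs, not the study's data) averages the squared errors over all exemplars and output processing elements:

```python
def mean_square_error(desired, outputs):
    """MSE over N exemplars and P output processing elements (Equation 17).
    desired[i][j] and outputs[i][j] index exemplar i, processing element j."""
    n = len(desired)        # number of exemplars
    p = len(desired[0])     # number of output processing elements
    total = sum((d - y) ** 2
                for d_row, y_row in zip(desired, outputs)
                for d, y in zip(d_row, y_row))
    return total / (n * p)

# Single-output network (P = 1), four exemplars: desired softness targets
# 0 or 1 versus hypothetical network outputs.
desired = [[0.0], [0.0], [1.0], [1.0]]
outputs = [[0.05], [0.10], [0.95], [0.92]]
print(mean_square_error(desired, outputs))
```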
Table 7 lists the grading results for the six tested samples. According to these softness indexes, the softness of the nonwoven samples in Group 1 is closer to that of Target 1, with lower softness index values ranging from 0.28 to 0.72. In contrast, the nonwoven samples in Group 3 (3.1 and 3.2) have a softness index very similar to that of Target 2. As a result, the softness index can be used for softness interpretation and comparison among different spunbond nonwoven fabrics because numerical grading values for fabric softness are available. This facilitates communication between manufacturers and consumers. For a graphical interpretation of the softness difference among the spunbond nonwoven samples in Group 1 and Group 3, with reference to Target 1 and Target 2, discriminant analysis can be used (SAS Institute Inc., 1990). The canonical discriminant function is defined as

$$\mathrm{CAN} = u_{1}x_{1} + u_{2}x_{2} + \cdots + u_{16}x_{16}, \qquad (18)$$

where CAN is a discriminant score, ui is a canonical coefficient (discriminant weight), and xi is a
Figure 6. Neural network learning curve (training MSE and cross-validation MSE versus training epoch)
Table 7. Softness grades by neural network model

Sample ID        Softness Index
Target 1         0
Group 1  #1.1    0.55
         #1.2    0.72
         #1.3    0.28
         #1.4    0.52
Group 3  #3.1    1.00
         #3.2    0.98
Target 2         1

KES-FB instrumental variable. The canonical coefficient vector u = (u1, u2, …, u16) can be obtained by solving the following matrix equation (Lindeman, Merenda, & Gold, 1980):

$$\left(W^{-1}B - \lambda I\right)u = 0. \qquad (19)$$

W⁻¹ is the inverse of the pooled within-group matrix W of sums of squares and cross-products, calculated by

$$W = \sum_{j} S^{(j)} \qquad (j = 1, 2, \ldots, 4, \text{ denoting the } j\text{th group}), \qquad (20)$$

where $S^{(j)} = (S_{kl}^{j})_{16 \times 16}$ and

$$S_{kl}^{j} = \sum_{i=1}^{n_j}\left(x_{ki}^{j} - \bar{x}_{k}^{j}\right)\left(x_{li}^{j} - \bar{x}_{l}^{j}\right) \qquad (k, l = 1, 2, \ldots, 16;\ n_j = \text{number of samples in Group } j).$$

B is the between-group matrix of sums of squares and cross-products, expressed as

$$B = (b_{kl})_{16 \times 16}, \qquad (21)$$

where

$$b_{kl} = \sum_{j=1}^{4} n_{j}\left(\bar{x}_{k}^{j} - \bar{x}_{k}\right)\left(\bar{x}_{l}^{j} - \bar{x}_{l}\right).$$

I is an identity matrix. λ is an eigenvalue of the matrix W⁻¹B and can be determined by the following characteristic equation:

$$\left|W^{-1}B - \lambda I\right| = 0. \qquad (22)$$
The number of eigenvalues is equal to the number of groups minus 1 (here 4 − 1 = 3). Substituting each λ into the matrix W⁻¹B − λI allows the determination of each adjoint matrix of W⁻¹B − λI. Any column in an adjoint matrix of W⁻¹B − λI is an eigenvector u in terms of an eigenvalue λ. In the case of using discriminant analysis for classifying the four nonwoven fabric groups discussed here, three canonical discriminant functions (CAN1, CAN2, and CAN3) are obtained. Using these three CAN scores as a coordinate system, the four nonwoven fabric groups can be plotted in this three-dimensional space (Figure 7). Because the cumulative proportion of the eigenvalues for the discriminant functions CAN1 and CAN2 reaches 96.39%, the projections of the four nonwoven groups on the CAN1-CAN2 plane are particularly examined (Figure 8). It can be seen that Group 3 is located between +CAN1 and +CAN2, and the softness index of the samples in Group 3 is closer to that represented by Target 2. On the contrary, Group 1 is located between −CAN1 and −CAN2, and the softness of the samples in this group more closely resembles that represented by Target 1.
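The matrices W (Equation 20) and B (Equation 21) can be assembled directly from the group samples. The following sketch (our own illustration on two variables and two small invented groups, rather than the 16 KES-FB variables and four groups of the study) computes both scatter matrices; the eigenvalue step of Equation 22 would then operate on W⁻¹B:

```python
def scatter_matrices(groups):
    """Given groups as lists of sample vectors, return (W, B):
    W = pooled within-group sums of squares and cross-products,
    B = between-group sums of squares and cross-products."""
    dim = len(groups[0][0])
    n_total = sum(len(g) for g in groups)
    grand_mean = [sum(x[k] for g in groups for x in g) / n_total
                  for k in range(dim)]
    W = [[0.0] * dim for _ in range(dim)]
    B = [[0.0] * dim for _ in range(dim)]
    for g in groups:
        n_j = len(g)
        mean_j = [sum(x[k] for x in g) / n_j for k in range(dim)]
        for k in range(dim):
            for l in range(dim):
                # Equation (20): cross-products of deviations from group means
                W[k][l] += sum((x[k] - mean_j[k]) * (x[l] - mean_j[l]) for x in g)
                # Equation (21): group means around the grand mean, weighted by n_j
                B[k][l] += n_j * (mean_j[k] - grand_mean[k]) * (mean_j[l] - grand_mean[l])
    return W, B

group1 = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5]]
group2 = [[4.0, 5.0], [5.0, 4.0], [4.5, 4.5]]
W, B = scatter_matrices([group1, group2])
```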
Summary for Fuzzy Neural Network for Grading Fabric Softness

Two types of experimental spunbond nonwovens were selected to represent two softness extremes: one with the best softness and the other with the worst. The mechanical properties of these nonwovens in terms of extension, shear, bending,
Figure 7. Fabric softness difference interpreted by discriminant analysis
Figure 8. Fabric softness difference interpreted by first two discriminant functions
compression, and surface friction and roughness were measured using the KES-FB instruments. The obtained KES-FB data helped fingerprint the softness of these two extreme nonwoven samples. A softness grading model was established using fuzzy neural network computing. This model graded the softness of a group of six spunbond nonwovens made from commercial polypropylene with numerical values between 0 (indicating the worst softness) and 1 (indicating the best softness). These numerical grading values could then be used as a softness index for spunbond nonwoven fabrics in communications among manufacturers and customers. The model cross-validation indicated that the estimate for the model MSE was as low as 0.0096. The multivariate method of discriminant analysis could be used to provide a graphical interpretation of the nonwoven softness index. The model established in this study was demonstrational because of the limited number of exemplars (sample points) in the training data set. Another disadvantage of the present procedure is that the neuro-fuzzy program was stand-alone and not accessible through the Internet. Further research is needed to expand the present fabric database to include various types of nonwoven fabrics and to incorporate the neuro-fuzzy computing procedure in the Web-based database structure. This will enable manufacturers or end users to execute the fabric softness evaluation by logging onto the fabric database Web site.
CONCLUSION AND FURTHER WORK

In this chapter, we have presented a fabric database and three different fuzzy computing approaches for evaluating fabric end-use properties regarding drape, tailorability, comfort, and softness. Quantitative assessment of these quality aspects of fabric end uses is still more art than science in today's textile production and consumption. The presented research cases help explore the ability of database computing and fuzzy logic technology to solve fabric performance grading problems that continuously challenge the textile community. In the application for predicting fabric drape and tailorability, the established online database and its intelligent search engine provided a useful tool for textile end users to quickly find quality fabrics that would meet specific product requirements.
The flexible queries supported by the database system enhanced the system's search efficiency. The method of fuzzy linear clustering was feasible for predicting fabric drape based on measured mechanical properties. The experiment indicated that trying to fit the training data with too many clusters, or training with too many data points, might cause a loss of generalization power. Therefore, we need to guard against overfitting when using fuzzy linear clustering. Our study also indicated that the hybrid method combining the K-nearest neighbor method with fuzzy linear clustering improved the prediction accuracy for fabric drape.

For the purpose of evaluating apparel comfort, four types of fuzzy comprehensive evaluation models were reviewed. Model (•, ⊕) was used to assess six different fabrics for three different end uses. The implementation of this model evaluation indicated that the approach of fuzzy comprehensive evaluation was useful for assessing the appropriateness of a specific fabric for a specific end use, based on instrumental testing of the fabric's critical physical properties and on the acquisition of the consumer's preference for apparel comfort. From the present study, we could conclude that plain wool fabric was better than twill wool fabric for making blouses, slacks, and underpants because plain wool fabric has a higher fuzzy comprehensive value. Similarly, plain cotton fabric was preferred to twill cotton fabric for the end uses of blouses, slacks, and underpants. Plain silk fabric was the best among the six types of fabric for the three end uses because of its highest fuzzy comprehensive values.

In the case of grading nonwoven softness using the method of fuzzy neural network computing, the neuro-fuzzy model CANFIS was selected to perform machine learning for the two targeted polypropylene spunbond nonwovens after inputting their mechanical property data (KES-FB instrument data).
The trained neuro-fuzzy model was then applied to evaluate the softness of six spunbond nonwoven samples made from commercial polypropylene. The model cross-validation revealed that the model MSE was 0.0096. This
neuro-fuzzy model was able to produce a numerical value between 0 and 1 for grading nonwoven softness.

It should be noted that there were some limitations in this research. First, the three fuzzy applications were three individual cases based on different data sets. The fabric database introduced was used only for establishing the fuzzy clustering models for evaluating fabric drape and tailorability. Second, the number of samples in each data set was limited, particularly in the data sets for the second and third cases. This may limit the general representativeness of the established fuzzy models. Finally, all the fuzzy computing procedures discussed in the above cases were separate and were not integrated into a single computer program package.

Further research is needed to advance the progress of fuzzy computing techniques for textile applications. Recommended future work is described below. The volume of the current fabric database needs to be increased so that it can include different fabric types, from wovens to nonwovens, and different fabric properties, from durability to aesthetics. By expanding this fabric database, we will be able to develop an integrated computer program package capable of running not only the fuzzy clustering methods, but also the basic fuzzy comprehensive models discussed and the neuro-fuzzy hybrid approach introduced. In this way, end users may have several choices for implementing different fuzzy evaluation procedures according to their different application needs. Additionally, the Web-based database architecture also needs enhancing to allow all the fuzzy evaluation procedures to be accessible through the Internet. Further experiments and validation will be carried out on the capability of the fuzzy clustering methods for establishing new search engines for the fabric database.
Special interests include the use of fuzzy linear clustering for discovering new patterns among fabric properties, the application of the fuzzy C-means algorithm for fabric classification and query answering, and the development of hybrid approaches combining fuzzy methods with decision-tree learning to predict fabric appearance and tailorability. Furthermore, the fuzzy clustering and query-answering method will be compared with the FSQL approach so as to find more effective ways to combine these two approaches for different applications.
ACKNOWLEDGMENT The authors would like to acknowledge their special thanks to the following individuals and organizations for all the assistance they have rendered in the course of this research work: Professor Sukhamay Kundu for helpful discussions on topics related to this work and for his permission to use his program for fuzzy linear clustering; Dr. Jackie Robeck from the University of Louisiana at Lafayette for providing fabric tailorability data; Sreeram Vuppala, Bin Zhang, Ayse Gilder, and Ting Zhang of Louisiana State University for their assistance in the implementation of the online fabric database and the acquisition of the fabric property data; Dr. Billie J. Collier and Zuopang Li of the University of Tennessee for providing experimental nonwoven samples; and finally, to the Louisiana Board of Regents and ExxonMobil Chemical Company for financial support. This work is also partially supported by the NSF grant ITR-0326387 and AFOSR grants FA955005-1-0454, F49620-03-1-0238, F49620-03-1-0239, and F49620-03-1-0241.
References Barrett, G. R., Clapp, T. G., & Titus, K. J. (1996). An on-line fabric classification technique using a wavelet-based neural network approach. Textile Research Journal, 66, 521-528. Bezdek, J. C. (1980). A convergence theorem for the fuzzy ISODATA clustering algorithms. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2, 1-8. Brand, R. H. (1964). Measurement of fabric aesthetics: Analysis of aesthetic components. Textile Research Journal, 34, 791-804. Chang, K. P., & Tae, J. K. (1997). Objective rating of seam pucker using neural networks. Textile Research Journal, 67, 494-502. Chen, J., Chen, Y., Zhang, B., & Gider, A. (2002). Fuzzy linear clustering for fabric selection from online database. In J. Keller & O. Nasraoui (Eds.), 2002 Annual Meeting of the North American Fuzzy Information Processing Society Proceedings (pp. 518-523). Piscataway, NJ: IEEE. Chen, J., & Kundu, S. (1999). Fuzzy control system design by fuzzy clustering and self-organization. In Proceedings of NAFIPS’96 Conference. Berkeley, CA: IEEE. Chen, P. W., Liang, T., Yau, H., & Lin, H. C. (1998). Classifying textile faults with a back-propagation neural network using power spectra. Textile Research Journal, 68, 121-126. Chen, Y., & Collier, B. J. (1997). Characterizing fabric end-use by fabric physical properties. Textile Research Journal, 67, 247-252. Chen, Y., Collier, B. J., & Collier, J. R. (1999). Application of cluster analysis to fabric classification. International Journal of Clothing Science and Technology, 11, 206-215. Chen, Y., Collier, B. J., Hu, P., & Quebedeaux, D. (2000). Objective evaluation of fabric softness. Textile Research Journal, 70, 443-448. Chen, Y., Zhao, T., & Collier, B. J. (2001). Prediction of fabric end-use using a neural network technique. Journal of Textile Institute, 92, 157-163. Chen, Y., Zhao, T., & Turner, B. (2001). A new computerized data acquisition and analysis system for KES-FB instruments. Textile Research Journal, 71, 767-770.
Chen, Y. Y., Liu, Y. F., & Wang, P. Z. (1983). Models of multifactorial evaluation. Fuzzy Mathematics, 1, 61-70.

Cheng, L., & Adams, D. L. (1995). Yarn strength prediction using neural network. Textile Research Journal, 65, 495-500.

Dilworth, D. (2007). For first time, apparel outsells computers online: Shop.org. DMNEWS. Retrieved May 22, 2007, from http://www.dmnews.com/cms/trackback/41082-1

Dorrity, J. L., Vachtsevanos, G. J., & Jasper, W. (1996). Real time fabric defect detection and control in weaving processes (Tech. Rep. No. G94-2). Wilmington, DE: National Textile Center.

Ellis, B. C., & Garnsworthy, R. K. (1980). A review of techniques for the assessment of hand. Textile Research Journal, 50, 231-238.

Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design, and implementation. Hershey, PA: Idea Group Publishing.

Gong, R. H., & Chen, Y. (1999). Predicting the performance of fabrics in garment manufacturing with artificial neural networks. Textile Research Journal, 69, 447-482.

Hathaway, R. J., & Bezdek, J. C. (1993). Switching regression models and fuzzy clustering. IEEE Transactions on Fuzzy Systems, 1, 195-204.

Hearle, J. W. S. (1993a). Can fabric hand enter the dataspace? Part I. Textile Horizons, 14-17.

Hearle, J. W. S. (1993b). Can fabric hand enter the dataspace? Part II: Measuring the unmeasurable? Textile Horizons, 16-20.

Howorth, W. S. (1964). The handle of suiting, lingerie, and dress fabrics. Journal of Textile Institute, 55, T251-T260.

Jang, J. R., Sun, C., & Mizutani, E. (1997). Neuro-fuzzy and soft computing. Upper Saddle River, NJ: Prentice-Hall, Inc.

Kawabata, S., & Niwa, M. (1989). Fabric performance in clothing manufacture. Journal of Textile Institute, 80, 19-50.

Kawabata, S., Niwa, M., & Wang, F. (1994). Objective hand measurement of nonwoven fabrics. Textile Research Journal, 64, 597-610.

Kemp, T., & Lewis, D. (2000). Retail site pushes Web envelope. InternetWeek. Retrieved September 8, 2000, from http://www.internetwk.com/lead/lead090800.htm

Kim, J. J., Shalev, I., & Barker, R. L. (1994). Softness properties of paper towels. TAPPI Journal, 77, 83-89.

Kundu, S., & Chen, J. (1994). Fuzzy linear invariant clustering with applications in fuzzy control. In Proceedings of the North American Fuzzy Information Processing Society (NAFIPS) Biannual Conference (pp. 196-200). Piscataway, NJ: IEEE.

Kuo, C. J., & Lee, C. (2003). A back-propagation neural network for recognizing fabric defects. Textile Research Journal, 73, 147-151.

Kuo, C. J., Shih, C., Kao, C., & Lee, J. (2005). Color and pattern analysis of printed fabric by an unsupervised clustering method. Textile Research Journal, 75, 9-12.

Kuo, C. J., Shih, C., & Lee, J. (2004). Automatic recognition of fabric weave patterns by a fuzzy c-means clustering method. Textile Research Journal, 74, 107-111.

Lindeman, H. R., Merenda, P. F., & Gold, R. Z. (1980). Introduction to bivariate and multivariate analysis. Glenview, IL: Scott, Foresman and Company.

Mikulcic, A., & Chen, J. (1996). Experiments on application of fuzzy clustering in fuzzy system design. In Proceedings of FUZZ IEEE 1996. New Orleans, LA: IEEE.
Pan, N., Yen, K. C., Zhao, S. J., & Yang, S. R. (1988a). A new approach to the objective evaluation of fabric handle from mechanical properties: Part I. Objective measure for total handle. Textile Research Journal, 58, 438-444. Pan, N., Yen, K. C., Zhao, S. J., & Yang, S. R. (1988b). A new approach to the objective evaluation of fabric handle from mechanical properties: Part II. Objective measure for primary handle. Textile Research Journal, 58, 531-537. Pan, N., Yen, K. C., Zhao, S. J., & Yang, S. R. (1988c). A new approach to the objective evaluation of fabric handle from mechanical properties: Part III. Fuzzy cluster analysis for fabric handle sorting. Textile Research Journal, 58, 565-571. Postle, R., & Mahar, T. J. (1982). Basic requirements for an international objective measurement programme for wool fabric quality. In S. Kawabata, R. Postle, & M. Niwa (Eds.), Objective specification of fabric quality, mechanical properties and performance (pp. 1-22). Osaka, Japan: The Textile Machinery Society of Japan. Principe, J. C., Euliano, N. R., & Lefebvre, W. C. (2000). Neural and adaptive systems: Fundamentals through simulations. New York: John Wiley & Sons, Inc. Pynckels, F., Kiekens, P., Sette, S., Langenhove, L. V., & Impe, K. (1995). Use of neural nets for determining the spinnability of fibres. Journal of Textile Institute, 86, 425-437. Raheel, M., & Liu, J. (1991). An empirical model for fabric hand: Part I. Objective assessment of light weight fabrics. Journal of Textile Institute, 61, 31-38. Rong, G. H., & Slater, K. (1992). A new approach to the assessment of textile performance. Journal of Textile Institute, 83, 197-208. Rudie, R. (1996). Denim does spring. Bobbin, 34-37. Rust, J. P., Keadle, T. L., Allen, D. B., & Barker, R. L. (1994). Tissue softness evaluation by mechani-
cal stylus scanning. Textile Research Journal, 64, 163-168. Sabharwal, J., & Chen, J. (1996). Intelligent pH control using fuzzy linear invariant clustering. In Proceedings of Southeastern IEEE Symposium on Systems Theory (pp. 514-518). Baton Rouge, LA: IEEE. Sarimveis, H., & Retsina, T. (2001). Tissue softness prediction using neural network methodologies: Such tools can improve productivity and minimize out-of-spec production. Pulp and Paper Canada, 102, 42-45. SAS Institute Inc. (1990). SAS/STAT user’s guide, version 6 (4th ed.). Cary, NC: SAS Institute Inc. Wang, G. Y. (1984). On the essence and application of models of comprehensive evaluations. Fuzzy Mathematics, 4, 81-88.
Key Terms

Artificial Neural Network (ANN): ANN is a computing paradigm that loosely simulates the cortical structures of the brain. The simplest element of an ANN is called a processing element, or node. Soft computing techniques are used to develop different types of ANN models based on different processing elements.

Bounded Sum: Denoted by the symbol ⊕, it is defined as α ⊕ β = min(1, α + β), or equivalently α ⊕ β = (α + β) ∧ 1. For crisp values, ⊕ reduces to or in Boolean algebra.

Drapability: It is the ability of a fabric to form pleating folds when deformed under its own weight.

Durability: Durability denotes textile and apparel quality features related to product reliability. These features include the change of tensile strength, tear strength, abrasion resistance, colorfastness, and cracking and bursting strength during service life.
Eigenvalue: It is a solution of a characteristic equation, representing the variance of a main effect.

Fabric Drape Coefficient: It is the ratio of the projected pleating-fold area formed by a piece of fabric after draping under its own weight to the original area of this piece of fabric without draping. The higher the fabric drape coefficient, the lower the fabric drapability.

Fabric Performance: Fabric performance is a general term for fabric end-use properties regarding durability, comfort, and aesthetics. Typical fabric properties such as breakage and abrasion, heat and moisture transport, hand and drape, and pattern and color are among those end-use properties.

Fuzzy Clustering: It is a family of clustering methods that partition a set of given data objects into (nondisjoint) fuzzy clusters. Each fuzzy cluster is a fuzzy set, and each object's membership degrees in all the clusters sum to one.

Fuzzy Linear Clustering: It is a fuzzy clustering method in which the prototype for each fuzzy cluster is a linear function of the input variables.
Fuzzy Set: A fuzzy set is a generalization of an ordinary (crisp) set. A fuzzy set S allows an element to have a partial degree (between zero and one) of membership in S.

Pure Bending Tester: It is an instrument for measuring fabric pure flexural deformation.

Shear Tester: It is an instrument for measuring fabric in-plane shear deformation.

Softness: Softness is a type of fabric touch feeling by a human hand, related to fabric bulk, flexibility, and springiness.

Spunbond: Spunbond is a specific nonwoven web-forming process of extruding, drawing, and laying synthetic filament fiber on a conveyor belt for fiber collection.

Tailorability: It is the ease of converting a piece of 2-D fabric into a required piece of 3-D garment component.

Yarn Interlacing: It is a method of forming woven fabrics by weaving two sets of yarns.
Chapter XXII
Applying Fuzzy Data Mining to Tourism Area R. A. Carrasco Universidad de Granada, Spain F. Araque Universidad de Granada, Spain A. Salguero Universidad de Granada, Spain M. A. Vila Universidad de Granada, Spain
ABSTRACT

Soaring is a recreational activity and a competitive sport in which individuals fly unpowered aircraft known as gliders. The soaring location selection process depends on a number of factors, resulting in a complex decision-making task. In this chapter, we propose the use of an extension of the FSQL language for fuzzy queries as one of the data mining techniques that can be used to solve the problem of offering a better soaring location given the environmental conditions and customer characteristics. The FSQL language is an extension of the SQL language that permits us to write flexible conditions in queries to a fuzzy or traditional database. After a process of clustering and characterization of a large customer database in a data warehouse, we are able to classify new clients into a cluster and offer an answer according to it.
INTRODUCTION

We can define data mining (DM) as the process of extracting interesting information from the data in
databases (DBs). According to Frawley, Piatetsky-Shapiro, and Matheus (1992), discovered knowledge is interesting when it is novel, potentially useful, and nontrivial to compute. A series of new
added services aimed to support the customer in the postconsumption phase. The goal is to build up strong customer relationships and loyalty, which may promote continuous buying behavior. Some examples of ICT value-added services that a tourism enterprise can offer are the automatic categorization of user travel preferences in order to match users up with travel options (Gretzel, Mitsche, Hwang, & Fesenmaier, 2004) or search engine interface metaphors for trip planning (Xiang & Fesenmaier, 2005).

Soaring is a recreational activity and competitive sport in which individuals fly unpowered aircraft known as gliders. The pilots of these gliders have had to sharpen their meteorological sense to maximize their soaring experience. The selection of the best zone to fly in is directly related to a pilot's skill. Soaring pilots get their lift from one main source: atmospheric instability. The more instability, the more height gained by pilots. The problem is that instability implies turbulence, and novice pilots can be injured. We use the data recorded by pilots' GPS (Global Positioning System) devices to discover regularities and patterns in order to predict and select worthy zones to fly in for pilots, depending on their characteristics.

In order to provide information to predict patterns and trends more convincingly and to analyze a problem or situation more efficiently, an integrated DSS (decision support system) designed for this particular purpose is needed. A DSS for adventure practice recommendation can be offered as a postconsumption value-added service by travel agencies to their customers. Therefore, once a customer makes an online reservation, the travel agency can offer advice about adventure practices available in the area that the customer may be interested in. Due to the high risk factor accompanying most adventure sports, a regular information system is far from being accurate.
A more sophisticated ICT system is required in order to extract and process quality information from different sources. In this way, the customer can be provided with true helpful assistance to be aided in the decision-making process.
This chapter is organized as follows. In the next section we briefly introduce FSQL and the FSQL server and show their capabilities related to DM. Next, we use the proposed system to solve some particular problems of soaring location selection. Finally, we suggest some conclusions and future work.
Literature Review

We can find in the literature authors who propose the use of group segmentation in order to help develop tourist-oriented marketing strategies aimed at increasing visitation to specific locations. Visitor segmentation approaches have been used to identify potential gender constraints for participating in recreational activities (Hudson, 2000). In this study, the author measured perceptions of intrapersonal, interpersonal, and structural constraints on skiing participation for potential skiers. Significant differences between men and women were discovered using a probabilistic approach, mainly based on the Chi-square analysis. Also, the use of group segmentation has helped to explain differences in what United States consumers want and what they actually buy in vacation scenarios (Shoemaker, 1994). The author claims there is an implicit assumption in all tourist motivation studies that the consumer will choose the destination or type of holiday or vacation that will best satisfy his or her desires or needs. Segmentation according to age has been carried out extensively for the senior market. Even within the senior market there are heterogeneous groups (Shoemaker, 1989). Also in the senior market, Fleischer and Pizam (2002) find that income and health are the determinant factors of heterogeneity between the groups in terms of vacation length. Furthermore, the number of vacation days decreases with age. We can see this as an example of functional dependency discovery.

In order to gain a profile of high-spending groups, segmentation studies have been carried out according to levels of expenditure at a destination. Nevertheless, studies on visitors to the Canary Islands and the Balearic Islands show that different nationalities and different demographic clusters imply heterogeneous expenditure patterns (Bethencourt, Díaz, Alvarez, & Gonzalez, 2002; Perez & Sampol, 2000). The activities undertaken by tourists can also be used to segment the tourist market. This is usually done by defining different types of tourists (Gibson & Yiannakis, 2002) or more specifically using activity choices to build homogeneous clusters (Shoemaker, 1994).
FSQL: A LANGUAGE FOR FLEXIBLE QUERIES

FSQL (Galindo et al., 2006) extends SQL to allow flexible queries. We summarize below the main extensions added to the SELECT command.
• Linguistic Labels: If an attribute is capable of undergoing fuzzy treatment, then linguistic labels can be defined on it. These labels are preceded with the symbol $ to distinguish them easily. There are two types of labels, used with different fuzzy attribute types. 1. Labels for attributes with an ordered underlying fuzzy domain: Every label of this type has an associated trapezoidal possibility distribution as shown in Figure 1. So, for example, we can define the labels $Very Short, $Short, $Normal, $Tall, and $Very Tall for the height attribute of a person. For instance, we can set a possibility distribution for label $Tall with the values αA=175, βA=180, γA=195, and δA=200 (in centimeters). 2. Labels for attributes with an unordered domain: Here, there is a similarity relation defined between each two labels in the domain. The similarity degree is in the interval [0,1]. For example, for the
Applying Fuzzy Data Mining to Tourism Area
Figure 1. Trapezoidal possibility distributions A and B
hair colour attribute of a person, we can define that labels $Fair and $RedHaired are similar with a 0.6 degree.
• Fuzzy Comparators: In addition to the common comparators (=, >, etc.), FSQL includes fuzzy comparators of two trapezoidal possibility distributions A and B with A=$[αA,βA,γA,δA] and B=$[αB,βB,γB,δB] (see Figure 1), as listed in Table 1. In the same way as in SQL, fuzzy comparators can compare one column with one constant or two columns of the same type. Necessity comparators are more restrictive than possibility comparators; that is, their fulfillment degrees are always lower than the fulfillment degrees of their corresponding possibility comparators (Galindo, Medina, Vila, & Pons, 1998).
• Fulfillment Thresholds γ: For each simple condition, a fulfillment threshold may be established (default is 1) with the format THOLD γ, indicating that the condition must be satisfied with a minimum degree γ in [0,1]. The reserved word THOLD is optional and can be substituted by a traditional crisp comparator (=, ≤, etc.), modifying the meaning of the query. The word THOLD is equivalent to the use of the crisp comparator ≥. Example 1: Give me all persons with fair hair (in minimum degree 0.5) that are possibly taller than label $Tall (in minimum degree 0.8).
SELECT * FROM Person WHERE Hair FEQ $Fair THOLD 0.5 AND Height FGT $Tall THOLD 0.8;
If we are looking for persons that are necessarily taller than label $Tall, then we use the fuzzy comparator NFGT instead of FGT.
• CDEG() Function: This function shows a column with the fulfillment degree of the condition of the query for a specific attribute, which is expressed in brackets as the argument. If logic operators appear, the calculation of the compatibility degree is carried out using the minimum T-norm and the maximum T-conorm, but the user may change these by modifying only one view. If the argument of the CDEG function is an attribute, then the CDEG function uses only the conditions that include that attribute. We can use CDEG(*) to obtain the fulfillment degree of each tuple (with all of its attributes, not just one of them) in the condition.
• Character %: It is similar to the character * of SQL, but this one also includes the columns with the fulfillment degrees of the attributes for which they are relevant.
• Fuzzy Constants: We can use and store in FSQL all of the fuzzy constants that appear in Table 2. Example 2: A query that shows a compatibility degree, uses a trapezoidal constant,
and avoids the UNKNOWN values could possibly be the following:
Figure 2. Fuzzy quantifier most
SELECT City, CDEG(Inhabitants)
FROM Population
WHERE Country = 'Spain' AND
      Inhabitants FGEQ $[200,350,650,800] 0.75;
The following are remarks concerning the example. • The minimum threshold is set at 0.75. The word THOLD does not appear because it is optional.
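The threshold mechanics of these examples can be sketched in ordinary code. The following Python sketch is illustrative only (the names `trapezoid`, `tall`, and the sample data are ours, not part of FSQL): it models the trapezoidal label $Tall from Example 1 and keeps only the rows whose fulfillment degree reaches a THOLD-style threshold.

```python
def trapezoid(alpha, beta, gamma, delta):
    """Membership function of a trapezoidal possibility
    distribution $[alpha, beta, gamma, delta]."""
    def mu(x):
        if x <= alpha or x >= delta:
            return 0.0
        if beta <= x <= gamma:
            return 1.0
        if x < beta:
            return (x - alpha) / (beta - alpha)
        return (delta - x) / (delta - gamma)
    return mu

# Label $Tall from Example 1 (values in centimeters).
tall = trapezoid(175, 180, 195, 200)

# CDEG for a crisp height is the label's membership degree;
# a THOLD of 0.8 keeps only rows whose degree reaches it.
persons = [("Ann", 178), ("Bob", 183), ("Eve", 170)]
result = [(name, tall(h)) for name, h in persons if tall(h) >= 0.8]
```

Only Bob, whose height falls on the label's plateau, survives the 0.8 threshold.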
Table 1. The 18 fuzzy comparators for FSQL (16 in the possibility/necessity family, and 2 in the inclusion family)

Possibility          Necessity
FEQ or F=            NFEQ or NF=
FDIF, F!= or F<>     NFDIF, NF!= or NF<>
FGT or F>            NFGT or NF>
FGEQ or F>=          NFGEQ or NF>=
FLT or F<            NFLT or NF<
FLEQ or F<=          NFLEQ or NF<=
MGT or F>>           NMGT or NF>>
MLT or F<<           NMLT or NF<<

a → b = 1 if a ≤ b, and a → b = b/a if a > b (Goguen implication). There are many choices of truth functions of logical connectives. However, the chosen collection of connectives should obey reasonable properties such as the adjointness property, which is required to be satisfied by the truth functions of conjunction and implication.
Endnote

1. Supported by grant No. 1ET101370417 of GA AV ČR and by institutional support, Research Plan MSM 6198959214.
Chapter XXVI
Fuzzy Inclusion Dependencies in Fuzzy Databases

Awadhesh Kumar Sharma, MMM Engg College, Gorakhpur, UP, India
A. Goswami, I.I.T., Kharagpur, India
D. K. Gupta, I.I.T., Kharagpur, India
Abstract
In this chapter, the concept of fuzzy inclusion dependencies (FIDαs) in fuzzy databases is introduced and inference rules on such FIDαs are derived. These FIDαs may arise while attempting to integrate fuzzy relational databases into a fuzzy relational multidatabase. Since the concept of FIDαs is itself new, no work has yet been done on their discovery in fuzzy relational databases. Hence, an algorithm is proposed for the discovery of FIDαs that may exist between two given fuzzy relations stored in one or more fuzzy relational databases. The existence of such an FIDα indicates that the two relations are relevant for integration during the course of integration of fuzzy databases.
INTRODUCTION AND BACKGROUND

Fuzzy Relational Databases

Crisp databases and fuzzy databases are developed to provide users with the ability to store data that
can be used in deriving information satisfying their needs. The design of these databases requires several theoretical foundations, efficiency, and ease of use. Thus, they accommodate a wider range of real-world requirements and provide a friendlier environment for man-machine interaction. The
information obtained from them can be used in decision making and problem solving in certain environments. The environment that involves uncertainty and imprecise, incomplete, or vague information can be dealt with using approximate reasoning. This is important because real-world data occur with partial or incomplete knowledge associated with them. For example, if the temperature of a person suffering from fever is 101.8°F, then one may specify it to be around 102°F. Others may indicate the same by simply saying that the person is suffering from high fever. All the above statements are relevant for answering queries related to the condition of the person in terms of the state of his or her fever. To manipulate such information and a variety of null values such as unknown or inapplicable data values, the fuzzy relational data models are developed based on fuzzy set theory and possibility theory. Thus, the fuzzy relational data models are rigorous schemes for incorporating fuzzy information in classical relational databases and in operations of relational algebra (Biskup, 1983; Brodie, 1984; Codd, 1979; Lipski, 1979, 1981; Maier, 1983). The basic definitions, concepts, and notations used in this chapter are taken from fuzzy set theory. The details of these concepts can be obtained from the works of Dubois and Prade (1980), Kaufmann (1975), Zadeh (1965, 1979, 1981, 1983, 1985), Kandel (1986), and Zemankova and Kandel (1985).
Definition: Let U be a universe of discourse. A set F is a fuzzy set of U if there is a membership function µ F : U → [0,1] that associates with each element u ∈ U a membership value µ F (u ) in the interval [0,1]. The membership value µ F (u ) for each u ∈ U represents the grade of membership of the element u in the fuzzy set F. F may be represented by F = {µ F (u ) / u | u ∈ U } .
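As a small illustration of this definition (our own sketch, not code from the chapter), a finite fuzzy set can be stored as a map from elements of U to membership grades:

```python
# Fuzzy set F = {mu_F(u)/u | u in U} over a finite universe,
# stored as element -> membership grade (grades in [0, 1]).
high_fever = {100.0: 0.2, 101.0: 0.5, 102.0: 0.8, 103.0: 1.0}

def mu(F, u):
    """Membership grade of u in F (0 for elements outside the support)."""
    return F.get(u, 0.0)
```

Elements not listed implicitly carry grade 0, which matches the usual convention of writing only the support of F.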
Notations and Definitions

In this section, a brief review of the definitions, notations, and concepts extensively used in this chapter is carried out.

Definition: Let U* = U1 × U2 × … × Un be the Cartesian product of n universes and A1, A2, …, An be fuzzy sets in U1, U2, …, Un, respectively. Then the Cartesian product A1 × A2 × … × An is defined to be a fuzzy subset (denoted by ⊆f) of U1 × U2 × … × Un, with

µA1×A2×…×An(u1, u2, …, un) = min(µA1(u1), µA2(u2), …, µAn(un)),

where ui ∈ Ui, i = 1, 2, …, n. An "n-ary" fuzzy relation R in U* is a relation that is characterized by an n-variate membership function ranging over U*, that is, µR : U* → [0,1].

Example: Suppose there is a need to capture the set of intelligent students in a university. Let the attributes that identify the intelligence level of a student be Name, Social Security Number (SSN), Age, and Grade (in university examination). The fuzzy relation, namely, the Intelligent-Student Relation, can be represented by Table 1. The Intelligent-Student Relation is

r ⊆f Name × SSN × Age × Grade.
Table 1. Intelligent-student relation

Name    SSN          Age        Grade     µr
Jack    123456789    0.87/22    0.77/A    0.77
Dave    987654321    0.93/33    0.88/B    0.88
Therefore, µr(Jack, 123456789, 0.87/22, 0.77/A) = 0.77.
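The µr values in Table 1 are obtained by taking the minimum of the component membership grades, as the Cartesian-product definition above prescribes. A quick illustrative check in Python (the attribute grades are hard-coded from Table 1; crisp attributes such as Name and SSN carry grade 1.0):

```python
def tuple_membership(*grades):
    """Membership of a tuple in A1 x A2 x ... x An is the minimum
    of the component membership grades."""
    return min(grades)

# Jack's tuple: crisp Name and SSN carry grade 1.0,
# Age has grade 0.87 and Grade has grade 0.77 (Table 1).
mu_jack = tuple_membership(1.0, 1.0, 0.87, 0.77)
```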
The concepts of α-cuts and strong α-cuts also find their importance and are defined as follows.

Definition: Given a fuzzy set A defined on U and any number α ∈ [0,1], the α-cut αA and the strong α-cut α+A are the crisp sets

αA = {u | µA(u) ≥ α} and α+A = {u | µA(u) > α}.

Example: Consider the fuzzy sets Young, Middle, and Old identifying the age of persons with the following membership functions on the interval [0,80]:

µYoung(x) = 1 for x ≤ 20; (35 − x)/15 for 20 < x < 35; and 0 for x ≥ 35.
µMiddle(x) = 0 for x ≤ 20 or x ≥ 60; (x − 20)/15 for 20 < x < 35; 1 for 35 ≤ x ≤ 45; and (60 − x)/15 for 45 < x < 60.
µOld(x) = 0 for x ≤ 45; (x − 45)/15 for 45 < x < 60; and 1 for x ≥ 60.

They are represented by Figure 1, where for the sake of simplicity, Young, Middle, and Old have been denoted by A1, A2, and A3. Then 0A1 = 0A2 = 0A3 = [0,80] = X, where 0A1, 0A2, and 0A3 represent the α-cuts of the fuzzy sets Young, Middle, and Old, respectively, for α = 0.
Figure 1. Membership functions for A1, A2, and A3
αA1 = [0, 35 − 15α], αA2 = [15α + 20, 60 − 15α], and αA3 = [15α + 45, 80] for all α ∈ (0,1];
α+A1 = [0, 35 − 15α), α+A2 = (15α + 20, 60 − 15α), and α+A3 = (15α + 45, 80] for all α ∈ [0,1);
1+A1 = 1+A2 = 1+A3 = ∅.
Then α1A ⊇ α2A and α1+A ⊇ α2+A for a fuzzy set A and a pair α1, α2 ∈ [0,1] such that α1 < α2. The support 0+A of a fuzzy set A within a universal set U is the crisp set that contains all the elements of U that have nonzero membership grades in A. The height h(A) of a fuzzy set A is given by
h(A) = sup_{u∈U} µA(u).

A fuzzy set A is called normal when h(A) = 1; it is called subnormal when h(A) < 1. The height h(A) of A may also be viewed as the supremum of α for which αA ≠ ∅. Any property generalized from classical set theory into the domain of fuzzy set theory that is preserved in all α-cuts for α ∈ (0,1] in the classical sense is called a cut-worthy property. If it is preserved in all strong α-cuts for α ∈ [0,1], it is called a strong cut-worthy property.
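The α-cuts of the Young fuzzy set in the example above can be checked numerically. The sketch below is our own; it works over an integer-discretized universe rather than the continuous interval [0, 80]:

```python
def mu_young(x):
    """Membership function of Young on [0, 80] (from the example)."""
    if x <= 20:
        return 1.0
    if x < 35:
        return (35 - x) / 15
    return 0.0

def alpha_cut(mu, universe, alpha, strong=False):
    """{u | mu(u) >= alpha}, or {u | mu(u) > alpha} for the strong cut."""
    keep = (lambda d: d > alpha) if strong else (lambda d: d >= alpha)
    return [u for u in universe if keep(mu(u))]

ages = range(0, 81)                      # discretized universe [0, 80]
cut = alpha_cut(mu_young, ages, 0.5)     # alpha A1 = [0, 35 - 15*0.5] = [0, 27.5]
strong = alpha_cut(mu_young, ages, 0.5, strong=True)
```

The integers retained, 0 through 27, agree with the closed-form cut [0, 35 − 15α] at α = 0.5.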
Properties of α-Cuts
The concepts of α-cuts and strong α-cuts play a principal role in the relationship between fuzzy sets and crisp sets. They can be viewed as a bridge by which fuzzy sets and crisp sets are connected. Let A, B ∈ F(X), where F(X) denotes the fuzzy power set of X (the set of all ordinary fuzzy sets of the universal set X). Then the following properties hold for all α, β ∈ [0,1]:

1. α+A ⊆ αA;
2. α ≤ β implies αA ⊇ βA and α+A ⊇ β+A;
3. α(A ∩ B) = αA ∩ αB and α(A ∪ B) = αA ∪ αB;
4. α+(A ∩ B) = α+A ∩ α+B and α+(A ∪ B) = α+A ∪ α+B;
5. α(Ā) = X − ((1−α)+A), that is, the crisp complement of the strong (1−α)-cut of A.
Let Ai ∈ F(X) for all i ∈ I, where I is an index set. Then:

6. ∪i∈I αAi ⊆ α(∪i∈I Ai) and α(∩i∈I Ai) = ∩i∈I αAi;
7. α+(∪i∈I Ai) = ∪i∈I α+Ai and α+(∩i∈I Ai) ⊆ ∩i∈I α+Ai;
8. A ⊆ B iff αA ⊆ αB for all α ∈ [0,1];
9. A ⊆ B iff α+A ⊆ α+B for all α ∈ [0,1];
10. A = B iff αA = αB for all α ∈ [0,1], and A = B iff α+A = α+B for all α ∈ [0,1].
(A − B)(x) = (x + 6)/4 for −6 < x ≤ −2; (2 − x)/4 for −2 < x ≤ 2; and 0 otherwise.

(A · B)(x) = 0 for x ≤ −5 and x > 15; [3 − (4 − x)^(1/2)]/2 for −5 < x ≤ 0; (1 + x)^(1/2)/2 for 0 < x ≤ 3; and [4 − (1 + x)^(1/2)]/2 for 3 < x ≤ 15.

(A / B)(x) = 0 for x ≤ −1 and x > 3; (x + 1)/(2 − 2x) for −1 < x ≤ 0; (5x + 1)/(2x + 2) for 0 < x ≤ 1/3; and (3 − x)/(2x + 2) for 1/3 < x ≤ 3.
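These piecewise results come from interval arithmetic on α-cuts (the first method). Assuming the usual textbook example of triangular fuzzy numbers A, peaked at 1 on [−1, 3], and B, peaked at 3 on [1, 5] (their definitions fall outside this excerpt), the α-cut of A − B is obtained by subtracting the endpoint intervals:

```python
def cuts_A(a):
    """alpha-cut of the assumed triangular fuzzy number A = (-1, 1, 3)."""
    return (2 * a - 1, 3 - 2 * a)

def cuts_B(a):
    """alpha-cut of the assumed triangular fuzzy number B = (1, 3, 5)."""
    return (2 * a + 1, 5 - 2 * a)

def cut_sub(a):
    """Interval subtraction of alpha-cuts: [a1 - b2, a2 - b1]."""
    a1, a2 = cuts_A(a)
    b1, b2 = cuts_B(a)
    return (a1 - b2, a2 - b1)

# At alpha = 0.5 the cut of A - B is [-4, 0]; the left branch
# (A - B)(x) = (x + 6)/4 indeed gives (-4 + 6)/4 = 0.5 at x = -4.
lo, hi = cut_sub(0.5)
```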
The second method of developing fuzzy arithmetic is based on the extension principle. Employing this principle, standard arithmetic operations on real numbers are extended to fuzzy numbers. Let ∗ denote any of the four basic arithmetic operations and let A, B denote fuzzy numbers. Then a fuzzy set A ∗ B on ℝ is defined by the equations
(A + B)(z) = sup_{z=x+y} min[A(x), B(y)],
(A − B)(z) = sup_{z=x−y} min[A(x), B(y)],
(A · B)(z) = sup_{z=x·y} min[A(x), B(y)],
(A / B)(z) = sup_{z=x/y} min[A(x), B(y)].

Although A ∗ B is a fuzzy set on ℝ, it is a continuous fuzzy number for each ∗ ∈ {+, −, ·, /} (Klir & Yuan, 1995).
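The sup–min definition can be approximated by sampling a discretized real line. The sketch below is illustrative; it assumes the same triangular fuzzy numbers A and B as above and computes (A − B)(z):

```python
def A(x):
    """Assumed triangular fuzzy number peaked at 1 on [-1, 3]."""
    return max(0.0, 1 - abs(x - 1) / 2)

def B(x):
    """Assumed triangular fuzzy number peaked at 3 on [1, 5]."""
    return max(0.0, 1 - abs(x - 3) / 2)

def ext_sub(z, n=400):
    """(A - B)(z) = sup over z = x - y of min(A(x), B(y)),
    approximated by sampling x on [-1, 3] and setting y = x - z."""
    best = 0.0
    for i in range(n + 1):
        x = -1.0 + 4.0 * i / n
        best = max(best, min(A(x), B(x - z)))
    return best

val = ext_sub(0.0)   # matches the closed form (2 - 0)/4 = 0.5
```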
Fuzzy Relations

An important tool for many applications of fuzzy set theory is the concept of fuzzy relations and fuzzy relational equations. Mathematically, an n-ary fuzzy relation r is defined as a fuzzy subset of the Cartesian product of some universes. Thus, given n universes U1, U2, ..., Un, a fuzzy relation r is a fuzzy subset of U1 × U2 × … × Un and is characterized by the n-variate membership function (Dubois & Prade, 1980; Kaufmann, 1975; Zadeh, 1981) µr : U1 × U2 × … × Un → [0,1].
While applying this definition to relational databases, it is necessary to provide appropriate interpretation for the elements of U i , i = 1, 2,, n and µr . For this purpose, it is noted that for a relational data model that can support imprecise information, it is necessary to accommodate two types of impreciseness, namely, the impreciseness in data values and impreciseness in the association among data values. As an example of impreciseness in data values, consider the Employee(Name, Salary) database, where the salary of an employee,
say, John, may be known to the extent that it lies in the range of $20,000 to $40,000, or it may be known only that John has a high salary. Similarly, as an example of impreciseness in the association among data values, let Likes(Student, Course) represent how much a student likes a particular course. Here, the data values may be precisely known, but the degree to which a student, say, John, likes a course is imprecise. It is also not difficult to envisage examples where both ambiguity in data values as well as impreciseness in the association among them are present. Some attempts have been made to use fuzzy sets theory and related concepts for providing a suitable interpretation of different types of impreciseness in relational databases. Buckles and Petry (1982a, 1982b, 1985a, 1985b) have suggested that attribute values be replaced by sets of values. Also, a fuzzy similarity measure has been used to identify similar tuples. Ruspini (1982) has used a lattice organization for domains, where domain values correspond to one or more lattice points determined by a possibility distribution. Umano (1982, 1984) and Prade and Testemale (1984) have proposed models based explicitly on possibility distribution, where domain values are taken from sets of possibility distributions and associations among entities are also measured by possibility distributions. Prade and Testemale (1982, 1984) have shown that such an extended data model can accommodate different types of null values used in the classical relational database literature. Baldwin (1983) has used a mixed approach, where domain values are allowed to be fuzzy sets, and association among entities is represented as a truth value in [0,1]. Zemankova and Kandel (1985) have provided a good exposition of the fuzzy relational data model and have advocated the use of linguistic quantifiers. 
As the fuzzy set theory is incorporated in relational database systems, it can treat null values as in Codd (1979), incomplete or partial information as in Lipski (1979, 1981), and uncertain or even probabilistic knowledge. The present treatment of the fuzzy relational data model will try to adhere to the notations used
in the classical relational database theory as far as possible. Thus, a relation scheme R is a finite set of attribute names {A1,A2,...,An} and will be denoted by R(A1,A2,...,An) or simply by R. Corresponding to each attribute Ai ,1 ≤ i ≤ n , there exists a set dom(Ai), called the domain of Ai. However, unlike classical relations, in the fuzzy relational data model, dom(Ai) may be fuzzy set or even a set of fuzzy sets. Hence, along with each attribute Ai, a set Ui is associated, called the universe of discourse for the domain values of Ai. Definition: A fuzzy relation r on a relation scheme R(A1,A2,...,An) is fuzzy subset of dom( A1 ) × dom( A2 ) × × dom( An ) . Depending on the complexity of dom(Ai),i = 1,2,...n, fuzzy relations are classified into two categories. In Type 1 fuzzy relations, dom(Ai) can only be a fuzzy set (or a classical set). A Type 1 fuzzy relation may be considered as a first-level extension of classical relations that will enable us to capture the impreciseness in the association among entities. Type 2 fuzzy relations provide further generalization by allowing dom(Ai) to be a set of fuzzy sets (or possibility distributions). By enlarging dom(Ai), Type 2 relations enable us to represent a wider type of impreciseness in data values. Such relations can be considered as a second-level generalization of classical relations. Like classical relations, a fuzzy relation r is represented as a table with an additional column for µr (t ) denoting the membership value of the tuple t in r. This table will contain only those tuples for which µr (t ) > 0 .
Possibility Distribution

Instead of treating µF(u) as the grade of the membership of u in F, one may interpret it as a measure of the possibility that a variable X has a value u, where X takes values in U. For example, the fuzzy set High-Salary may be considered as follows:
High-Salary = {0.5/20000, 0.6/30000, 0.7/40000, 0.9/50000, 0.1/70000}. Suppose it is known that John has a high salary; then, according to possibilistic interpretation, one concludes that the possibility of John having salary=$40000 is 0.7. Zadeh (1979, 1981) has suggested that a fuzzy proposition X is F, where F is a fuzzy subset of U, and X, which is a variable that takes its value from U, induces a possibility distribution Π X , which is equal to F (i.e., Π X = F ). The possibility assignment equation is interpreted as Poss ( X = u ) = µ F (u ), for all u ∈ U .
Thus, the possibility distribution of X is a fuzzy set that serves to define the possibility that X could have any specified value u ∈ U . One may also define a function Π X : U → [0,1] that is equal to µ F and associates with each u ∈ U the possibility that X could take u as its value, that is, Π X (u ) = Poss ( X = u ) for u ∈ U .
The function ΠX is called the possibility distribution function of X. The possibility distribution ΠX may also be used to define a fuzzy measure Π on U, where for any A ⊆f U, Π(A) = Poss(X ∈ A) = sup_{u∈A} ΠX(u).
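A small Python sketch of this possibilistic reading (the High-Salary grades come from the text; the function names are ours):

```python
# Possibility distribution for "John has a high salary"
# (the High-Salary fuzzy set from the text).
high_salary = {20000: 0.5, 30000: 0.6, 40000: 0.7, 50000: 0.9, 70000: 0.1}

def poss(dist, value):
    """Poss(X = value) = mu_F(value)."""
    return dist.get(value, 0.0)

def poss_in(dist, subset):
    """Poss(X in A) = sup of the distribution over A."""
    return max((dist.get(u, 0.0) for u in subset), default=0.0)

# Possibility that John earns $50,000 or more:
p = poss_in(high_salary, {50000, 70000})
```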
Type 1 Fuzzy Relational Data Model

As discussed earlier, in Type 1 fuzzy relations, dom(Ai) may be a classical subset or a fuzzy subset of Ui. Let the membership function of dom(Ai) be denoted by µAi for i = 1,2,...,n. Then, from the definition of the Cartesian product of fuzzy sets, dom(A1) × dom(A2) × … × dom(An) is a fuzzy subset of U* = U1 × U2 × … × Un. Hence, a Type 1 fuzzy relation r is also a fuzzy subset of U* with membership function µr. Also, from the definition of the Cartesian product of fuzzy sets, for all
(u1, u2, …, un) ∈ U*, µr must satisfy

µr(u1, u2, …, un) ≤ min(µA1(u1), µA2(u2), …, µAn(un)).
According to the possibilistic interpretation of fuzzy sets, µr can be treated as a possibility distribution function in U*. Thus, µr(u1, u2, …, un) determines the possibility that a tuple t ∈ U* has t[Ai] = ui for i = 1,2,...,n. In other words, µr(u1, u2, …, un) is a fuzzy measure of association among a set of domain values {u1, u2, …, un}. Example: Consider a relation scheme LIKES(Student, Course), where dom(Student) and dom(Course) are ordinary sets; that is, the domain values are crisp. In the fuzzy relation r shown in Table 2, µr(t) can be interpreted as a possibility measure of a student liking a particular course. Thus, the possibility of Soma liking OOPS is 0.85. So, µr is a fuzzy measure of the association between Student and Course. It is also possible to provide an alternative interpretation of µr as a fuzzy truth value belonging to [0,1]. According to this interpretation, for a tuple t, µr(t) is the truth value of a fuzzy predicate associated with the relation r when the variables in the predicate are replaced by t[Ai], i = 1,2,...,n. Example: Consider a relation scheme R(N,J,X,S) of highly experienced and highly salaried employees, where N=Employee's name, J=Job, X=Experience, and S=Salary. Here, dom(N) and dom(J) are ordinary sets, but dom(X) and dom(S) are the fuzzy sets High-Experience and High-Salary in appropriate

Table 2. An instance r of LIKES

Student    Course    µr
Soma       OOPS      .85
Roma       DBMS      .75
John       CG        .8
Mary       DSA       .9
universes. Suppose that the universe of discourse UX for Experience is the set of positive integers in the range 0 to 30. Similarly, US, the universe of discourse of Salary, is the set of integer numbers in the range 10,000 to 100,000. The membership functions µHX and µHS of the fuzzy sets High-Experience and High-Salary are as follows:

µHX(x) = [1 + |x − 10|/4]^(−1) for x ≤ 10, and 1 for x > 10;
µHS(x) = [1 + |x − 60000|/20000]^(−1) for x ≤ 60000, and 1 for x > 60000.
Note that the membership function associated with the fuzzy set descriptor High is domain dependent. A typical instance r of R is shown in Table 3. In this example, µr(t) can be interpreted as the truth value of the fuzzy proposition "Y has High-Experience and High-Salary" for tuple t. Thus, the truth value of the fuzzy proposition "Roma has High-Experience and High-Salary" is 0.80. In many applications, it may be necessary to combine both of these interpretations of the membership function. For example, in the entity relationship (E-R) model (Date, 1981; Maier, 1983; Ullman, 1980), one may interpret µr as the possibility of association among the entities and follow the truth value interpretation for membership of a tuple in the entity sets. In this connection, a paper by Zvieli and Chen (1986) may be referred to, where
fuzzy set theory has been applied to extend the E-R model and the basic operations of fuzzy E-R algebra have been examined.
Type 2 Fuzzy Relational Data Model

Although Type 1 relations enable us to represent impreciseness in the association among data values, their role in capturing uncertainty in data values is rather limited. For example, in a Type 1 relational model for Employee(Name, Salary), one is not permitted to specify the salary of John to be in the range $40,000 to $50,000 and that of Mary to be in the fuzzy set Low. With a view to accommodating a wider class of data ambiguities, a further generalization of the fuzzy relational data model has been considered where, for any attribute Ai, dom(Ai) may be a set of fuzzy sets in Ui. As a consequence of this generalization, a tuple t = (a1, a2, ..., an) in D = dom(A1) × dom(A2) × … × dom(An) becomes a fuzzy subset of U* = U1 × U2 × … × Un, with

µt(u1, u2, …, un) = min(µa1(u1), µa2(u2), …, µan(un)),

where ui ∈ Ui, for i = 1, 2, …, n. Since this equation holds for all ui ∈ Ui, for i = 1, 2, …, n, and according to the definition of a fuzzy relation, a Type 2 fuzzy relation r is a fuzzy subset of D, where the membership function µr : D → [0,1] must satisfy the following condition:

µr(t) ≤ max_{(u1,u2,…,un)∈U*} min(µa1(u1), µa2(u2), …, µan(un)),
where t = (a1, a2, …, an) ∈ D. As in the case of Type 1 relations, µr may be interpreted either as
Table 3. An example of a fuzzy relation in a Type 1 fuzzy relational database

Name    Job    Experience    Salary    µr
Soma    GM     12            80000     1.00
Roma    DGM    9             70000     0.80
John    DEN    8             40000     0.50
Mary    CPO    8             60000     0.67
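The µr column of Table 3 can be recomputed from the µHX and µHS membership functions by taking the minimum of the two grades; a quick illustrative check in Python:

```python
def mu_hx(x):
    """High-Experience: [1 + |x - 10|/4]^(-1) for x <= 10, else 1."""
    return 1.0 if x > 10 else 1.0 / (1.0 + abs(x - 10) / 4.0)

def mu_hs(s):
    """High-Salary: [1 + |s - 60000|/20000]^(-1) for s <= 60000, else 1."""
    return 1.0 if s > 60000 else 1.0 / (1.0 + abs(s - 60000) / 20000.0)

rows = [("Soma", 12, 80000), ("Roma", 9, 70000),
        ("John", 8, 40000), ("Mary", 8, 60000)]

# Truth value of "Y has High-Experience and High-Salary" per tuple.
mu_r = {n: round(min(mu_hx(x), mu_hs(s)), 2) for n, x, s in rows}
```

The computed values reproduce Table 3 exactly.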
a possibility measure of association among the data values or as a truth value of fuzzy predicates associated with r. Regarding the interpretation of a fuzzy data value ai ∈ dom(Ai), the ai is treated as a possibility distribution on Ui. In other words, for a tuple t = (a1, a2, …, an) ∈ D, the possibility of t[Ai] = ui is µAi(ui). Example: Suppose that an instance of the relation Employee(Name, Salary) may contain a tuple (John, S), where S = {0.3/10000, 0.6/20000, 0.8/30000}. Here, S represents the possibility distribution for the salary of John, that is, Poss(Salary of John = 30000) = 0.8. Based on the possibilistic interpretation for the tuple t of fuzzy relation r, the following is obtained:

Poss(t[A1] = u1, t[A2] = u2, …, t[An] = un) = min{µr(t), µt(u1, u2, …, un)},
where ui ∈ U i , for i = 1, 2, , n . It is also possible to extend the above equation to find the possibility that for a tuple t = (a1,a2,...,an), t[Ai] = ai, where ai is a fuzzy subset of Ui. The evaluation of such a condition is, however, related to the concept of the compatibility of two fuzzy propositions (Dubois & Prade, 1980; Zadeh, 1979, 1981, 1983, 1985). Example: Consider the relation EMPLOYEE(N,D,J,X,S,I) where N=Name of the Employee, D=Department, J=Job, X=Experience, S=Salary, and I=Income Tax. dom(N), dom(D), and dom(J) are ordinary sets, but dom(X), dom(S), and dom(I) are sets of fuzzy sets in universes UX, US,
and UI, respectively. UX, US, and UI are assumed to be sets of positive integers in the ranges 0 to 30, 10,000 to 100,000, and 0 to 10,000, respectively. A typical instance r of EMPLOYEE is shown in Table 4, where fuzzy set descriptors High, Low, Mod, and so forth have been used to represent fuzzy data values over the respective domains. Since all elements of a classical subset have a membership value of 1.0, for notational convenience, elements of classical subsets are represented without their membership values, such as {10, 14, 19} instead of {1.0/10, 1.0/14, 1.0/19}. The membership functions of the fuzzy set descriptors High, Low, and so on are domain dependent, as follows. For x ∈ UX,

µMod(x) = (1 + |x − 8|)^(−1) for x ≥ 1, and 0 otherwise;
µLittle(x) = (1 + x/2)^(−1) for x > 0, and 0 otherwise;
µHigh(y) = (1 + a|y − c|)^(−1) for y ≤ c, and 1 for y > c; and
µLow(y) = 1 − µHigh(y),

where for y ∈ US, a = 1/20,000 and c = 60,000, and for y ∈ UI, a = 1/1,000 and c = 5,000.
Table 4. An example of a fuzzy relation in a Type 2 fuzzy relational database

Name    Department    Job    Experience    Salary      Income Tax    µr
Soma    Finance       GM     0.9/Mod       0.9/High    0.8/High      0.8
Roma    Finance       DGM    0.8/Little    0.3/Low     0.5/High      0.3
John    Electrical    DEN    0.9/Mod       0.6/High    0.2/Low       0.2
Mary    Personnel     CPO    0.2/Little    0.7/Mod     0.4/Low       0.2
Applying the equation for possibilistic interpretation to the third tuple in r, it may be concluded that the possibility of John having Moderate Experience, High Salary, and Low Income Tax is 0.2. The possibility value thus obtained would be useful during query evaluation to identify the tuples that have nonzero (or greater than a given threshold) possibility of satisfying the query predicate.
Fuzzy Data Dependencies

A functional dependency is a constraint on a set of attributes (A1, A2, …, Ak, X) in a relation R, specifying that for any two tuples t1 and t2 from R, the following condition holds:

t1(A1, A2, …, Ak) = t2(A1, A2, …, Ak) ⇒ t1(X) = t2(X).
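For crisp relations, this condition can be checked with a single scan that groups tuples on the determinant attributes. The sketch below is illustrative (the relation, attribute names, and helper are ours, not from the chapter):

```python
def holds_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in rows.
    rows: list of dicts; lhs, rhs: tuples of attribute names."""
    seen = {}
    for t in rows:
        key = tuple(t[a] for a in lhs)
        val = tuple(t[a] for a in rhs)
        # First occurrence of key fixes val; any disagreement refutes the FD.
        if seen.setdefault(key, val) != val:
            return False
    return True

emp = [{"Name": "Soma", "Dept": "Finance", "Job": "GM"},
       {"Name": "Roma", "Dept": "Finance", "Job": "DGM"}]

ok = holds_fd(emp, ("Name",), ("Job",))    # Name determines Job here
bad = holds_fd(emp, ("Dept",), ("Job",))   # Finance maps to both GM and DGM
```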
The derivation of functional dependencies through inference rules has been treated extensively in Casanova, Fagin, and Papadimitriou (1982), Kantola, Mannila, Räihä, and Siirtola (1992), Missaoui and Godin (1990), and Mitchell (1983). The problem of finding evidence for functional dependencies from the extent of relations has also been considered. Several projects deal with the question of how to efficiently find candidates for functional dependencies from among the attributes of a relation (Bell & Brockhausen, 1995; Savnik & Flach, 1993). Functional dependencies and inclusion dependencies (INDs) are related but have some important differences. In particular, functional dependencies generally are defined only within one relation, whereas the natural purpose of inclusion dependencies is to define relationships across two different relations. Mitchell (1983) also considers inclusion dependencies within one relation. Functional and inclusion dependencies are related in the sense that they both constrain possible valid database states and are thus helpful in database design. However, for the purpose of discovering information about relationships across unknown fuzzy relational databases, the case of fuzzy inclusion dependencies (FIDs) is more useful.
Usually, meta-information about databases, such as the semantics of schema objects, functional dependencies, or relationships between different databases, is not explicitly available for database integration. Often, only the schema is known (or can be queried) and the database can be queried through some kind of interface (e.g., using a query language such as SQL, structured query language). However, many other kinds of meta-information about sources would be beneficial to perform meaningful database integration. One important class of meta-information is the class of constraints that restrict the possible states of a database. Such constraints are useful in the determination of relationships between databases (Larson, Navade, & Elmasari, 1989) and thus for database integration. Manual search for such constraints is tedious and often not possible. This is true in particular when many related databases are available or when large relations (with many attributes) are to be compared for interrelationships. Therefore, the question of whether it is possible to automatically discover meta-information in otherwise unknown databases is important and has been approached by a number of authors, for example, Kantola et al. (1992), Koeller and Rundensteiner (2003), and Savnik and Flach (1993). While some types of constraints have been studied to some extent (for example, functional dependencies [Savnik & Flach, 1993] and various key constraints [Larson et al., 1989]), one important class of constraints, namely, INDs, has received little attention in the literature so far. INDs express subset relationships between databases and are thus important indicators for redundancies between databases. Inference rules on such dependencies have been derived in the literature (Casanova et al., 1982; Mitchell, 1983). 
Furthermore, researchers have studied the discovery of relationships between Web sites, in which Web sites with their hyperlinks are modeled as graphs (Cho, Shivakumar, & Garcia-Molina, 2000). It has been widely recognized that the imprecision and incompleteness inherent in real-world data suggest a fuzzy extension for information management systems. Various attempts
Fuzzy Inclusion Dependencies in Fuzzy Databases
to enhance these systems by fuzzy extensions can be found in the literature. In the context of the integration of fuzzy relational databases, FIDαs (Sharma, Goswami, & Gupta, 2004) can help to solve a very common and difficult problem: discovering redundancies across fuzzy relational databases. Due to the nature of fuzzy data and their generation, fuzzy information may often be stored in multiple places with large amounts of redundancy. When trying to integrate fuzzy databases that are likely to be (even partly) redundant, a method to discover such redundancies would be very beneficial. In general, the discovery of FIDαs (Sharma et al.) will be beneficial in any effort to integrate unknown fuzzy databases. A reliable algorithm to discover FIDαs will enable an integration system to incorporate new fuzzy relational databases that would not have been used previously because their relationships with existing fuzzy relational databases were not known. The following are some basic preliminaries for the current work.

Definition: For the fuzzy value equivalent (FVEQ), let A and B be two fuzzy sets defined on a universe of discourse U with membership functions µ_A and µ_B, respectively. A fuzzy value a ∈ A is said to be equivalent to some other fuzzy value b ∈ B iff b ∈ µ_B(x) for some x ∈ S, where S is the set of crisp values returned by µ_A^{-1}(a), and µ_A^{-1} is the inverse of the membership function of fuzzy set A.

Example: Consider Figure 4, where membership functions representing the fuzzy sets child, young, mid, and old are used to identify the age of a person in relations Emp and Staff. These relations are under different DBAs; hence, the membership functions that correspond to these relations are nonidentical, having been designed by different domain experts.
Now let µ_A and µ_B represent the membership functions of the fuzzy set young used in fuzzy relations Emp and Staff, respectively, and let µ_C represent the membership function of the fuzzy set mid used in fuzzy relation Staff. µ_A and µ_B are not identical
because of individual differences between domain experts. Let there be a fuzzy value (0.5/young) = a ∈ A; then µ_A^{-1}(a) = {25, 35} = S. If the age x of a person is 25 years, then µ_B(x) = (1.0/young) = b ∈ B. Therefore, the fuzzy value (0.5/young) in fuzzy set A is said to be FVEQ to the fuzzy value (1.0/young) in fuzzy set B. Similarly, if the age x of a person is 35 years, then µ_C(x) = (0.5/mid) = b ∈ C; hence, the fuzzy value (0.5/young) in fuzzy set A is said to be FVEQ to the fuzzy value (0.5/mid) in fuzzy set C.
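The FVEQ check above can be sketched in code. The following Python fragment is illustrative only: the triangular membership shape, the breakpoints, and all function names are assumptions, not taken from the chapter.

```python
def tri(a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

def tri_inverse(a, b, c, grade):
    """Crisp values x with mu(x) == grade (the set S in the definition)."""
    if grade <= 0.0 or grade > 1.0:
        return []
    if grade == 1.0:
        return [b]
    return [a + grade * (b - a), c - grade * (c - b)]

def fveq_targets(grade_a, spec_a, mu_b):
    """Fuzzy values of B reachable from grade_a in A: {mu_B(x) : x in S}."""
    return [round(mu_b(x), 2) for x in tri_inverse(*spec_a, grade_a)]

young_emp = (15, 25, 35)        # assumed breakpoints, for illustration only
young_staff = tri(18, 25, 32)   # a slightly different expert's 'young'
print(fveq_targets(0.5, young_emp, young_staff))   # [0.29, 0.29]
```

With the assumed breakpoints, the grade 0.5 of young in Emp maps to the crisp ages {20, 30}, and each of those ages is then evaluated under the Staff membership function, mirroring the two-step FVEQ definition.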
Notations Used for Fuzzy Relational Databases

The fuzzy relational data model as given by Buckles and Petry (1982) and its derivatives are considered here; however, throughout this work, notations similar to those in Casanova et al. (1982) will be used. Set variables will be denoted by capital letters, and variables that denote elements of a set will be denoted by small letters. A "k-subset of X" means a subset of X with cardinality k, while a "k-set" is simply a set with cardinality k. A fuzzy value is an element of data that is stored in a fuzzy relation's extent; examples include .6/good, .5/old, and .8/high. A domain D is a finite set of fuzzy values. A fuzzy attribute is a bag (multiset) of fuzzy values. A fuzzy relational schema is a pair (Rel, U), where Rel is the name of the fuzzy relation and U = (a1, a2, ..., an) is a finite ordered n-tuple of labels known as fuzzy attribute names. A fuzzy relation is a three-tuple R = (Rel, U, E), with Rel and U as above and E ⊆ D1 × D2 × ... × Dn being the fuzzy relation extent. The sets D1, D2, ..., Dn are called the domains of R's fuzzy attributes. A fuzzy tuple in fuzzy relation R is an element of E. An operator t[a1, a2, ..., an] returns the projection of t on the fuzzy attributes named a1, a2, ..., an.
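The notation above can be rendered as a minimal Python sketch; the class and helper names are illustrative, not part of the chapter's formalism.

```python
from typing import NamedTuple

class FuzzyRelation(NamedTuple):
    rel: str                  # fuzzy relation name Rel
    attrs: tuple              # ordered fuzzy attribute names U
    extent: frozenset         # extent E, a set of fuzzy tuples

def project(t, attrs, names):
    """t[a1, ..., ak]: projection of fuzzy tuple t on the named attributes."""
    return tuple(t[attrs.index(n)] for n in names)

# Toy instance: fuzzy values are kept as grade/label strings.
emp = FuzzyRelation("Emp", ("Age", "Pay"),
                    frozenset({(".9/mid", ".7/mod"), (".4/mid", ".4/high")}))
t = (".9/mid", ".7/mod")
print(project(t, emp.attrs, ("Pay",)))   # ('.7/mod',)
```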
Fuzzy Inclusion Dependencies

Fagin (1981) introduced and formally defined the inclusion dependency, which can be derived across
two relations. Similarly, the fuzzy inclusion dependency (Sharma et al., 2004) has been introduced and formally defined, and can be derived across two fuzzy relations as given below.

Definition: Let R[a1, a2, ..., an] and S[b1, b2, ..., bm] be (projections on) two fuzzy relations. Let X be a sequence of k distinct fuzzy attribute names from R, and Y be a sequence of k distinct fuzzy attribute names from S, with 1 ≤ k ≤ min(n, m). Then, a fuzzy inclusion dependency FID is an assertion of the form R[X] ⊆ S[Y], where all the fuzzy values under all the attribute names in R[X] are fuzzy value equivalent to some fuzzy values under the respective attribute names in S[Y]; however, the reverse may not hold.

Remark: The assertion R[X] ⊆ S[Y] in the above definition indicates µ_X(u) ⊆ µ_Y(u) for all u ∈ U, which may not be fully satisfied because two different database designers may have different perceptions of the same object and may have used different membership functions to represent the same fuzzy set. For example, in Figure 4, Emp[Age] uses a fuzzy set mid with support (35-55) to identify middle-aged persons, whereas Staff[Age] uses a fuzzy set mid with support (30-50) in the same context. This leads to the definition of the partial fuzzy inclusion dependency FIDα as follows.

Definition (Partial Fuzzy Inclusion Dependency): Let R[a1, a2, ..., an] and S[b1, b2, ..., bm] be (projections on) two fuzzy relations. Let X be a sequence of k distinct fuzzy attribute names from R, and Y be a sequence of k distinct fuzzy attribute names from S, with 1 ≤ k ≤ min(n, m). Then, a partial fuzzy inclusion dependency FIDα is an assertion of the form R[X] ⊆ S[Y] such that the fuzzy subsethood

f(R[X], S[Y]) = |R[X] ∩ S[Y]| / |R[X]| ≥ α,

where α is specified in the interval [0,1] and most of the fuzzy values under all the attribute names in R[X] are fuzzy value equivalent to some fuzzy values under the respective attribute names in S[Y]; however, the reverse may not hold.

Definition: An FIDα ρ = (R[a_{i1}, a_{i2}, ..., a_{ik}] ⊆ S[b_{i1}, b_{i2}, ..., b_{ik}]) is valid between two relations R = (r, (a1, a2, ..., an), ER) and S = (s, (b1, b2, ..., bm), ES) if the sets of fuzzy tuples in ER and ES satisfy the assertion given by ρ. Otherwise, FIDα is called invalid for R and S. In other words, FIDα is said to be valid if f(R[X], S[Y]) = |R[X] ∩ S[Y]| / |R[X]| ≥ α holds.

Example: Consider the fuzzy relations belonging to different fuzzy relational databases and their respective membership functions and mappings as given in Figure 4. It is observed that Emp[Age] = {.9/mid, .9/mid, .4/mid, .9/mid, .9/mid}, Emp[Pay] = {.7/mod, .75/low, .4/high, .7/mod, .7/mod}, Staff[Age] = {.6/mid, .9/old, .1/old}, and Faculty[Salary] = {.5/mod, .8/low, .4/high, .3/high}.
Valid Fuzzy Inclusion Dependencies (FIDα)

All elements of the fuzzy set Staff[Age] are fuzzy value equivalent to some element of the fuzzy set Emp[Age], so Staff[Age] ∩ Emp[Age] = Staff[Age]. Thus, the fuzzy subsethood

f(Staff[Age], Emp[Age]) = |{.6/mid, .9/old, .1/old}| / |{.6/mid, .9/old, .1/old}| = 3/3 = 1.

Hence, the fuzzy inclusion dependency FID = Staff[Age] ⊆ Emp[Age] is valid. Similarly, as indicated in Figure 4, all elements (except one, .4/high) of the fuzzy set Faculty[Salary] are fuzzy value equivalent to some element of the fuzzy set Emp[Pay]; therefore, the fuzzy subsethood

f(Faculty[Salary], Emp[Pay]) = |{.5/mod, .8/low, .3/high}| / |{.5/mod, .8/low, .4/high, .3/high}| = 3/4 = .75,
Figure 4. Fuzzy relations with respective membership functions and mappings
[Figure 4 plots the membership functions of the fuzzy sets low, mod, and high over Emp.Pay and over Faculty.Salary (both in thousand Rs.).]
and the partial fuzzy inclusion dependency FIDα=.75 = Faculty[Salary] ⊆ Emp[Pay] is valid. A fuzzy inclusion dependency is merely a statement about two fuzzy relations that may be true or false. A valid FID describes the fact that a fuzzy projection of one fuzzy relation R forms a fuzzy subset of another fuzzy projection (of the same number of fuzzy attributes) of a fuzzy relation S. Note that FIDs are defined over sequences of attributes, not sets, since the order of attributes is important (FIDs are not invariant under permutation of the attributes of only one side), and the concept of fuzzy value equivalence is used to measure the equality of two fuzzy values or two fuzzy tuples.
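The subsethood computations above can be sketched as follows. The FVEQ pairings are supplied explicitly here as an assumed input (in the chapter they come from the Figure 4 membership functions), so the example only mirrors the counts used in the text.

```python
def subsethood(r_proj, s_proj, fveq_pairs):
    """f(R[X], S[Y]) = |{t in R[X] FVEQ to some t' in S[Y]}| / |R[X]|."""
    matches = lambda a: any((a, b) in fveq_pairs for b in s_proj)
    return sum(1 for a in r_proj if matches(a)) / len(r_proj)

staff_age = [".6/mid", ".9/old", ".1/old"]
emp_age = [".9/mid", ".9/mid", ".4/mid", ".9/mid", ".9/mid"]
# Assumed FVEQ mapping for illustration:
pairs = {(".6/mid", ".9/mid"), (".9/old", ".4/mid"), (".1/old", ".9/mid")}
print(subsethood(staff_age, emp_age, pairs))        # 1.0

faculty_salary = [".5/mod", ".8/low", ".4/high", ".3/high"]
emp_pay = [".7/mod", ".75/low", ".4/high", ".7/mod", ".7/mod"]
# .4/high in Faculty[Salary] has no FVEQ partner, as in the text:
pairs2 = {(".5/mod", ".7/mod"), (".8/low", ".75/low"), (".3/high", ".4/high")}
print(subsethood(faculty_salary, emp_pay, pairs2))  # 0.75
```

Note that equality of the grade/label strings is deliberately not treated as FVEQ: equivalence is mediated by the (differing) membership functions, which is why .4/high on both sides does not match.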
Definition: Let X and Y be sequences of k fuzzy attributes and ρ = R[X] ⊆ S[Y] be an FID. Then k is the arity of ρ, denoted by |ρ|, and ρ is called a k-ary FID. A similar definition holds for partial FIDs.
If a fuzzy value x is fuzzy value equivalent to a fuzzy value y, and the fuzzy value y is fuzzy value equivalent to a fuzzy value z, then x is fuzzy value equivalent to z.
Example: In Figure 4, the partial fuzzy inclusion dependency Faculty[Name, Salary] ⊆ Emp[Name, Pay] has arity 2; hence, it is said to be a binary fuzzy inclusion dependency FIDα=.75, whereas Staff[Age] ⊆ Emp[Age] has arity 1; hence, it is said to be a unary FID.
(X is fuzzy value equivalent to Y ∧ Y is fuzzy value equivalent to Z) ⇔ X is fuzzy value equivalent to Z.
Inference Rules for FIDs

Casanova et al. (1982) have provided some important insights into the IND problem. They have described a complete set of inference rules for INDs in the sense that repeated application of their rules will generate all valid INDs that can be derived from a given set of valid INDs (i.e., those rules form an "axiomatization" for INDs). Those rules will be redefined from the viewpoint of FIDs as given below.
(R[X] ⊆ S[Y] ⇔ ∀x ∈ X, y ∈ µ_Y(q) for some q ∈ Q, where Q = µ_X^{-1}(x), and
S[Y] ⊆ T[Z] ⇔ ∀y ∈ Y, z ∈ µ_T(q) for some q ∈ Q, where Q = µ_Y^{-1}(y))
⇔ (∀x ∈ X, z ∈ µ_T(q) for some q ∈ Q, where Q = µ_X^{-1}(x)).
Transitivity may not hold for FIDα. A k-ary FID naturally implies a set of unary FIDs: let ρ = R[X] ⊆ S[Y] be a k-ary FID, and let Σ1 be the set of unary FIDs R[x] ⊆ S[y] with x ∈ X and y ∈ Y. Then there exists a close relationship between ρ and Σ1, as formalized in the following corollary.

Corollary: Let Σ_k^f be the set of all possible k-ary FIDα between two given fuzzy relations R and S. Let Σ1_k^f be the fuzzy set whose elements are all k-sets of unary FIDα between R and S. Then, there is an isomorphism between Σ_k^f and Σ1_k^f. It is said that Σ1_k^f is implied by Σ_k^f.

This isomorphic mapping is possible since FIDα are invariant under permutations of their attribute pairs (so that there are exactly as many k-ary FIDα as there are k-subsets of unary FIDα), and each pair of single attributes in a k-ary FIDα ρ corresponds to one unary FIDα implied by ρ. Note that the isomorphism does not hold for valid FIDα, since the existence of k unary valid FIDα does not imply the existence of any higher arity valid FIDα (i.e., only the direction Σ_k^f ⇒ Σ1_k^f holds for valid FIDα, not the reverse). The validity of FIDα is preserved under projection and permutation by Axiom 2. To describe all fuzzy inclusion dependency information between two fuzzy relations, it is therefore not necessary to list all FIDα between them. Rather, a small set of FIDα from which all others can be derived is sufficient.
Definition (Generating Set): A generating set G(Σ^f) of a fuzzy set Σ^f of valid FIDα is a set of valid FIDα with the following properties:

1. ∀ρ ∈ Σ^f : G(Σ^f) ⊢ ρ, and
2. ∀ρ ∈ G(Σ^f) : ¬((G(Σ^f) − ρ) ⊢ ρ),

where the symbol − stands for fuzzy set difference and ⊢ denotes derivation. In other words, the generating set G(Σ^f) contains exactly those valid FIDα from which all valid FIDα in Σ^f can be derived. The set is not empty for any Σ^f since it can be constructed by first including all ρ ∈ Σ^f in G(Σ^f) and then removing every ρ for which Property 2 does not hold. The set is minimal since removing any FIDα ρ from a G(Σ^f) for which Property 2 holds would by definition violate Property 1. Therefore, generating sets contain all information about the fuzzy inclusion dependencies between fuzzy relations in a minimal number of FIDα.
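The non-emptiness argument above suggests a naive construction: start from all valid dependencies and discard any that the rest can derive. The sketch below restricts derivability to transitivity of full FIDs (the text notes transitivity may not hold for partial FIDα) and models an FID as a simple (left, right) pair; the names and toy data are assumptions for illustration.

```python
def derivable(fid, others):
    """True if fid = (x, z) follows by transitivity from the FIDs in others."""
    x, z = fid
    reach, frontier = {x}, [x]
    while frontier:
        node = frontier.pop()
        for (u, v) in others:
            if u == node and v not in reach:
                reach.add(v)
                frontier.append(v)
    return z in reach

def generating_set(fids):
    gen = set(fids)
    for fid in sorted(fids):          # deterministic removal order
        if derivable(fid, gen - {fid}):
            gen.discard(fid)
    return gen

fids = {("Staff[Age]", "Emp[Age]"), ("Emp[Age]", "Person[Age]"),
        ("Staff[Age]", "Person[Age]")}   # third follows by transitivity
print(sorted(generating_set(fids)))
```

Removing the transitively derivable dependency leaves the two-element generating set, matching Property 2: nothing left in the set can be derived from the rest.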
Discovery of Fuzzy Inclusion Dependencies

The problem may be stated as follows. Given a set of fuzzy relations R* = {R1, R2, ..., Rn} stored in one or more fuzzy relational database management systems, search for the generating set of fuzzy inclusion dependencies between any two relations in R*. Since all other FIDαs can be derived from the generating set by projection and permutation, it is sufficient to consider only this set.
A Three-Stage Solution to the Discovery of FIDαs

Fuzzy relational databases consist of fuzzy relations, which in turn consist of fuzzy attributes. In order to discover FIDαs in such fuzzy databases,
the approach requires that no fuzzy attribute occur more than once in any one fuzzy data object. The existence of such layers suggests a three-layered strategy for discovering relationships between fuzzy databases: compare fuzzy attributes, compare fuzzy relations, and, finally, compare fuzzy databases. It is clear that two fuzzy relations whose fuzzy attributes are not related cannot in turn be related, and likewise a relationship between fuzzy databases requires relationships between their fuzzy relations. To solve the above problem, the proposed algorithm consists of three necessary stages.

1. SEARCHn: Find valid FIDαs between a set of given fuzzy relations in a set of fuzzy databases (the general problem).
2. SEARCH2: Find valid FIDαs between a pair of given fuzzy relations.
3. VERIFY: Determine whether a given FIDα is valid.
A simple algorithm (i.e., pairwise comparison) for the first stage could express the general problem for n fuzzy relations as C(n, 2) problems on pairs of fuzzy relations. Improvements are possible, for example, by using the transitivity property of FIDαs. The second stage needs to find maximal valid FIDαs (i.e., a generating set for each pair of fuzzy relations considered) with a minimal number of individual FIDα verifications. The focus in this stage is not how to verify the validity of FIDαs against a fuzzy database state, but how to find a generating set of FIDαs with a minimal number of verifications. The third stage (verifying the validity of a particular FIDα), which has to be executed for every FIDα generated in Stage 2, involves querying one or two fuzzy database systems in order to determine a fuzzy inclusion between two sets of fuzzy attributes across two fuzzy relations. When the fuzzy relations exist in the same fuzzy database, they can be queried with an FSQL (fuzzy SQL) intersection query. If the fuzzy relations are in two different fuzzy databases, a single FSQL query is not sufficient; in that case, other techniques can be applied.
Overall, the stage with the highest complexity is that of finding valid FIDαs between two fuzzy relations. The greatest run-time improvements for a general algorithm are therefore expected at this stage, and it is the problem on which attention is focused here.
Comparing Two Fuzzy Databases

Looking for a generating set of FIDαs among a set of fuzzy relations means the output of the algorithm should contain only those FIDαs that cannot be derived from other FIDαs. In particular, this excludes FIDαs that can be obtained from other FIDαs by transitivity. A simple algorithm that gives a general solution to the problem is as follows.

Algorithm SEARCHn(FRDB1, FRDB2)
  FIDαs ← ∅
  for all R ∈ FRDB1
    for all S ∈ FRDB2
      if R ≠ S then
        FIDαs ← FIDαs ∪ SEARCH2(R, S)
  removeDerivableFIDαs(FIDαs)
If Algorithm SEARCH2 is assumed to run at unit cost, Algorithm SEARCHn runs in O(n²) in the number of relations in the database.
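Algorithm SEARCHn can be rendered directly in Python. The SEARCH2 and removeDerivableFIDαs stubs below are placeholders; the crisp subset test stands in for the fuzzy verification only to make the sketch runnable.

```python
from itertools import product

def searchn(frdb1, frdb2, search2, remove_derivable):
    """Pairwise SEARCH2 over all relation pairs: O(n^2) SEARCH2 calls."""
    fids = set()
    for r, s in product(frdb1, frdb2):
        if r != s:
            fids |= search2(r, s)
    return remove_derivable(fids)

# Toy stubs: search2 reports a pair when the first relation's extent is
# a subset of the second's (a crisp stand-in for the fuzzy test).
db = {"Staff": {1, 2}, "Emp": {1, 2, 3}, "Faculty": {9}}
s2 = lambda r, s: {(r, s)} if db[r] <= db[s] else set()
print(searchn(db, db, s2, lambda f: f))   # {('Staff', 'Emp')}
```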
Search for FIDαs Between Two Fuzzy Relations

A Simple Algorithm SEARCH2

It follows from Axiom 2 that a k-ary valid FIDα implies certain i-ary valid FIDαs for i < k. It can be observed that Axiom 2 does not change if only ordered sequences of integers are allowed to serve as indices for generated sub-FIDαs, due to the definition of equality of two FIDαs. Since only FIDαs between two given relations are concerned, Axiom 2 in the form of the above observation also states the only derivation rule that can be used to
derive new FIDαs as per the definition of derived FIDαs. Thus, the set of FIDαs obtained from an FIDα ρ0 through projection only is equivalent to the set {ρ : ρ0 ⊢ ρ}. An observation about the number of FIDαs that can be derived as projections (i.e., subsets of other FIDαs) can also be made in terms of the following lemma.

Lemma: A k-ary valid FIDα implies C(k, m) m-ary valid FIDαs for any 1 ≤ m ≤ k.
Assume a hypothetical FIDα-finding strategy that generates all possible FIDαs and then tests each FIDα, marking it as either valid or invalid. Such a strategy would define a data structure that holds each FIDα plus its membership grade indicating the subsethood, and a state from the set {unknown, valid, invalid} for each FIDα.
1. Verify high-arity FIDα candidates. If a valid k-ary FIDα ρ0 is found, mark all FIDαs derivable from ρ0 (i.e., all elements of the set {ρ : ρ0 ⊢ ρ}) as valid. Note that all these FIDαs have arities less than k.
2. Verify low-arity FIDα candidates. If an invalid k-ary FIDα ρ0 is found, mark as invalid all those FIDαs ρ from which ρ0 would be derivable if it were valid (i.e., all elements of the set {ρ : ρ ⊢ ρ0}).

This strategy would still require one to generate, explicitly or implicitly, all FIDαs, which is not feasible. However, it suggests Algorithm SEARCH2, which solves the FIDα search problem for two fuzzy relations R and S, each with k fuzzy attributes.

Algorithm: SEARCH2(R, S)
1. Let m = 1. Set a suitable value of α from the interval [0,1].
2. Generate all unary FIDαs and retain, in a fuzzy set Σ1^f of unary FIDα, only those FIDαs ρ that satisfy the validity condition µ_{Σ1^f}(ρ) ≥ α.
3. Increment m by 1.
4. If |Σ_{m−1}^f| < C(m, m−1), go to Step 7.
5. Identify the m-ary FIDαs whose implied (m−1)-ary FIDαs are all members of Σ_{m−1}^f.
6. Generate and test the validity condition µ_{Σ_m^f}(ρ) ≥ α for all m-ary FIDαs identified in Step 5, and retain only the valid m-ary FIDαs in a fuzzy set Σ_m^f of m-ary FIDαs.
7. Repeat from Step 3 until m = k.

Example: Consider the fuzzy relations Faculty and Emp as given in Table 4. It can be observed that each fuzzy relation has k = 3 attributes. Now, trace Algorithm SEARCH2 as follows.

Step 1: m = 1, α = .75.
Step 2: Generate all unary FIDαs and retain only the valid ones; for example,

ρ1 = Faculty[Name] ⊆ Emp[Name], µ_{Σ1^f}(ρ1) = |Faculty[Name] ∩ Emp[Name]| / |Faculty[Name]|.

Thus, the fuzzy set of all valid unary FIDαs is Σ1^f = {ρ1, .75/ρ9}.
Step 3: m = m + 1 = 1 + 1 = 2.
Step 4: |Σ1^f| = 2 is not less than C(2, 1) = 2, so do not go to Step 7.
Step 5: It can be observed that only ρ18 is accepted for generation.
Step 6: Verify the validity of ρ18. Let Faculty[Name, Salary] = A and Emp[Name, Pay] = B. Then the following apply.
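Step 5's candidate generation can be sketched as follows. Since FIDα are invariant under permutation of their attribute pairs, a candidate is modeled as a set of (R-attribute, S-attribute) pairs; the attribute pairings below are assumed for illustration.

```python
from itertools import combinations

def candidates(valid_prev, m):
    """m-ary candidates whose (m-1)-ary sub-FIDs are all in valid_prev."""
    pairs = sorted({p for fid in valid_prev for p in fid})
    out = []
    for combo in combinations(pairs, m):
        if all(frozenset(sub) in valid_prev
               for sub in combinations(combo, m - 1)):
            out.append(frozenset(combo))
    return out

# Assumed unary valid FIDα between Faculty and Emp:
sigma1 = {frozenset({("Name", "Name")}), frozenset({("Salary", "Pay")})}
print(candidates(sigma1, 2))
```

With both unary dependencies valid, the single binary candidate pairing Name with Name and Salary with Pay survives the pruning, mirroring how only ρ18 is accepted in the trace above.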
Derived Fuzzy Inclusion Dependency (FID): A valid FID ρ can be derived from a set Σ^f of valid FIDs, denoted by Σ^f ⊢ ρ, if ρ can be obtained by repeatedly applying the axioms on some set of FIDs taken from Σ^f. Similarly, a valid partial fuzzy inclusion dependency FIDα ρ can be derived from a fuzzy set Σ^f of valid FIDα, denoted by Σ^f ⊢ ρ, if ρ can be obtained by repeatedly applying the axioms on some set of FIDα taken from Σ^f. The membership function of the fuzzy set Σ_k^f may be given as follows:

µ_{Σ_k^f}(ρ) = |R[a_{i1}, a_{i2}, ..., a_{ik}] ∩ S[b_{j1}, b_{j2}, ..., b_{jk}]| / |R[a_{i1}, a_{i2}, ..., a_{ik}]| ≥ α,
where ρ = R[a_{i1}, a_{i2}, ..., a_{ik}] ⊆ S[b_{j1}, b_{j2}, ..., b_{jk}], k = 1, 2, ..., min(n, m), and n and m are the cardinalities of the sets of fuzzy attribute names belonging to fuzzy relations R and S, respectively. For example, a fuzzy set of valid FIDα=.6 of arity k may be given as

Σ_k^f = {.66/ρ1, .77/ρ2, ρ3, ...}.
Equality of Fuzzy Inclusion Dependencies (FIDs): Two FIDs R[a_{i1}, ..., a_{im}] ⊆ S[b_{i1}, ..., b_{im}] and R[c_{i1}, ..., c_{im}] ⊆ S[d_{i1}, ..., d_{im}] are equal iff there is a sequence (i1, ..., im) of distinct integers 1, ..., m such that (a_{i1} = c_{i1} ∧ b_{i1} = d_{i1}) ∧ ... ∧ (a_{im} = c_{im} ∧ b_{im} = d_{im}). A similar definition holds for the equality of partial fuzzy inclusion dependencies too.
Fuzzy Inclusion Dependency (FID): Let R[a1, a2, ..., an] and S[b1, b2, ..., bm] be (projections on) two fuzzy relations. Let X be a sequence of k distinct fuzzy attribute names from R, and Y be a sequence of k distinct fuzzy attribute names from S, with 1 ≤ k ≤ min(n, m). Then, a fuzzy inclusion dependency FID is an assertion of the form R[X] ⊆ S[Y], where all the fuzzy values under all the attribute names in R[X] are fuzzy value equivalent to some fuzzy values under the respective attribute names in S[Y]; however, the reverse may not hold.

Fuzzy Relational Database of Type 1: In Type 1 fuzzy relations, dom(Ai) may be a classical subset or a fuzzy subset of Ui. Let the membership function of dom(Ai) be denoted by µ_{Ai} for i = 1, 2, ..., n. Then, from the definition of the Cartesian product of fuzzy sets, dom(A1) × dom(A2) × ... × dom(An) is a fuzzy subset of U* = U1 × U2 × ... × Un. Hence, a Type 1 fuzzy relation r is also a fuzzy subset of U* with membership function µ_r.

Fuzzy Relational Database of Type 2: A Type 2 fuzzy relation r is a fuzzy subset of D, where µ_r : D → [0,1] must satisfy the condition
µ_r(t) ≤ max_{(u1, u2, ..., un) ∈ U*} min( µ_{a1}(u1), µ_{a2}(u2), ..., µ_{an}(un) ),
where t = (a1, a2, ..., an) ∈ D.

Fuzzy Value Equivalent (FVEQ): Let A and B be two fuzzy sets with membership functions µ_A and µ_B, respectively. A fuzzy value a ∈ A is said to be equivalent to some other fuzzy value b ∈ B iff b ∈ µ_B(x) for some x ∈ S, where S is the set of crisp values returned by µ_A^{-1}(a), and µ_A^{-1} is the inverse of the membership function of fuzzy set A.

Generating Set of FIDα: Consider a fuzzy set of valid partial fuzzy inclusion dependencies Σ^f = {v1/ρ1, v2/ρ2, ..., vn/ρn}. A generating set of Σ^f, denoted by G(Σ^f), is a set of valid FIDα with the following properties:
1. ∀ρ ∈ Σ^f : G(Σ^f) ⊢ ρ, and
2. ∀ρ ∈ G(Σ^f) : ¬((G(Σ^f) − ρ) ⊢ ρ),

where the symbol − stands for fuzzy set difference and ⊢ denotes derivation.
Partial Fuzzy Inclusion Dependency (FIDα): Let R[a1, a2, ..., an] and S[b1, b2, ..., bm] be (projections on) two fuzzy relations. Let X be a sequence of k distinct fuzzy attribute names from R, and Y be a sequence of k distinct fuzzy attribute names from S, with 1 ≤ k ≤ min(n, m). Then, a partial fuzzy inclusion dependency FIDα is an assertion of the form R[X] ⊆ S[Y] such that the fuzzy subsethood

f(R[X], S[Y]) = |R[X] ∩ S[Y]| / |R[X]| ≥ α,

where α is specified in the interval [0,1] and most of the fuzzy values under all the attribute names in R[X] are fuzzy value equivalent to some fuzzy values under the respective attribute names in S[Y]; however, the reverse may not hold.

Valid FID: An FIDα ρ = (R[a_{i1}, a_{i2}, ..., a_{ik}] ⊆ S[b_{i1}, b_{i2}, ..., b_{ik}]) is valid between two relations R = (r, (a1, a2, ..., an), ER) and S = (s, (b1, b2, ..., bm), ES) if the sets of fuzzy tuples in ER and ES satisfy the assertion given by ρ. Otherwise, FIDα is called invalid for R and S. In other words, FIDα is said to be valid if f(R[X], S[Y]) = |R[X] ∩ S[Y]| / |R[X]| ≥ α holds.
Chapter XXVII
A Distributed Algorithm for Mining Fuzzy Association Rules in Traditional Databases Wai-Ho Au Microsoft Corporation, USA
that it has very good size-up, speedup, and scale-up performance. We also evaluated the effectiveness of the proposed interestingness measure on two synthetic data sets. The experimental results show that it is very effective in differentiating between interesting and uninteresting associations.
INTRODUCTION

Of the many different kinds of patterns that can be discovered in a database, the mining of association rules has been studied extensively in the literature (see, e.g., J. Han & Kamber, 2001; Hand, Mannila, & Smyth, 2001). This is because the uncovering of the underlying association relationships (or simply associations) hidden in the data enables other important problems, such as classification (Au & Chan, 2001; Liu, Hsu, & Ma, 1998), to be tackled more effectively. The problem of discovering interesting associations in databases was originally defined over binary or Boolean data (Agrawal, Imielinski, & Swami, 1993). It was then extended to cover many real-life databases comprising both discrete- and continuous-valued data (Srikant & Agrawal, 1996). An association that is considered interesting is typically expressed in the form of a rule X → Y, where X and Y are conjunctions of conditions. A condition is either Ai = ai, where ai is a value in the domain of attribute Ai if Ai is discrete, or ai ∈ [li, ui], where li and ui are values in the domain of attribute Ai if Ai is continuous. The association rule X → Y holds with support, which is defined as the percentage of tuples satisfying X and Y, and confidence, which is defined as the percentage of tuples satisfying Y given that they also satisfy X. An example of an association rule is Gender = Female ∧ Age ∈ [20, 25] ∧ Income ∈ [15 000, 20 000] → Occupation = Cashier, which describes that a woman whose age is between 20 and 25 and whose income is between $15,000 and $20,000 is likely a cashier. To handle continuous attributes, many data mining algorithms (e.g., Liu et al., 1998; Srikant
& Agrawal, 1996) require their domains to be discretized into a finite number of intervals. These intervals may not be concise and meaningful enough for human experts to obtain comprehensive knowledge from the discovered association rules. Instead of using intervals, many researchers propose to employ fuzzy sets to represent the underlying relationships hidden in the data (Au & Chan, 2001, 2003; Carrasco, Vila, Galindo, & Cubero, 2000; Chan & Au, 1997, 2001; Delgado, Marín, Sánchez, & Vila, 2003; Hirota & Pedrycz, 1999; Hong, Kuo, & Chi, 1999; Kuok, Fu, & Wong, 1998; Maimon, Kandel, & Last, 1999; Yager, 1991; Zhang, 1999). The association rules involving fuzzy sets are commonly known as fuzzy association rules. An example of a fuzzy association rule is given in the following: Gender = Female ∧ Age = Young ∧ Income = Small → Occupation = Cashier, where Gender, Age, Income, and Occupation are linguistic variables, and Female, Young, Small, and Cashier are linguistic terms. This rule states that a young woman whose income is small is likely a cashier. In comparison to its counterpart involving discretized intervals, it is easier for human users to understand. The use of fuzzy sets also blurs the boundaries of the adjacent intervals. This makes fuzzy association rules resilient to the inherent noise present in the data, for instance, the inaccuracy in physical measurements of real-world entities. Many of the ensuing algorithms, including those in Au and Chan (2001, 2003), Chan and Au (1997, 2001), Delgado et al. (2003), Hong et al. (1999), Kuok et al. (1998), and Zhang (1999), are developed to make use of only a single processor or machine. They can be
further enhanced by taking advantage of the scalability of parallel or distributed computer systems. Because of the increasing ability to collect data and the resulting huge data volume, the exploitation of parallel or distributed systems becomes more and more important to the success of fuzzy association rule mining algorithms. Regardless of whether an algorithm is developed to discover (crisp) association rules (e.g., Agrawal et al., 1993; Agrawal & Shafer, 1996; Agrawal & Srikant, 1994; Cheung et al., 1996; E.-H. Han, Karypis, & Kumar, 1997; Mannila, Toivonen, & Verkamo, 1994; Park, Chen, & Yu, 1995a, 1995b; Savasere, Omiecinski, & Navathe, 1995; Shintani & Kitsuregawa, 1996; Srikant & Agrawal, 1996) or fuzzy association rules (e.g., Delgado et al., 2003; Hong et al., 1999; Kuok et al., 1998; Zhang, 1999), it typically employs the support-confidence framework to mine interesting associations from databases. Based on this framework, an association is considered interesting if it satisfies the minimum support and minimum confidence thresholds supplied by a user. While these algorithms can be effective in many tasks, deciding what the thresholds should be is often difficult. If the thresholds are not set properly, the discovered association rules can be quite misleading (J. Han & Kamber, 2001; Hand et al., 2001). For an algorithm to reveal association rules more effectively, an objective interestingness measure that does not require a lot of trial-and-error effort by the user is necessary. In this chapter, we propose a new algorithm, called distributed fuzzy association rule mining (DFARM), for mining fuzzy association rules from very large databases in a distributed environment. It embraces an objective interestingness measure, called the adjusted residual (Haberman, 1973).
Based on the concept of statistical residual analysis, it is defined as a function of the difference of the actual and the expected number of tuples characterized by different attributes and attribute values. We show how to apply this measure to fuzzy data here. By virtue of this measure, DFARM is able to differentiate between interesting and uninteresting associations without having a user supply
any thresholds. To the best of our knowledge, DFARM is the first distributed algorithm that utilizes an objective measure for mining interesting associations from fuzzy data without any user-specified thresholds. DFARM begins by dividing a database into several horizontal partitions and assigning them to different sites in a distributed system. It then has each site scan its own database partition to obtain the number of tuples characterized by different linguistic variables and linguistic terms (i.e., the local counts), and exchange the local counts with all the other sites to find the global counts. Based on the global counts, the adjusted residuals are computed and the sites can reveal interesting associations. By repeating this process of counting, exchanging counts, and calculating the adjusted residuals, DFARM unveils the underlying interesting associations hidden in the data. We made use of two synthetic data sets to evaluate the effectiveness of the interestingness measure in discriminating interesting associations from uninteresting ones. We also implemented DFARM in a distributed system using the parallel virtual machine (PVM; Geist, Beguelin, Dongarra, Jiang, Manchek, & Sunderam, 1994), and used a popular benchmark data set to evaluate its performance. The results show that DFARM has very good size-up, speedup, and scale-up performance.
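For the crisp case, a common form of Haberman's (1973) adjusted residual can be sketched as below; the fuzzy generalization used by DFARM replaces the cell counts with sums of membership degrees. The table values are invented for illustration, and the function name is not from the chapter.

```python
from math import sqrt

def adjusted_residual(obs, i, j):
    """Adjusted residual of cell (i, j) in a two-way contingency table."""
    n = sum(map(sum, obs))
    row = sum(obs[i])
    col = sum(r[j] for r in obs)
    e = row * col / n                     # expected count under independence
    var = e * (1 - row / n) * (1 - col / n)
    return (obs[i][j] - e) / sqrt(var)

# Toy table: rows = Age in {Young, Old}, cols = Occupation in {Cashier, Other}
table = [[30, 10], [10, 30]]
print(round(adjusted_residual(table, 0, 0), 2))   # 4.47
```

A large positive residual (here well above the ~1.96 level of a standard normal) flags the (Young, Cashier) cell as occurring far more often than independence would predict, which is the sense in which the measure needs no user-supplied threshold for support or confidence.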
RELATED WORK

The mining of association rules based on the support-confidence framework is defined as follows (Agrawal et al., 1993). Let I = {i1, …, im} be a set of binary attributes, called items, and T be a set of transactions. Each transaction t ∈ T is represented as a binary vector with t[k] = 1 if t contains item ik, and t[k] = 0 otherwise, for k = 1, …, m. A set of items is known as an item set. A k-item set is an item set consisting of k items. The support of an item set, X ⊂ I, is defined as the percentage of tuples containing X. The item set is frequent if its support is greater than or equal to the user-specified
minimum support. An association rule is defined as an implication of the form X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X → Y holds in T with support, which is defined as the percentage of tuples containing X and Y, and confidence, which is defined as the percentage of tuples containing Y given that they also contain X. An association rule is interesting if its support and confidence are greater than or equal to the user-supplied minimum support and minimum confidence, respectively. Since they are defined over binary data, these rules are usually referred to as Boolean association rules. The ensuing algorithms first find all frequent item sets in a database and then generate association rules from these frequent item sets. Since the former step consumes most of the computational resources, current research primarily focuses on speeding up the process of discovering frequent item sets (Agrawal et al., 1993; Agrawal & Srikant, 1994; Mannila et al., 1994; Park et al., 1995a; Savasere et al., 1995). Apriori (Agrawal & Srikant, 1994) is a well-known algorithm for mining Boolean association rules. First of all, it generates a set of candidate 1-item sets, C1. It then scans all the transactions to obtain the support counts of the item sets in C1. The item sets whose supports satisfy the minimum support become frequent 1-item sets, F1. After that, Apriori generates a set of candidate 2-item sets, C2, from F1. It then examines the item sets in C2. If any subset of an item set in C2 is not in F1, it eliminates the item set from C2. It subsequently scans all the transactions to find the support counts of the item sets in C2. The item sets whose supports satisfy the minimum support become frequent 2-item sets, F2. Apriori then continues to generate candidates and find frequent 3-item sets and so forth until no frequent item sets or candidate item sets are found. Different techniques, including those in Park et al. (1995a) and Savasere et al.
(1995), have been proposed to improve the efficiency of the algorithm. Unlike Apriori and its variants, a method called FP-growth (J. Han, Pei, & Yin, 2000) is proposed to mine frequent item sets without candidate generation. It first compresses the database into an FP-tree, but retains the item set association information at the same time. It then divides the FP-tree into a set of conditional databases, each of which is associated with one frequent item, and it mines each such database separately. The FP-growth method transforms the problem of finding long frequent item sets to looking for shorter ones recursively and then concatenating the suffix (J. Han et al.). It is shown in J. Han et al. that this method is about an order of magnitude faster than Apriori. In order to handle very large databases, the serial approaches to mining Boolean association rules (e.g., Agrawal et al., 1993; Agrawal & Srikant, 1994; Mannila et al., 1994; Park et al., 1995a; Savasere et al., 1995) have been extended to take advantage of the scalability of parallel or distributed systems. Three algorithms, namely, count distribution, data distribution, and candidate distribution, which adapt Apriori to the distributed-memory architecture, are proposed in Agrawal and Shafer (1996). These algorithms divide a database into several horizontal partitions and assign them to different processors. In count distribution, every processor runs Apriori over its database partition with a modification in which it (a) exchanges the local support counts of candidate item sets in its database partition with all the other processors to find the global support counts in the entire database and (b) identifies frequent item sets based on the global support counts at each of the iterations. Data distribution partitions candidate item sets and assigns them to different processors in a round-robin fashion. At each of the iterations, every processor broadcasts its database partition to all the other processors to find the global support counts of its candidate item sets. Candidate distribution starts the data mining process by utilizing either count distribution or data distribution.
At certain iterations, it divides the candidate item sets into several disjoint subsets and assigns different subsets to different processors. At the same time, the database is repartitioned in such a way that each processor can find the (global) support counts of
its candidate item sets in its database partition independent of other processors. To achieve this, parts of the database may be replicated on several processors. Each processor then generates candidate item sets and counts the supports of these candidate item sets independently at subsequent iterations. In addition to these three algorithms, other parallel algorithms based on Apriori are also proposed in the literature (Cheung et al., 1996a; E. H. Han et al., 1997; Park et al., 1995b; Shintani & Kitsuregawa, 1996). Regardless of whether an algorithm is serial or parallel, it determines if an association is interesting by means of the user-specified minimum support and minimum confidence thresholds. A weakness is that many users do not have any idea what the thresholds should be. If they are set too high, a user may miss some useful rules; if they are set too low, the user may be overwhelmed by many irrelevant ones (J. Han & Kamber, 2001; Hand et al., 2001). The techniques for mining Boolean association rules have been extended to take discrete- and continuous-valued data into consideration. Association rules involving discrete and continuous attributes are known as quantitative association rules (Srikant & Agrawal, 1996). To handle continuous attributes, their domains are discretized into a finite number of intervals. The discretization can be performed as a part of the algorithms (e.g., Srikant & Agrawal) or as a preprocessing step before data mining (e.g., Liu et al., 1998). Both discrete and continuous attributes are handled in a uniform fashion as a set of ⟨attribute, value⟩ pairs by mapping the values of discrete attributes to a set of consecutive integers and by mapping the discretized intervals of continuous attributes to consecutive integers that preserve the order of the intervals (Srikant & Agrawal). Instead of having just one field for each attribute, this encoding needs as many fields as the number of different attribute values.
For example, the value of a Boolean field corresponding to ⟨attribute1, value1⟩ would be 1 if attribute1 has value1 in the original record, and 0 otherwise (Srikant & Agrawal). After the mappings, both the serial and parallel algorithms for
mining Boolean association rules, such as those in Agrawal et al. (1993), Agrawal and Shafer (1996), Agrawal and Srikant (1994), Cheung et al. (1996), E. H. Han et al. (1997), J. Han et al. (2000), Mannila et al. (1994), Park et al. (1995a, 1995b), Savasere et al. (1995), and Shintani and Kitsuregawa (1996), can be applied to the encoded data. Regardless of how the domains of continuous attributes are discretized, the intervals may not be concise and meaningful enough for human users to easily obtain nontrivial knowledge from the discovered patterns. To better handle continuous data, the use of fuzzy sets in the mining of association rules has recently been proposed in the literature (Au & Chan, 2001, 2003; Chan & Au, 1997, 2001; Delgado et al., 2003; Hong et al., 1999; Kuok et al., 1998; Zhang, 1999). These algorithms typically use a fuzzy partitioning methodology to generate fuzzy sets representing the domain of each continuous attribute as a preprocessing step. The fuzzy sets can be supplied by domain experts or generated automatically from the data by fuzzy partitioning approaches, such as in Au, Chan, and Wong (2006). The algorithms in Delgado et al. (2003), Hong et al. (1999), Kuok et al. (1998), and Zhang (1999) mine interesting associations from the fuzzified data by adopting the support-confidence framework. Like their contemporary algorithms for mining (crisp) association rules, they encounter the same problem: it is usually difficult to determine what the user-specified minimum support and minimum confidence ought to be. Some meaningful relationships may not be found if they are set too high, whereas some misleading relationships may be revealed if they are set too low (J. Han & Kamber, 2001; Hand et al., 2001).
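The uniform ⟨attribute, value⟩ encoding described above can be sketched as follows (the attribute names and interval boundaries are invented for illustration):

```python
def build_mapping(discrete_values, continuous_cuts):
    """discrete_values: {attr: [v1, v2, ...]}; continuous_cuts: {attr: [c1, c2, ...]}.
    Returns {(attr, value_or_interval_index): code}, assigning consecutive integers
    and preserving the order of the discretized intervals."""
    mapping, code = {}, 0
    for attr, values in discrete_values.items():
        for v in values:                  # each discrete value gets its own integer
            mapping[(attr, v)] = code
            code += 1
    for attr, cuts in continuous_cuts.items():
        for i in range(len(cuts) + 1):    # interval indices map to consecutive
            mapping[(attr, i)] = code     # integers, so interval order is preserved
            code += 1
    return mapping

def interval_index(x, cuts):
    """Index of the discretized interval containing x."""
    return sum(1 for c in cuts if x > c)

m = build_mapping({'Marital Status': ['Unmarried', 'Married']}, {'Age': [30, 50]})
# Encode one record: each (attribute, value) pair becomes one integer item
record = {'Marital Status': 'Married', 'Age': 42}
items = [m[('Marital Status', record['Marital Status'])],
         m[('Age', interval_index(record['Age'], [30, 50]))]]
print(items)   # → [1, 3]
```

After this step, any Boolean association rule miner can treat the integer items as ordinary market-basket data.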
IDENTIFYING INTERESTING ASSOCIATIONS

In the following subsections, we present (a) the definition of linguistic variables and linguistic terms, (b) an objective measure for identifying interesting
associations between different linguistic terms, and (c) the formation of fuzzy rules to represent the interesting associations and how to represent the uncertainty associated with the rules.
Linguistic Variables and Linguistic Terms

Given a database relation D, each tuple t in D is composed of a set of attributes A = {A1, …, An}, where A1, …, An can be continuous or discrete. For any tuple t ∈ D, t[Ai] denotes the value ai in t for attribute Ai. Let L = {L1, …, Ln} be a set of linguistic variables such that Li represents Ai. For any continuous attribute Ai, let dom(Ai) = [li, ui] ⊆ ℜ denote the domain of the attribute. Ai is represented by a linguistic variable Li whose value is one of the linguistic terms in T(Li) = {lij | j = 1, …, si}, where si denotes the number of linguistic terms and lij is a linguistic term characterized by a fuzzy set Fij that is defined on dom(Ai) and whose membership function µFij is such that

µFij : dom(Ai) → [0, 1].  (1)

The fuzzy sets Fij, j = 1, …, si, are represented by

Fij = ∫_{dom(Ai)} µFij(ai) / ai,  (2)
where ai ∈ dom(Ai). For any discrete attribute Ai, let dom(Ai) = {ai1, …, aimi} denote the domain of Ai. Ai is represented by a linguistic variable Li whose value is one of the linguistic terms in T(Li) = {lij | j = 1, …, mi}, where lij is a linguistic term characterized by a fuzzy set Fij such that

Fij = Σ_{dom(Ai)} µFij(ai) / ai,  (3)
where ai ∈ dom(Ai). Regardless of whether Ai is discrete or continuous, the degree of compatibility of ai ∈ dom(Ai) with linguistic term lij is given by µFij(ai). In addition to handling discrete and continuous attributes in a uniform fashion, the use of linguistic variables to represent discrete attributes allows the fuzzy nature of real-world entities to be easily captured. For example, it may be difficult to discriminate the color orange from the color red. It is for this reason that an object that is orange in color may be perceived as red to a certain extent. Such kinds of fuzziness in the linguistic variable Color can be represented by the linguistic terms Red and Orange. Based on these linguistic terms, the color of an object can be compatible with the term Red to a degree of 0.7 and with the term Orange to a degree of 0.3. Interested readers are referred to Mendel (1995) and Yen and Langari (1999) for the details of linguistic variables, linguistic terms, fuzzy sets, and membership functions.

Using the above technique, the original attributes in A = {Ai | i = 1, …, n} are represented by the linguistic variables in L = {Li | i = 1, …, n}. These linguistic variables are associated with a set of linguistic terms, l = {lij | i = 1, …, n, j = 1, …, si}. These linguistic terms are, in turn, characterized by a set of fuzzy sets, F = {Fij | i = 1, …, n, j = 1, …, si}. Given a tuple t ∈ D and a linguistic term lij ∈ l, which is characterized by a fuzzy set Fij ∈ F, the degree of membership of the values in t with respect to Fij is given by µFij(t[Ai]). The degree to which t is characterized by lij, λlij(t), is defined as

λlij(t) = µFij(t[Ai]).  (4)
For example, given a linguistic variable Height and a linguistic term Tall, we have λTall(t) = µTall(t[Height]). If λlij (t ) = 1 , t is completely characterized by the linguistic term lij. If λlij (t ) = 0 , t is undoubtedly not characterized by the linguistic term lij. If 0 < λlij (t ) < 1 , t is partially characterized by the
linguistic term lij. In the case where t[Ai] is unknown, λlij(t) = 0.5, which indicates that there is no information available concerning whether t is or is not characterized by the linguistic term lij. t can also be characterized by more than one linguistic term. Let ϕ be a subset of integers such that ϕ = {i1, …, ih}, where ϕ ⊆ {1, …, n} and |ϕ| = h ≥ 1. We also suppose that Aϕ is a subset of A such that Aϕ = {Ai | i ∈ ϕ}. Any such Aϕ is associated with a set of linguistic terms, T(Lϕ) = {lϕj | j = 1, …, sϕ = Π_{i∈ϕ} si}, where sϕ denotes the number of linguistic terms and lϕj is represented by a fuzzy set Fϕj such that Fϕj = Fi1j1 ∩ … ∩ Fihjh, ik ∈ ϕ, jk ∈ {1, …, sik}. The degree to which t is characterized by the term lϕj, λlϕj(t), is given by

λlϕj(t) = min(µFi1j1(t[Ai1]), …, µFihjh(t[Aih])).  (5)
For instance, given linguistic variables Height and Weight, and linguistic terms Tall and Heavy, we have λTall∧Heavy(t) = min(µTall(t[Height]), µHeavy(t[Weight])). In fact, other t-norms (e.g., the multiplication operation) can also be used in the calculation of Equation 5. We use the minimum operation here because it is one of the most popular t-norms used in the literature (see, e.g., Mendel, 1995; Yen & Langari, 1999). Based on the linguistic variables and linguistic terms, we can use DFARM to discover fuzzy association rules that are represented in a manner that is more natural for human users to understand when compared to their crisp counterparts.
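Equations 4 and 5 can be illustrated with a short sketch; the trapezoidal membership functions for Tall and Heavy below are invented for illustration:

```python
def trapezoid(a, b, c, d):
    """Membership function that rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    def mu(x):
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)
    return mu

# Hypothetical linguistic terms Tall (on Height, in cm) and Heavy (on Weight, in kg)
mu_tall = trapezoid(170, 185, 210, 230)
mu_heavy = trapezoid(80, 95, 150, 170)

t = {'Height': 178, 'Weight': 90}
lam_tall = mu_tall(t['Height'])                   # Equation 4
lam_tall_and_heavy = min(mu_tall(t['Height']),    # Equation 5: min t-norm
                         mu_heavy(t['Weight']))
print(round(lam_tall, 4), round(lam_tall_and_heavy, 4))   # → 0.5333 0.5333
```

Swapping `min` for multiplication would give the product t-norm mentioned above.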
An Objective Interestingness Measure

We define the fuzzy support of a linguistic term lϕk, denoted fsup(lϕk), as

fsup(lϕk) = Σ_{t∈D} λlϕk(t) / Σ_{t∈D} Σ_{j=1..sϕ} λlϕj(t),  (6)
where the numerator and the denominator are the cardinality of lϕk and that of all the linguistic terms defined on the same domain, respectively. Since Σ_{k=1..sϕ} fsup(lϕk) = 1, fsup(lϕk) can be considered the probability that a tuple is characterized by linguistic term lϕk.

In the rest of this chapter, the association between a linguistic term lϕk and another linguistic term lpq is expressed as lϕk → lpq, for example, Cheap ∧ Light → Best Seller, where Cheap, Light, and Best Seller are linguistic terms. We define the fuzzy support of the association lϕk → lpq, fsup(lϕk → lpq), as

fsup(lϕk → lpq) = Σ_{t∈D} min(λlϕk(t), λlpq(t)) / Σ_{t∈D} Σ_{j=1..sϕ} Σ_{u=1..sp} min(λlϕj(t), λlpu(t)),  (7)
where the numerator is the cardinality of Fϕk ∩ Fpq while the denominator is the cardinality of all the possible combinations of fuzzy sets defined on the same domains. Similarly, Σ_{j=1..sϕ} Σ_{u=1..sp} fsup(lϕj → lpu) = 1, and hence fsup(lϕk → lpq) can be considered the probability that a tuple is characterized by both lϕk and lpq. Other t-norms, such as the multiplication operation, are also applicable to the calculation of the fuzzy support. We use the minimum operation in Equation 7 so as to be consistent with Equation 5. We, in turn, define the fuzzy confidence of the association lϕk → lpq, fconf(lϕk → lpq), as

fconf(lϕk → lpq) = fsup(lϕk → lpq) / fsup(lϕk).  (8)
Intuitively, fconf(lϕk → lpq) can be considered the probability that a tuple is characterized by lpq given that it is also characterized by lϕk. To decide whether the association lϕk → lpq is interesting, we determine whether the difference between fconf(lϕk → lpq) and fsup(lpq) is significant.
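A minimal sketch of the fuzzy support and confidence computations (Equations 6–8); the tuples, term names, and membership degrees below are toy values invented for illustration:

```python
def fsup(term, terms_same_var, degrees):
    """Equation 6: fuzzy support of a linguistic term."""
    num = sum(d[term] for d in degrees)
    den = sum(d[l] for d in degrees for l in terms_same_var)
    return num / den

def fsup_assoc(a, b, terms_a, terms_b, degrees):
    """Equation 7: fuzzy support of the association a -> b (min t-norm)."""
    num = sum(min(d[a], d[b]) for d in degrees)
    den = sum(min(d[x], d[y]) for d in degrees for x in terms_a for y in terms_b)
    return num / den

def fconf(a, b, terms_a, terms_b, degrees):
    """Equation 8: fuzzy confidence of the association a -> b."""
    return fsup_assoc(a, b, terms_a, terms_b, degrees) / fsup(a, terms_a, degrees)

# Two tuples; linguistic variables Price = {Cheap, Expensive}, Sales = {Low, High};
# degrees[t][term] holds the degree lambda_term(t)
degrees = [
    {'Cheap': 0.8, 'Expensive': 0.2, 'Low': 0.1, 'High': 0.9},
    {'Cheap': 0.3, 'Expensive': 0.7, 'Low': 0.6, 'High': 0.4},
]
c = fconf('Cheap', 'High', ['Cheap', 'Expensive'], ['Low', 'High'], degrees)
print(round(c, 4))   # → 0.7143
```

Both quantities are normalized, so they can be read as probabilities of fuzzy events, as the text notes.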
The significance of the difference can be objectively evaluated using the adjusted residual (Haberman, 1973). It is defined in terms of the fuzzy support and the fuzzy confidence (Au & Chan, 2001, 2003; Chan & Au, 1997, 2001) and reflects the difference between the actual and the expected degree to which a tuple is characterized by different linguistic terms. The adjusted residual d(lϕk → lpq) is defined in Haberman (1973) as

d(lϕk → lpq) = z(lϕk → lpq) / √γ(lϕk → lpq),  (9)
where z(lϕk → lpq) is the standardized residual, which is defined as
s
z (l k → l pq ) =
sp
fsup(l k → l pq ) × ∑∑∑ min( t∈D j =1 u =1
l
j
(t ),
l pu
(t )) − e(l k → l pq )
e(l k → l pq )
.
(10)
e(lϕk → lpq) is the expected degree to which a tuple is characterized by lϕk and lpq and is calculated by

e(lϕk → lpq) = fsup(lϕk) × fsup(lpq) × Σ_{t∈D} Σ_{j=1..sϕ} Σ_{u=1..sp} min(λlϕj(t), λlpu(t)),  (11)
and γ(lϕk → lpq) is the maximum likelihood estimate of the variance of z(lϕk → lpq) and is given by

γ(lϕk → lpq) = (1 − fsup(lϕk))(1 − fsup(lpq)).  (12)
The measure defined by Equation 9 can be used as an objective interestingness measure because it does not depend on any user’s subjective inputs. Since d(lϕk → lpq) is in an approximate standard normal distribution, if |d(lϕk → lpq)| > 1.96 (the 95th percentile of the standard normal distribution), we conclude that the difference between fconf(lϕk → lpq) and fsup(lpq) is significant at the 5% significance level. Specifically, if d(lϕk → lpq) > 1.96, the presence of lϕk implies the presence of lpq. In other words, whenever lϕk is found in a
tuple, the probability that lpq is also found in the same tuple is expected to be significantly higher than when lϕk is not found. We say that the association lϕk → lpq is positive. On the other hand, if d(lϕk → lpq) < –1.96, the presence of lϕk implies the absence of lpq. In other words, whenever lϕk is found in a tuple, the probability that lpq is also found in the same tuple is expected to be significantly lower than when lϕk is not found. We say that the association lϕk → lpq is negative. It is important to note that d(lϕk → lpq) is defined in terms of the fuzzy support and the fuzzy confidence of the linguistic terms. They are, in turn, defined in terms of the probabilities of the corresponding fuzzy events. As a result, the conclusion that the adjusted residual is in an approximate standard normal distribution (Haberman, 1973) is still valid.
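The interestingness test can be sketched as follows, assuming the adjusted residual is the standardized residual divided by the square root of its estimated variance (Equations 9–12); `total` stands for the triple sum of min-degrees appearing in Equations 10 and 11, and the numbers below are toy values:

```python
import math

def adjusted_residual(fsup_a, fsup_b, fsup_ab, total):
    """Equations 9-12: d = z / sqrt(gamma), with total the sum of min-degrees
    over all tuples and all term combinations (the normalizing count)."""
    e = fsup_a * fsup_b * total                   # Equation 11: expected count
    z = (fsup_ab * total - e) / math.sqrt(e)      # Equation 10: standardized residual
    gamma = (1 - fsup_a) * (1 - fsup_b)           # Equation 12: variance estimate
    return z / math.sqrt(gamma)                   # Equation 9: adjusted residual

# Toy numbers: the association occurs more often than independence predicts
d = adjusted_residual(fsup_a=0.3, fsup_b=0.4, fsup_ab=0.25, total=1000)
print(d > 1.96)   # interesting (positive) at the 5% significance level → True
```

When fsup_ab equals fsup_a × fsup_b exactly, the residual is 0 and the association is not interesting.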
Formation of Fuzzy Association Rules

In the context of rule mining, the number of conditions in the antecedent of a rule is often referred to as its order (Smyth & Goodman, 1992; Wong & Wang, 1997). A first-order fuzzy association rule involves one linguistic term in its antecedent, a second-order rule involves two, a third-order rule involves three, and so on for higher orders. Given that lϕk → lpq is interesting, we can form the following fuzzy association rule:

lϕk → lpq [w(lϕk → lpq)],
where w(lϕk → lpq) is the weight-of-evidence measure (Osteyee & Good, 1974), which is a confidence measure that represents the uncertainty associated with lϕk → lpq. This measure is defined as follows. Since the relationship between lϕk and lpq is interesting, there is some evidence for a tuple
to be characterized by lpq given that it has lϕk. The weight of evidence is defined in terms of an information-theoretic measure known as mutual information (see, e.g., MacKay, 2003). The mutual information measures the change of uncertainty about the presence of lpq in a tuple given that it has lϕk. It is defined as

I(lpq : lϕk) = log [fconf(lϕk → lpq) / fsup(lpq)].  (13)
Based on mutual information, the weight of evidence is defined in Osteyee and Good (1974) as

w(lϕk → lpq) = I(lpq : lϕk) − I(∨_{j≠q} lpj : lϕk) = log { [fsup(lϕk → lpq) / fsup(lpq)] / [Σ_{j≠q} fsup(lϕk → lpj) / Σ_{j≠q} fsup(lpj)] }.  (14)
w(lϕk → lpq) can be interpreted intuitively as a measure of the difference in the gain in information when a tuple that is characterized by lϕk is also characterized by lpq as opposed to being characterized by other linguistic terms. Since lϕk is defined by a set of linguistic terms li1j1, …, lihjh ∈ l, we have a high-order fuzzy association rule

Li1 = li1j1 ∧ … ∧ Lih = lihjh → Lp = lpq [w(lϕk → lpq)],
where i1, …, ih ∈ ϕ.
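Equations 13 and 14 can be sketched as follows (toy support values; the `*_notb` arguments are assumptions standing in for the sums over the other terms lpj, j ≠ q, of the same linguistic variable):

```python
import math

def mutual_info(fconf_ab, fsup_b):
    """Equation 13: change of uncertainty about b given a."""
    return math.log(fconf_ab / fsup_b)

def weight_of_evidence(fsup_ab, fsup_b, fsup_a_notb, fsup_notb):
    """Equation 14: w(a -> b) = I(b : a) - I(not-b : a); fsup_a_notb and
    fsup_notb aggregate the terms l_pj, j != q, of the same variable."""
    return math.log((fsup_ab / fsup_b) / (fsup_a_notb / fsup_notb))

# Toy numbers: most of a's mass co-occurs with b rather than with the other terms
w = weight_of_evidence(fsup_ab=0.25, fsup_b=0.4, fsup_a_notb=0.05, fsup_notb=0.6)
print(round(w, 4))   # → 2.0149
```

A positive weight of evidence favors lpq over the other terms of the same variable; zero means no preference.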
DISTRIBUTED MINING OF FUZZY ASSOCIATION RULES

In this section, we present DFARM, a distributed algorithm for discovering fuzzy association rules. It extends count distribution (Agrawal & Shafer, 1996) to (a) handle fuzzy data and (b) utilize the objective interestingness measure presented in the previous section to differentiate interesting and uninteresting associations. To handle the large number of combinations of linguistic terms, it embraces a heuristic that the association lϕ′k → lpq, where ϕ′ = ϕ1 ∪ ϕ2, is more likely to be interesting if both the associations lϕ1k → lpq and lϕ2k → lpq are interesting than when only one or neither of them is interesting. Using this heuristic, DFARM evaluates the interestingness of only the associations between different combinations of conditions in lower order association rules. The details of DFARM are given in the following.

In a distributed system comprising m sites S1, …, Sm, the database relation D is horizontally partitioned over the m sites into D1, …, Dm. Let the number of tuples in database partition Dj be Nj, j = 1, …, m. The fuzzy support count of linguistic term lϕk, where ϕ ⊆ {1, …, n} and |ϕ| = h ≥ 1, in D is given by

count(lϕk) = Σ_{t∈D} λlϕk(t),  (15)

where λlϕk(t) is the degree to which t is characterized by lϕk, defined by Equation 4. Similarly, the fuzzy support count of lϕk in Dj is calculated by

count_j(lϕk) = Σ_{t∈Dj} λlϕk(t).  (16)

It is obvious to note that

count(lϕk) = Σ_{j=1..m} count_j(lϕk).  (17)
We refer to count(lϕk) as the global fuzzy support count of lϕk, and count_j(lϕk) as the local fuzzy support count of lϕk at site Sj. Let us consider an hth-order association, lϕk → lpq. The fuzzy support count of lϕk → lpq in D is given by

count(lϕk → lpq) = Σ_{t∈D} min(λlϕk(t), λlpq(t)),  (18)
whereas the fuzzy support count of lϕk → lpq in Dj is calculated by

count_j(lϕk → lpq) = Σ_{t∈Dj} min(λlϕk(t), λlpq(t)).  (19)
Again, it is obvious to note that

count(lϕk → lpq) = Σ_{j=1..m} count_j(lϕk → lpq).  (20)

We refer to count(lϕk → lpq) as the global fuzzy support count of lϕk → lpq, and count_j(lϕk → lpq) as the local fuzzy support count of lϕk → lpq at site Sj. The fuzzy support of the linguistic term lϕk and that of the association lϕk → lpq in terms of fuzzy support counts are given by

fsup(lϕk) = count(lϕk) / Σ_{j=1..sϕ} count(lϕj)  (21)

and

fsup(lϕk → lpq) = count(lϕk → lpq) / Σ_{j=1..sϕ} Σ_{u=1..sp} count(lϕj → lpu),  (22)

respectively. Based on Equations 21 and 22, we can calculate d(lϕk → lpq), defined by Equation 9, to determine whether the association lϕk → lpq is or is not interesting.

To mine fuzzy association rules, each site in the distributed system runs DFARM. Each site scans its database partition in each pass. At the hth iteration, each site Sj generates the candidate hth-order rules from the (h − 1)th-order rules. Site Sj then scans its database partition Dj to obtain the local fuzzy support counts of all the candidate hth-order rules. After that, site Sj exchanges the local fuzzy support counts with all the other sites to find the global fuzzy support counts. Subsequently, each site Sj evaluates the interestingness of the candidate hth-order rules to obtain the interesting ones (i.e., the hth-order rules). Site Sj then generates the candidate (h + 1)th-order rules from the hth-order rules, and this process repeats. The algorithm terminates when neither an hth-order rule nor a candidate (h + 1)th-order rule is found. Figure 1 shows this algorithm.

Figure 1. The DFARM algorithm

/* Rh consists of hth-order rules */
if (h = 1) then {
    forall (lik, lpq ∈ l, i ≠ p) {
        scan Dj to find countj(lik), countj(lpq), and countj(lik → lpq);
    }
    exchange countj(lik), countj(lpq), and countj(lik → lpq) with all the other sites
        to calculate count(lik), count(lpq), and count(lik → lpq);
    R1 = {lik → lpq [w(lik → lpq)] | i ≠ p and d(lik → lpq) > 1.96};
} else {
    C = {each linguistic term in the antecedent of r | r ∈ Rh−1};
    forall (lϕk comprising h linguistic terms in C) {
        forall (lpq, q = 1, …, sp) {
            scan Dj to find countj(lϕk), countj(lpq), and countj(lϕk → lpq);
        }
    }
    exchange countj(lϕk), countj(lpq), and countj(lϕk → lpq) with all the other sites
        to calculate count(lϕk), count(lpq), and count(lϕk → lpq);
    Rh = {lϕk → lpq [w(lϕk → lpq)] | d(lϕk → lpq) > 1.96};
}

Since each site in the distributed system exchanges its local fuzzy counts with all the other sites to calculate the global fuzzy counts, the (h − 1)th-order rules, and hence the candidate hth-order rules generated from them, found at different sites are identical for all h. After the termination of DFARM, each site therefore discovers an identical set of fuzzy association rules.

As an example, given a database comprising three attributes, Age, Marital Status, and Wage, let dom(Age) be represented by the linguistic terms Young, Middle Aged, and Old; dom(Marital Status) by Unmarried and Married; and dom(Wage) by Low and High. If DFARM finds the first-order rules Young → Low, High → Middle Aged, and Married → High interesting, it obtains R1 = {Young → Low, High → Middle Aged, Married → High} and C = {Young, High, Married}. It then computes the adjusted residuals of the candidate second-order rules (i.e., Young ∧ High → Unmarried, Young ∧ High → Married, Young ∧ Married → Low, Young ∧ Married → High, High ∧ Married → Young, High ∧ Married → Middle Aged, and High ∧ Married → Old) to determine whether they are interesting. It continues to find the candidate higher order rules and calculates their interestingness until neither an interesting rule nor a candidate rule is found.
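The local/global count exchange of Equations 15–17 can be simulated in a single process (the partitioning, tuples, and membership function below are invented for illustration; in the real system each site computes its local counts and exchanges them with the other sites, e.g., over PVM):

```python
def local_count(partition, term, mu):
    """Equation 16: local fuzzy support count of a term at one site."""
    return sum(mu[term](t) for t in partition)

def global_count(partitions, term, mu):
    """Equation 17: sum of the local counts exchanged between the m sites."""
    return sum(local_count(p, term, mu) for p in partitions)

# Hypothetical setup: two sites, tuples are plain numbers on [0, 10],
# and one linguistic term 'Large' with a ramp membership function
mu = {'Large': lambda x: min(max((x - 5) / 5, 0.0), 1.0)}
partitions = [[2.0, 7.5, 10.0], [5.0, 6.0]]   # D1 and D2
print(global_count(partitions, 'Large', mu))
```

Because only these per-term counts cross the network, the communication cost is independent of the partition sizes, which is the property exploited in the experiments below.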
EXPERIMENTAL RESULTS

To evaluate the effectiveness of the proposed interestingness measure, as given by Equation 9, we applied it to two synthetic data sets. Furthermore, we implemented DFARM in a distributed system using PVM (Geist et al., 1994). To perform our experiments, we used a 100 Mb LAN (local area
network) to connect 10 Sun Ultra 5 workstations, each of which has 64 MB of main memory running Solaris 2.5.1. Each workstation has a local drive and its database partition was loaded on its local drive before each experiment started. We used a popular benchmark data set to evaluate the computation performance of DFARM.
An Evaluation of Effectiveness

In this subsection, we used two synthetic data sets to evaluate the effectiveness of the proposed interestingness measure. One of the data sets is composed of only uniform random data, whereas the other is generated in a way that inherent relationships are present in the data.
Testing with Uniform Random Data

In our first experiment, we used a set of uniform random data to evaluate the ability of the proposed interestingness measure to handle noise when the data contain no organization or structure. The data set consists of 2,000 pairs of values generated randomly and independently for two attributes, X and Y, in the range between 0 and 10. Since the data are generated randomly, no interesting associations ought to be discovered. Figure 2 shows the data. From the data set, we employed the fuzzy partitioning approach proposed in Au et al. (2006) to generate five and two fuzzy sets for X and Y, respectively. The membership functions of the fuzzy sets representing the domain of X are given in Equation 23, and those representing the domain of Y are given in Equation 24. The interestingness of the associations in the data is given in Table 1. Since the absolute values of the adjusted residuals of all the associations are less than 1.96 (the 95th percentile of the standard normal distribution), we conclude that all of them are uninteresting at the 5% significance level. This illustrates the ability of our interestingness measure to reject random noise present in the data.
Equation 23:

µVery Small(x) = 1 if x ≤ 1.97; (2.01 − x)/0.04 if 1.97 < x ≤ 2.01; 0 otherwise
µSmall(x) = (x − 1.97)/0.04 if 1.97 < x ≤ 2.01; 1 if 2.01 < x ≤ 3.99; (4.02 − x)/0.03 if 3.99 < x ≤ 4.02; 0 otherwise
µMedium(x) = (x − 3.99)/0.03 if 3.99 < x ≤ 4.02; 1 if 4.02 < x ≤ 5.99; (6.01 − x)/0.02 if 5.99 < x ≤ 6.01; 0 otherwise
µLarge(x) = (x − 5.99)/0.02 if 5.99 < x ≤ 6.01; 1 if 6.01 < x ≤ 8.00; (8.01 − x)/0.01 if 8.00 < x ≤ 8.01; 0 otherwise
µVery Large(x) = (x − 8.00)/0.01 if 8.00 < x ≤ 8.01; 1 if x > 8.01; 0 otherwise

Equation 24:

µSmall(y) = 1 if y ≤ 4.99; (5.02 − y)/0.03 if 4.99 < y ≤ 5.02; 0 otherwise
µLarge(y) = (y − 4.99)/0.03 if 4.99 < y ≤ 5.02; 1 if y > 5.02; 0 otherwise
Nevertheless, if the minimum support is set to 10% and the minimum confidence is set to 50%, association rule mining algorithms based on the support-confidence framework will find 4 of the 10 associations interesting. When compared to the use of support and confidence measures that may accidentally mistake random patterns as interesting, the proposed measure is more effective.
Figure 2. Uniform random data (X vs. Y, both on the range 0–10)
Table 1. The interestingness of associations

Association
X = Very Small → Y = Small
X = Small → Y = Small
X = Medium → Y = Small
X = Large → Y = Small
X = Very Large → Y = Small
X = Very Small → Y = Large
X = Small → Y = Large
X = Medium → Y = Large
X = Large → Y = Large
X = Very Large → Y = Large
Testing with Inherent Relationships Embedded in the Data

In our second experiment, we tested the proposed measure for effectiveness when it is used to discover high-order associations. In this data set, each tuple is characterized by three attributes, namely, X, Y, and Z. Each of them can take on two values: T and F. The data set contains 1,024 tuples, and we generated the data according to the following relationships.
Figure 3. The second data set (tuples grouped by the values of X, Y, and Z)
X = F ∧ Y = F → Z = F
X = F ∧ Y = T → Z = T
X = T ∧ Y = F → Z = T
X = T ∧ Y = T → Z = F
To further examine the performance of our interestingness measure in the presence of uncertainty, 25% random noise was added to the data set by randomly changing the value of Z in 256 tuples (i.e., 25% of all tuples) from F to T and vice versa. Figure 3 shows the data. The association rules discovered based on the proposed interestingness measure, together with their supports and confidences, are given in Table 2. As shown in Table 2, our interestingness measure is able to discover the association rules that reflect exactly the inherent relationships embedded in the data. However, if the minimum support is set to 20% and the minimum confidence is set to 25%, as used in Srikant and Agrawal (1996), association rule mining algorithms adopting the support-confidence framework cannot find any of these associations. This demonstrates a weakness of user-supplied thresholds: if a threshold is set too high, a user may miss some useful rules (e.g., the test with the data set presented in this subsection); if it is set too low, the user may be overwhelmed by many irrelevant ones (e.g., the test with the data set discussed in the last subsection).

Table 2. The association rules discovered in the second data set

Association Rule
X = F ∧ Y = F → Z = F
X = F ∧ Y = F → Z = T
X = F ∧ Y = T → Z = F
X = F ∧ Y = T → Z = T
X = T ∧ Y = F → Z = F
X = T ∧ Y = F → Z = T
X = T ∧ Y = T → Z = F
X = T ∧ Y = T → Z = T

An Evaluation of Scalability

The databases used in our experiments on scalability are synthetic data generated using the tool provided by IBM (IBM Quest Data Mining Project, 1996). Each tuple in the databases is characterized by nine attributes. Of the nine attributes, three are discrete and six are continuous. In order to evaluate the performance of DFARM, we also implemented count distribution in our test bed using PVM. We chose to implement count distribution because the experimental results presented in Agrawal and Shafer (1996) show that the performance of count distribution is superior to data distribution and candidate distribution. For each database, we discretized the domains of continuous attributes into several intervals, and mapped the values of discrete attributes and the intervals of discretized continuous attributes into integers. We then applied count distribution to the transformed data. Since count distribution finds frequent item sets based on the support constraint, we applied it to the databases using various minimum supports so as to evaluate how its performance is affected by the setting of minimum support.

Size-Up

In our first experiment, we fixed the number of sites in the distributed system to 10. To evaluate the performance of DFARM and count distribution with respect to different database sizes, we increased the number of tuples from 1 million to 10 million in our experiment. Figure 4 shows the performance of DFARM and count distribution as the database size increases. In addition to the absolute execution times, we plot the size-up, which is the execution time normalized with respect to the execution time for 1 million tuples, in Figure 4
(CD (x%) denotes running count distribution with minimum support x%). As shown in Figure 4, DFARM scales almost linearly in this experiment. When the database size increases, more and more I/O (input/output) and CPU (central processing unit) processing are required to (a) scan the database to obtain the fuzzy local counts and (b) compute the interestingness measure for identifying interesting associations.
Figure 4. The size-up performance: (a) execution time (sec.) and (b) size-up, plotted against the number of tuples (in millions) for DFARM and CD with minimum supports 2%, 1.5%, 1%, and 0.75%
The amount of execution time spent in communication is more or less the same regardless of the database size, because the number of associations is independent of the database size and only their fuzzy local counts are exchanged between the different sites in the distributed system. This characteristic of the algorithm reduces the percentage of the overall execution time spent in communication as the database grows. Since the I/O and CPU processing in DFARM scale linearly with the database size, the algorithm shows almost linear size-up performance.
This experiment also shows that the performance of DFARM is superior to count distribution with respect to different database sizes. Specifically, DFARM is 2.8 times faster than count distribution with minimum support 2%, and 7.6 times faster than count distribution with minimum support 0.75%.
Speedup

In our second experiment, we fixed the database size to 2 million tuples. To evaluate the performance
Figure 5. The speedup performance: (a) execution time (sec.) and (b) speedup, plotted against the number of sites for DFARM and CD with minimum supports 2%, 1.5%, 1%, and 0.75%
of DFARM and count distribution with respect to different numbers of sites in the distributed system, we increased the number of sites from 1 to 10 in our experiment. Figure 5 shows their performance as the number of sites increases. In addition to the absolute execution times, we plot the speedup, which is the execution time normalized with respect to the execution time for a single site, in Figure 5. As shown in Figure 5, DFARM exhibits very good speedup performance in this experiment. In particular, when there are m sites in the distributed system, it shortens the execution time to about 1/m of the execution time for a single site. Nonetheless, given the same amount of data, the speedup deteriorates as the number of sites in the distributed system increases. This is because the communication time becomes a significant portion of the overall execution time in comparison to the relatively small processing time for the small amount of data at each site. This experiment also shows that DFARM outperforms count distribution with respect to different numbers of sites in the distributed system. Specifically, when there are two sites in the distributed system, DFARM is 2.7 times faster than count distribution with minimum support 2%, and 7.4 times faster than count distribution with minimum support 0.75%; when there are 10 sites, DFARM is 3 times faster than count distribution with minimum support 2%, and 8.3 times faster than count distribution with minimum support 0.75%.
Scale-Up

In this experiment, we fixed the size of the database partition at a site to 1 million tuples. We increased the number of sites in the distributed system from 1 to 10. Figure 6 shows the performance of DFARM as the number of sites increases. In addition to the absolute execution time, we plot the scale-up, which is the execution time normalized with respect to the execution time for a single site, in Figure 6. As shown in Figure 6, DFARM has very good scale-up performance. Since the number of associations it finds does not change when the database size increases, the I/O and CPU processing at each site remain constant. The execution time increases slightly as the database size and the number of sites increase. The small increment in execution time is due to the increase in communication overhead as more sites are added to the distributed system. This experiment also shows that, compared to count distribution, DFARM can better handle larger databases when more processors are available. DFARM is 2.7 times faster than count distribution with minimum support 2%, and 7.4 times faster than count distribution with minimum support 0.75%.
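The three scalability metrics used in these experiments are all simple normalizations of execution time. A minimal sketch, using made-up timing values rather than the measured ones:

```python
# Size-up, speedup, and scale-up are each execution time normalized against a
# baseline run. Timing values below are made up for illustration only.

def size_up(times_by_db_size, base_size):
    """Execution time normalized to the run on the smallest database."""
    base = times_by_db_size[base_size]
    return {size: t / base for size, t in times_by_db_size.items()}

def speedup(times_by_sites):
    """Single-site time divided by m-site time; the ideal value on m sites is m."""
    base = times_by_sites[1]
    return {m: base / t for m, t in times_by_sites.items()}

def scale_up(times_by_sites):
    """m-site time over single-site time with per-site data fixed; the ideal value is 1."""
    base = times_by_sites[1]
    return {m: t / base for m, t in times_by_sites.items()}

times = {1: 1000.0, 2: 520.0, 10: 115.0}  # seconds, hypothetical
print(speedup(times))
```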
CONCLUSION

In this chapter, we introduce a new distributed algorithm, called DFARM, for mining fuzzy association rules from very large databases. DFARM employs an objective interestingness measure to discriminate between interesting and uninteresting associations. It is shown to be very effective in identifying interesting associations in noisy data. It also has the advantage that it does not require users to specify any thresholds, which can often be found only by trial and error. In addition to its ability to discover interesting associations in databases, DFARM exploits the high scalability of distributed systems to better handle very large databases. We implemented DFARM in a distributed system using PVM and applied it to several databases to evaluate its effectiveness and scalability. The results of our experiments demonstrate that DFARM has very good size-up, speedup, and scale-up performance.
REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the ACM SIG-
Figure 6. The scale-up performance: (a) execution time (sec.) and (b) scale-up, plotted against the number of sites for DFARM and CD with minimum supports 2%, 1.5%, 1%, and 0.75%
MOD International Conference on Management of Data (pp. 207-216).

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487-499).

Agrawal, R., & Shafer, J. C. (1996). Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6), 962-969.

Au, W.-H., & Chan, K. C. C. (2001). Classification with degree of membership: A fuzzy approach. In Proceedings of the First IEEE International Conference on Data Mining (pp. 35-42).
Au, W.-H., & Chan, K. C. C. (2003). Mining fuzzy association rules in a bank-account database. IEEE Transactions on Fuzzy Systems, 11(2), 238-248.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.
Au, W.-H., Chan, K. C. C., & Wong, A. K. C. (2006). A fuzzy approach to partitioning continuous attributes for classification. IEEE Transactions on Knowledge and Data Engineering, 18(5), 715-719.
Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 1-12).
Carrasco, R. A., Vila, M. A., Galindo, J., & Cubero, J. C. (2000). FSQL: A tool for obtaining fuzzy dependencies. In Proceedings of the Eighth International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 1916-1919).

Chan, K. C. C., & Au, W.-H. (1997). Mining fuzzy association rules. In Proceedings of the Sixth International Conference on Information and Knowledge Management (pp. 209-215).

Chan, K. C. C., & Au, W.-H. (2001). Mining fuzzy association rules in a database containing relational and transactional data. In A. Kandel, M. Last, & H. Bunke (Eds.), Data mining and computational intelligence (pp. 95-114). New York: Physica-Verlag.

Delgado, M., Marín, N., Sánchez, D., & Vila, M. A. (2003). Fuzzy association rules: General model and applications. IEEE Transactions on Fuzzy Systems, 11(2), 214-225.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., & Sunderam, V. (1994). PVM: Parallel virtual machine. A user's guide and tutorial for networked parallel computing. Cambridge, MA: MIT Press.

Haberman, S. J. (1973). The analysis of residuals in cross-classified tables. Biometrics, 29(1), 205-220.

Han, E.-H., Karypis, G., & Kumar, V. (1997). Scalable parallel data mining for association rules. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 277-288).
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: The MIT Press.

Hirota, K., & Pedrycz, W. (1999). Fuzzy computing for data mining. Proceedings of the IEEE, 87(9), 1575-1600.

Hong, T. P., Kuo, C. S., & Chi, S. C. (1999). Mining association rules from quantitative data. Intelligent Data Analysis, 3(5), 363-376.

IBM Quest Data Mining Project. (1996). Quest synthetic data generation code. Retrieved October 12, 2001, from http://www.almaden.ibm.com/cs/quest/syndata.html

Kuok, C.-M., Fu, A., & Wong, M. H. (1998). Mining fuzzy association rules in databases. SIGMOD Record, 27(1), 41-46.

Liu, B., Hsu, W., & Ma, Y. (1998). Integrating classification and association rule mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 80-86).

MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge, United Kingdom: Cambridge University Press.

Maimon, O., Kandel, A., & Last, M. (1999). Information-theoretic fuzzy approach to knowledge discovery in databases. In R. Roy, T. Furuhashi, & P. K. Chawdhry (Eds.), Advances in soft computing: Engineering design and manufacturing (pp. 315-326). London: Springer-Verlag.

Mannila, H., Toivonen, H., & Verkamo, A. I. (1994). Efficient algorithms for discovering association
rules. In Proceedings of the AAAI Workshop on Knowledge Discovery in Databases (pp. 181-192).

Mendel, J. M. (1995). Fuzzy logic systems for engineering: A tutorial. Proceedings of the IEEE, 83(3), 345-377.

Osteyee, D. B., & Good, I. J. (1974). Information, weight of evidence, the singularity between probability measures and signal detection. Berlin, Germany: Springer-Verlag.

Park, J. S., Chen, M.-S., & Yu, P. S. (1995a). An efficient hash-based algorithm for mining association rules. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 175-186).

Park, J. S., Chen, M.-S., & Yu, P. S. (1995b). Efficient parallel data mining for association rules. In Proceedings of the Fourth International Conference on Information and Knowledge Management (pp. 31-36).

Savasere, A., Omiecinski, E., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st International Conference on Very Large Data Bases (pp. 432-444).

Shintani, T., & Kitsuregawa, M. (1996). Hash based parallel algorithms for mining association rules. In Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems (pp. 19-30).

Smyth, P., & Goodman, R. M. (1992). An information theoretic approach to rule induction from databases. IEEE Transactions on Knowledge and Data Engineering, 4(4), 301-316.

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 1-12).

Wong, A. K. C., & Wang, Y. (1997). High-order pattern discovery from discrete-valued data. IEEE
Transactions on Knowledge and Data Engineering, 9(6), 877-893.

Yager, R. R. (1991). On linguistic summaries of data. In G. Piatetsky-Shapiro & W. J. Frawley (Eds.), Knowledge discovery in databases (pp. 347-363). Menlo Park, CA: AAAI/MIT Press.

Yen, J., & Langari, R. (1999). Fuzzy logic: Intelligence, control, and information. Upper Saddle River, NJ: Prentice-Hall.

Zhang, W. (1999). Mining fuzzy quantitative association rules. In Proceedings of the 11th IEEE International Conference on Tools with Artificial Intelligence (pp. 99-102).
KEY TERMS

Adjusted Residual: A statistic defined as a function of the difference between the actual and the expected number of tuples characterized by different linguistic variables (attributes) and linguistic terms (attribute values). It can be used as an interestingness measure.

Associative Classification: A classification method based on association rules. An association rule with the class label as its consequent provides a clue that a tuple satisfying its antecedent belongs to a specific class. It can therefore be used as the basis of classification.

Fuzzy Association Rule: A fuzzy association rule involves linguistic terms (fuzzy sets) in its antecedent and/or consequent.

Fuzzy Partitioning: A methodology for generating fuzzy sets to represent the underlying data. Fuzzy partitioning techniques can be classified into three categories: grid partitioning, tree partitioning, and scatter partitioning. Of the different fuzzy partitioning methods, grid partitioning is the most commonly used in practice, particularly in system control applications. Grid partitioning forms a partition by dividing the input space into several fuzzy slices, each of which is specified by a membership function for each feature dimension.

Interestingness Measure: An interestingness measure represents how interesting an association is. The support is an example of an interestingness measure.
Negative Association Rule: A negative association rule's antecedent and consequent show a negative association. If its antecedent is satisfied, it is unlikely that its consequent will be satisfied.

Positive Association Rule: A positive association rule's antecedent and consequent show a positive association. If its antecedent is satisfied, it is likely that its consequent will be satisfied.
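For the adjusted residual defined above, Haberman's standard formulation can be computed directly from a contingency table. The sketch below is illustrative of that general statistic, not of the exact procedure the chapter applies to linguistic terms.

```python
import math

# Haberman's adjusted residual for cell (i, j) of a contingency table:
# d = (observed - expected) / sqrt(expected * (1 - row_total/n) * (1 - col_total/n))
def adjusted_residual(table, i, j):
    n = sum(sum(row) for row in table)
    row_total = sum(table[i])
    col_total = sum(row[j] for row in table)
    expected = row_total * col_total / n
    variance = expected * (1 - row_total / n) * (1 - col_total / n)
    return (table[i][j] - expected) / math.sqrt(variance)

# |d| > 1.96 suggests the cell deviates from independence at the 5% level.
table = [[30, 10], [10, 30]]
print(round(adjusted_residual(table, 0, 0), 3))  # 4.472
```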
Chapter XXVIII
Applying Fuzzy Logic in Dynamic Causal Mining

Yi Wang
Nottingham Trent University, UK
Abstract

This chapter applies fuzzy logic to the dynamic causal mining (DCM) algorithm and argues that DCM, a combination of association mining and system dynamics for discovering causality patterns, needs a more substantive approach for the user to understand the nature of extracted rules and information in a variety of contexts. Furthermore, the author hopes that the use of fuzzy logic will not only assist the user in making better decisions, but also assist in a better understanding of the future behaviour of the target system.
Introduction

Dynamic causal mining (DCM) assists decision makers in controlling a system at decision points by converting data into policies. DCM searches for simultaneous dynamic causal relations in a database and discovers delay and feedback relationships between attributes based on separate time stamps. This makes the algorithm more suitable for dynamic modeling and enables the discovery of hidden dynamic structures, which can be applied to predict the future behaviour of a dynamic system.
Causality, in this chapter, denotes a relationship between two or more entities. There are two types of causality: static and dynamic. In market basket analysis, an example of static causality could be that purchasing nails might cause the purchase of a hammer. An example of dynamic causality is that an increase in the purchase of chips might cause an increase in the purchase of soft drinks. However, an increase in the purchase of nails might not cause an increase in the purchase of hammers (one hammer is enough to drive all the nails); thus, this is not a dynamic causality.
The DCM approach faces problems such as accuracy and efficiency, and this chapter further suggests using fuzzy sets to solve these problems. Compared to quantitative rules, fuzzy rules cope better with the sharp boundaries between neighbouring sets. In most real-life applications, databases contain many attribute values other than 0 and 1. Quantitative attributes such as production volume and income take values from an ordinal scale. One way of dealing with a quantitative attribute is to divide the range of the original attribute into partitions, such as low, medium, and high. It is more intuitive to allow attribute values to vary over the interval [0, 1] (instead of just 0 or 1), indicating the degree of belonging. Thus, attributes are no longer binary but fuzzy.

This chapter suggests relaxing the strict separation among polarity +, polarity -, and neutrality, and using more flexible linguistic terms such as high increase, increase, high decrease, decrease, and neutral. Fuzzy sets can provide a reasonable representation using cognitive concepts in terms of natural language. These linguistic terms use graded statements rather than ones that are strictly true or false, and thus provide an approximate but effective way to describe the dynamic causal behaviour of systems (Zadeh, 1975a, 1975b, 1975c). This approach not only obtains more human-understandable knowledge from the database, but also provides more compact and robust representations. The use of fuzzy partitions of the domains of quantitative attributes can avoid some undesirable threshold effects that are usually produced by crisp (nonfuzzy) partitions.

The rest of this chapter is structured as follows. First, a brief description of dynamic causal mining and its components is presented. Second, the fuzzy approach is introduced. Then the detailed fuzzy algorithm is described. An illustrative example is used to show how the fuzzified DCM can be applied. It is followed by a real-life example.
Finally, the conclusion and future work are presented. This chapter will not give a detailed introduction to fuzzy data mining; another chapter of this book includes an introduction to fuzzy data
mining methods by Feil and Abonyi. Also in this volume, the reader can find one application of fuzzy data mining techniques to tourism by Carrasco, Araque, Salguero, and Vila.
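The graded linguistic terms proposed in the introduction (high increase, increase, neutral, decrease, high decrease) can be realized with simple triangular membership functions; the breakpoints below are illustrative assumptions, not values from the chapter.

```python
# Triangular membership functions for linguistic terms over an attribute change,
# replacing the crisp polarities +, -, 0. Breakpoints are illustrative assumptions.

def triangular(a, b, c):
    """Membership function rising from a to the peak b, then falling to c."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

terms = {
    "high decrease": triangular(-15.0, -10.0, -5.0),
    "decrease":      triangular(-10.0, -5.0, 0.0),
    "neutral":       triangular(-2.0, 0.0, 2.0),
    "increase":      triangular(0.0, 5.0, 10.0),
    "high increase": triangular(5.0, 10.0, 15.0),
}

change = 7.0  # change of some attribute between two consecutive time stamps
degrees = {name: mu(change) for name, mu in terms.items()}
print(degrees["increase"], degrees["high increase"])  # 0.6 0.4
```

A change of 7 thus belongs partly to "increase" and partly to "high increase", rather than being forced into a single crisp polarity.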
Dynamic Causal Mining

The DCM algorithm was introduced in 2005 (Pham, Wang, & Dimov, 2005), using only the counting algorithm to integrate with game theory. It was extended in 2006 (Pham, Wang, & Dimov, 2006) with delay and feedback analysis, and was further improved for analysis in game theory with formal concept analysis (Wang, 2007). DCM enables the generation of dynamic causal rules from data sets by integrating the concepts of systems thinking (Senge, Kleiner, Roberts, Ross, & Smith, 1994) and system dynamics (Forrester, 1961) with association mining (Agrawal, Mannila, Srikant, Toivonen, & Inkeri, 1996). The algorithm can process data sets with both categorical and numerical attributes. Compared with other association mining algorithms, DCM rule sets are smaller and more dynamically focused. The pruning is carried out based on polarities. This reduces the size of the pruned data set and still maintains the accuracy of the generated rule sets. The rules extracted can be joined to create a dynamic policy, which can be simulated through software for future decision making. The rest of this section gives a brief review of association mining, fuzzy data mining, and system dynamics.
Association Mining

Association mining was introduced by Agrawal et al. (1996). It was further improved in various ways, such as in speed (Agrawal et al.; Cheung, Han, Ng, & Fu, 1996) and with parallelism (Zaki, Parthasarathy, Ogihara, & Li, 1997), to find interesting associations and/or correlation relationships among large sets of data items. It shows attribute value conditions that occur frequently together in a given data set. It generates
the candidate item sets by joining the large item sets of the previous pass and deleting those subsets that are small in the previous pass, without considering the transactions in the database. By only considering large item sets of the previous pass, the number of large candidate item sets is significantly reduced. The following is an example of what association mining does. Given the items milk (M), bread (B), cheese (C), and honey (H), and the following purchase transactions, T1: {M, C, D}, T2: {B, C, H}, T3: {M, B, C, H}, T4: {B, H}, if given the threshold of 1, which means that any frequency equal to or below 1 is pruned away, then the following is the process of association mining.

Scan 1: Scan for frequency of single items: {M}:2, {B}:3, {C}:3, {D}:1, {H}:3
Frequent items: {M}:2, {B}:3, {C}:3, {H}:3
Pruned item: {D}

Scan 2: Scan for frequency of pairs of items: {M,B}:1, {M,C}:2, {M,H}:1, {B,C}:2, {B,H}:3, {C,H}:2
Frequent item sets: {M,C}:2, {B,C}:2, {B,H}:3, {C,H}:2
Pruned item sets: {M,B}:1, {M,H}:1

Scan 3: Scan for frequency of three items: {B, C, H}:2
Frequent item sets: {B, C, H}
Final association rules: B→C→H
The rule indicates that people who bought bread have a high chance of also buying cheese and honey, given the transactions above. Counting the support of candidates is a time-consuming step in the algorithm; to reduce the number of candidates that need to be checked for a
given transaction, candidate item sets Ck are stored in a format such as a hash tree (Srikant & Agrawal, 1996) or lattice diagram (Zaki et al., 1997). When an item set is inserted in a hash tree, it is required to start from the root and go down the tree until a leaf is reached. Furthermore, Lk are stored in a hash table to make the pruning step faster.
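The level-wise counting and pruning illustrated above can be reproduced with a plain Apriori-style sketch. This is an illustration of the scans on the worked example, not the hash-tree-optimized implementation just described.

```python
from itertools import combinations

# Transactions from the worked example above; a minimum frequency threshold
# of 2 means item sets occurring once or less are pruned.
transactions = [{"M", "C", "D"}, {"B", "C", "H"}, {"M", "B", "C", "H"}, {"B", "H"}]
min_count = 2

def frequent_itemsets(transactions, min_count):
    """Level-wise (Apriori-style) search for frequent item sets."""
    items = sorted({i for t in transactions for i in t})
    frequent = []
    k = 1
    while True:
        candidates = [frozenset(c) for c in combinations(items, k)]
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c for c, n in counts.items() if n >= min_count}
        if not level:
            return frequent
        frequent.append(level)
        # Only items that appear in some frequent k-set survive to level k+1.
        items = sorted({i for c in level for i in c})
        k += 1

levels = frequent_itemsets(transactions, min_count)
print(levels[-1])  # the only frequent 3-item set: {B, C, H}
```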
Fuzzy Data Mining

Real-world data are often filled with imprecise values that have to be normalized into well-defined and unambiguous data in order to be handled by the standard relational data model. Several extensions to the classical relational model have been proposed to support quantitative data (Agrawal et al., 1996; Brin, Motwani, Ullman, & Tsur, 1997; Srikant & Agrawal, 1996). The fuzzy association rule is of the form "If X is FX then Y is FY." As in the binary association rule, "X is FX" is called the antecedent of the rule while "Y is FY" is called the consequent of the rule. X and Y are sets of attributes of the database, and FX and FY are labels representing fuzzy sets that characterize X and Y, respectively.

The fuzzy approach represents a more robust solution for systems that lack flexibility. This approach not only obtains more human-understandable knowledge from the database, but also provides more compact and robust representations not weakened by "patched" data types, based on a strong theoretical model (Zaki et al., 1997). The Boolean association rules have been adapted to handle quantitative data based on the fuzzy layout (Fu, Wong, Sze, Wong, Wong, & Yu, 1998; Hong, Kuo, Chi, & Wang, 2000; Kuok, Fu, & Wong, 1998), reusing all the previous research and algorithms without the need to discover new techniques. In Raju and Majumdar (1988), a deep analysis of different relations in the fuzzy domain was established, which is used in many of today's mining algorithms. Fu et al. (1998) suggested that an external entity, usually the end user or an expert, should create the
fuzzy sets of the quantitative attributes and the membership functions since the algorithm relies quite crucially on the appropriateness of the fuzzy sets to the given data.
Systems Thinking

Systems thinking is based on the belief that the component parts of a system will act differently when isolated from their environment or from other parts of the system. Systems thinking is about the interrelated actions that provide a conceptual framework or a body of knowledge that makes the pattern clearer (Senge et al., 1994). It is a combination of many theories, such as the soft systems approach and system theory (Flood, 1999). Systems thinking seeks to explore things as wholes through patterns of interrelated actions.
System Dynamics

System dynamics can be defined as "a qualitative and quantitative approach to describe model and design structures for managed systems in order to understand how delays, feedback, and interrelationships among attributes influence the behaviour of the systems over time" (Coyle, 1996), or a model whose purpose "is to solve a problem, not simply to model a system. The model must simplify the system to a point where the model replicates a specific problem" (Sterman, 1994).

System dynamics is a tool to visualize and understand such patterns of dynamic complexity, which is built up from a set of system archetypes based on principles in systems thinking. System dynamics visualizes complex systems through causal loop diagrams. A causal loop diagram consists of a few basic shapes (Sterman, 2000) that together describe the action modeled. System dynamics addresses two types of behaviour: sympathetic and antipathetic (Pham et al., 2006). Sympathetic behaviour indicates an initial quantity of a target attribute that starts to grow, and the rate of growth increases. Antipathetic
behaviour indicates an initial quantity of a target attribute that starts either above or below a goal level and over time moves toward the goal.
Time Stamp

To find dynamic causality among a set of attributes means to identify correlations and interdependencies between them. The DCM algorithm is a way of describing the state of a target system as it evolves in time. It discovers dynamic causality in a data set by matching the dynamic behaviour between separated attributes.

Definition 1. A dynamic time stamp is created from two time stamps. Consider two time stamps ti and ti+1. The dynamic time stamp ∆ti is equal to the difference between the two consecutive time stamps:

∆ti = ti+1 – ti    (1)
Time stamps are used for identifying the range of variables. The size of each time stamp is selected by the specific need and may vary in different situations. For instance, a time stamp for an increase in production may be in the order of months, while for a change in a cell may be in the order of milliseconds. The time stamp also can help to determine how detailed the variables need to be. The attribute may increase or decrease dramatically if the time stamp is in the order of seconds; however, it may be assumed to be constant if the time stamp is in the order of years. All the time stamps should be of uniform length. In order to carry out DCM, the time stamps are summarised or partitioned into equal-sized time stamps. A time stamp is useful for describing and prescribing changes to the systems and objects.
Data

Definition 2. A dynamic attribute is the change or the difference between two attribute values with consecutive time stamps. The two types of value do not have the same nature. Let D denote a data set
that contains a set of n records with attributes {A1, A2, A3, …, Am}, where each attribute is of a unique type (for example, sale price, production volume, inventory volume, etc.). Each attribute is associated with a time stamp ti, where i = {1, 2, 3, …, n}. Let Dnew be a new database constructed from D such that the dynamic attribute ∆Am,∆ti in Dnew is given by

∆Am,∆ti = Am,ti+1 − Am,ti    (2)

where m identifies the attribute of interest.

Delay and Feedback

Delay and feedback play central roles in many processes and systems. It takes time to create a product, manage a service, execute an operation, or build a facility. Delays are important parts of any system. They represent the time between a change occurring in one part of the system and the cause of change in the other part.

Definition 3. A delay is the time difference between two dynamic causal events occurring in different parts of the same system. The dynamic causal event is represented by the dynamic causal attribute, and the system characteristic is represented by the database. Delays play important roles in deciding the dynamic causal behaviour of systems. The modeling of systems therefore necessitates that delays are properly represented in order that real-life behaviour can be replicated. A delayed dynamic causal relationship between two attributes implies a change of attribute values as shown in Figure 1: a dynamic attribute A1 at a time point t1 causes a change to attribute A2 at a time point t2.

Definition 4. Feedback is a counter effect from another source to the original source of the effect. Feedback deals with the control and determination of deviations from a desired state and executes corrective action regarding these deviations. Feedback refers to the method of controlling a system by reinserting the results from its past performance.

Figure 1. Delayed dynamic causal relation
Table 1. The creation of the dynamic database

Original database D:

Time   A1   A2
1       9    2
2      17    3
3      10   12
4       4   16
5       7   24

Derived database Dnew:

∆t    ∆A1   ∆A2
∆t1    +8    +1
∆t2    -7    +9
∆t3    -6    +4
∆t4    +3    +8
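The derivation of Dnew from D in Equation 2 can be reproduced in a few lines, using the values of Table 1:

```python
# Equation 2: each dynamic attribute in Dnew is the difference between
# consecutive values of the corresponding attribute in D (values from Table 1).
D = {"A1": [9, 17, 10, 4, 7], "A2": [2, 3, 12, 16, 24]}

D_new = {
    name: [values[i + 1] - values[i] for i in range(len(values) - 1)]
    for name, values in D.items()
}
print(D_new)  # {'A1': [8, -7, -6, 3], 'A2': [1, 9, 4, 8]}
```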
A feedback relation means that the change in A2 at t2, due to A1 at t1, would in turn cause the value of A1 to alter at t3.

Let D denote a database that contains a set of n records with attributes {A1, A2, A3, …, Am}, where each attribute is of a unique type (e.g., sale price, production volume, inventory volume, etc.). Each attribute is associated with a time stamp t. The records are arranged in a temporal sequence (t1, t2, …, tn). Table 1 is an example of such a database. Since the sequence t1, t2, …, tn is arranged in ascending numerical order, ∆ti is always increasing. Table 1 also illustrates the new database Dnew derived from database D. So far the database contains records with time intervals of equal length. In case more than two values of attribute Am exist between ti and ti+1, an average of the values is selected, depending on the preprocessing phase.
Measurements

Since the input of the DCM algorithm can be quite large, it is important to prune away redundant attributes.

Definition 5. A polarity indicates the direction of a change of an attribute. There are three types of polarity (+, -, 0), where + indicates an increase, - indicates a decrease, and 0 indicates neutrality, that is, no change at all.

Definition 6. A dynamic support is defined as the ratio of the number of records of a given polarity combination to the total number of records in the database with respect to the corresponding time stamps. There are eight polarity combinations: (+,+,+), (-,-,-), (+,-,-), (-,+,+), (+,+,-), (-,-,+), (+,-,+), and (-,+,-).

Measures are used to identify dynamic causal rules (or relationships). These are fully sympathetic support, fully antipathetic support, self-sympathetic support, self-antipathetic support, and single support, which is an increase, a decrease, or neutral. For database Dnew and any two attributes ΔA1 and ΔA2, the different types of support are defined as follows:

fully sympathetic support (ΔA1, ΔA2, ΔA1) = freq(+∆ti, +∆ti+1, +∆ti+2) / n

fully antipathetic support (ΔA1, ΔA2, ΔA1) = freq(−∆ti, −∆ti+1, −∆ti+2) / n
where n is the total number of dynamic time stamps and m identifies the attribute of interest. freq(+∆ti, +∆ti+1, +∆ti+2) is a function of the number of times where an increase in ΔA1 is followed by an increase in ΔA2, which induces another increase in ΔA1 with respect to the time stamps ∆ti, ∆ti+1, and ∆ti+2. Similarly, freq(−∆ti, −∆ti+1, −∆ti+2) is a function of the number of times where a decrease in ΔA1 is followed by a decrease in ΔA2, which induces another decrease in ΔA1 with respect to the time stamps ∆ti, ∆ti+1, and ∆ti+2. The neutral support indicates the frequency of value 0 in a derived attribute. This support is used to prune ineffectual attributes.

Table 2 shows the derived database Dnew with arrows indicating the direction in which supports are counted. In this example, the neutral support is 0 since there is no record of value 0 in ∆A1 and ∆A2. The other supports are counted by following the direction of the arrows. A left-to-right arrow indicates the causal relation ∆A1,∆ti → ∆A2,∆ti+1, and a right-to-left arrow indicates ∆A2,∆ti+1 → ∆A1,∆ti+2. The result is shown in Table 3.

Table 2. Derived database Dnew with arrows indicating support counting direction

∆t    ∆A1   ∆A2
∆t1    +8    +1
∆t2    -7    +9
∆t3    -6    +4
∆t4    +3    +8

Dynamic causal rules are used to predict future dynamic behaviour. Each causal rule is assigned a polarity combination, and each polarity is assigned a time stamp. According to Equations 3 to 13 and as shown in Table 3, there are eight polarity combinations of interest, namely, (+∆ti, +∆ti+1, +∆ti+2), (−∆ti, −∆ti+1, −∆ti+2), (−∆ti, +∆ti+1, +∆ti+2), (+∆ti, −∆ti+1, −∆ti+2), (+∆ti, −∆ti+1, +∆ti+2), (−∆ti, +∆ti+1, −∆ti+2), (+∆ti, +∆ti+1, −∆ti+2), and (−∆ti, −∆ti+1, +∆ti+2). This polarity representation differs from that used in classical causal rules (either + or -), which are too simple to model dynamic behaviours in real-world systems.

An Illustrative Example
This section uses a simple example to illustrate the DCM. This is a classical example, without fuzzy logic. Table 4 shows a database where the first column indicates the time instant, which could be hours, weeks, or years. The numbers in the columns have the same units. They could, for example, be purchase prices or sales levels, and so forth. The first row of Table 4 can therefore be interpreted as to mean in Week 1, Company 1 decides to produce nine units, Company 2 decides to make two units, and so on. Dynamic causal mining is to be applied to these data to derive any dynamic causal relationships between these production volumes in order to assist a company in deciding its future manufacturing strategy. Table 5 shows the database derived after the difference calculation using Equation 2. Table 6 illustrates the pruned database. Pruning is carried out to remove columns (attributes) where the level of neutral support is below a set minimum. In this example, columns with seven or more zeros (meaning with seven or more records with neutral polarities) are removed.
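The counting of these polarity triples over a derived database can be sketched in a few lines of Python (an illustrative reconstruction; the function names are not from the chapter):

```python
from collections import Counter

def polarity(v):
    """Map a derived value to its polarity symbol (+, -, or 0)."""
    return '+' if v > 0 else '-' if v < 0 else '0'

def triple_supports(dA1, dA2):
    """Count triples (dA1 at t_i, dA2 at t_i+1, dA1 at t_i+2) and divide
    by the total number of dynamic time stamps, as in Table 3."""
    n = len(dA1)
    counts = Counter()
    for i in range(n - 2):
        counts[(polarity(dA1[i]), polarity(dA2[i + 1]), polarity(dA1[i + 2]))] += 1
    return {k: c / n for k, c in counts.items()}

# Derived database Dnew from Table 2
supports = triple_supports([+8, -7, -6, +3], [+1, +9, +4, +8])
print(supports)  # {('+', '+', '-'): 0.25, ('-', '+', '+'): 0.25}
```

With the Table 2 values this reproduces the two nonzero entries of Table 3: (+,+,-) = 1/4 and (-,+,+) = 1/4.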
Table 3. Counting result

Sup (ΔA1, ΔA2, ΔA1)          Sup (ΔA1, ΔA2, ΔA1)
(+,+,+)  0                   (+,-,+)  0
(-,-,-)  0                   (-,+,-)  0
(+,-,-)  0                   (+,+,-)  1/4
(-,+,+)  1/4                 (-,-,+)  0
Table 4. Original database (time stamps 1 to 10; the attribute columns are not reproduced here)

In general, when the number of zeros in a column is high relative to the total number of entries, the corresponding attribute can be regarded as unaffected by the attributes represented in the other columns. Even if a few of the remaining nonzero entries are large in magnitude, their effect on the sympathetic and antipathetic support counts will be small. The derived dynamic attributes that remain after pruning are:

∆A1   +8  -7  -6  +3  -1   0  +5  +9  +1
∆A2   +1  +9  +4  +8  -6   0  +3  -9  -5
∆A4   +6  -1  -5  -8  -3  -6   0  -4  +7
∆A7    0  -7  -4  +8  -5  -1  +5  -7  +6

Table 7 shows the supports for the attributes in Table 6, taken in pairs. The supports are calculated according to Equations 3 to 13. Suppose that the support threshold is set to 0.1, which means any attribute pair with support greater than or equal to 0.1 is considered dynamically causally related. The results obtained are shown in Table 8. Thus, for the given database, the strong self-sympathetic rules are (ΔA1&ΔA2) and (ΔA1&ΔA7). The only strong self-antipathetic rule is (ΔA2&ΔA7), which is also the only strong fully antipathetic rule. The derived rules reveal to decision makers that changes in attribute A1 will be reinforced and that changes in attribute A2 will tend to be opposed. Such a finding would not have been possible without considering delayed and feedback relationships.
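The neutral-support pruning step described above can be sketched as follows (illustrative Python; dA3 is a hypothetical mostly-neutral column, since not all of the original Table 5 columns are reproduced in this section):

```python
def prune_neutral(columns, max_zeros):
    """Drop derived attributes whose neutral (zero) count reaches max_zeros."""
    return {name: vals for name, vals in columns.items()
            if sum(1 for v in vals if v == 0) < max_zeros}

derived = {
    'dA1': [+8, -7, -6, +3, -1, 0, +5, +9, +1],
    'dA2': [+1, +9, +4, +8, -6, 0, +3, -9, -5],
    'dA3': [0, 0, 0, 0, 0, 0, 0, +1, 0],  # hypothetical: eight neutral records
}
kept = prune_neutral(derived, max_zeros=7)
print(sorted(kept))  # ['dA1', 'dA2']
```

Here dA3, with eight zeros out of nine records, is regarded as unaffected by the other attributes and is removed before rule generation.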
Fuzzy Approach

This section gives details of how fuzzy logic is applied to DCM. Only the trapezoidal membership function is used as an illustration, since it is the simplest.
Definition 7. A fuzzy dynamic attribute is identified by membership functions that generalize the characteristics of a dynamic attribute ∆Am in Dnew. Let L be a linguistic term expressing one of these characteristics. Then the associated membership function specifies the membership degree of each value in ∆Am to L. The membership degrees are taken from the unit interval [0,1]; that is, a membership function is a mapping ∆Am → [0,1]. µL(∆Am,∆ti) denotes the membership degree of the dynamic attribute ∆Am,∆ti at time stamp i:

µL(∆Am,∆ti) ∈ [0, 1]    (14)

Figure 2 shows five possible membership functions used in DCM. Given an attribute ∆Am,∆ti, F1 = Min(-∆Am,∆ti), F2 = Average(-∆Am,∆ti), F3 = Average(+∆Am,∆ti), and F4 = Max(+∆Am,∆ti); that is, F1 is the smallest value in ∆Am,∆ti, F2 is the average of the negative values in ∆Am,∆ti, F3 is the average of the positive values in ∆Am,∆ti, and F4 is the largest value in ∆Am,∆ti. If Dnew in Table 1 is presented, the membership functions can be calculated for both ∆A1 and ∆A2: for ∆A1, F1 = -7, F2 = -6.5, F3 = 5.5, and F4 = 8; for ∆A2, F3 = 5.5 and F4 = 9. Thus, the fuzzy support can be defined as:

fuzzy support = F(∑ µL(∆A1, ∆ti), ∑ µL(∆A2, ∆ti+1), ∑ µL(∆A1, ∆ti+2))    (15)

where F is the average function giving the sum of the membership degrees for a dynamic change in ΔA1, which is followed by another dynamic change in ΔA2, which induces another dynamic change in ΔA1 with respect to the time stamps ∆ti, ∆ti+1, and ∆ti+2. The neutral support indicates the sum of the membership degrees for the linguistic term neutral in a derived attribute. Using the five membership functions (high decrease, decrease, neutral, increase, high increase), they can be defined mathematically as follows.
Figure 2. The proposed fuzzy partitions (membership degree against polarity: high decrease, decrease, neutral, increase, and high increase, with breakpoints F1, F2, F3, and F4)
Figure 3. Membership functions for ∆A1 (breakpoints -7, -6.5, 5.5, and 8) and ∆A2 (breakpoints 5.5 and 9)
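The breakpoints F1 to F4 can be computed directly from a derived attribute; a minimal sketch follows (the handling of attributes with no negative or no positive values, as for ∆A2 above, is an assumption):

```python
def partition_points(values):
    """F1/F4 are the extreme values; F2/F3 the averages of the
    negative/positive values, as defined for the fuzzy partitions."""
    neg = [v for v in values if v < 0]
    pos = [v for v in values if v > 0]
    F1 = min(neg) if neg else 0.0
    F2 = sum(neg) / len(neg) if neg else 0.0
    F3 = sum(pos) / len(pos) if pos else 0.0
    F4 = max(pos) if pos else 0.0
    return F1, F2, F3, F4

# dA1 from Dnew: F1 = -7, F2 = -6.5, F3 = 5.5, F4 = 8, as in the text
print(partition_points([+8, -7, -6, +3]))  # (-7, -6.5, 5.5, 8)
```

For ∆A2 = (+1, +9, +4, +8), which has no negative values, this yields F3 = 5.5 and F4 = 9, matching the values quoted in the text.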
µNeutral(∆Am,∆ti) =
    0                            if ∆Am,∆ti ≤ F2
    (F2 - ∆Am,∆ti) / F2          if F2 < ∆Am,∆ti ≤ 0
    (F3 - ∆Am,∆ti) / F3          if 0 < ∆Am,∆ti ≤ F3
    0                            if ∆Am,∆ti > F3          (16)

µIncrease(∆Am,∆ti) =
    0                            if ∆Am,∆ti ≤ 0
    ∆Am,∆ti / F3                 if 0 < ∆Am,∆ti ≤ F3
    (F4 - ∆Am,∆ti) / (F4 - F3)   if F3 < ∆Am,∆ti ≤ F4
    0                            if ∆Am,∆ti > F4          (17)

µHigh increase(∆Am,∆ti) =
    0                            if ∆Am,∆ti ≤ F3
    (∆Am,∆ti - F3) / (F4 - F3)   if ∆Am,∆ti > F3          (18)

µDecrease(∆Am,∆ti) =
    0                            if ∆Am,∆ti ≥ 0
    ∆Am,∆ti / F2                 if F2 ≤ ∆Am,∆ti < 0
    (F1 - ∆Am,∆ti) / (F1 - F2)   if F1 ≤ ∆Am,∆ti < F2
    0                            if ∆Am,∆ti < F1          (19)

µHigh decrease(∆Am,∆ti) =
    0                            if ∆Am,∆ti ≥ F2
    (∆Am,∆ti - F2) / (F1 - F2)   if ∆Am,∆ti < F2          (20)

Table 9 shows the result of mapping the crisp values from Dnew. The max operator is then used to derive the dominant polarity. For example, F(∆A1) at ∆t3 has membership 0.92 in decrease and 0.08 in neutral; since 0.92 > 0.08, D is selected. Table 10 shows the result when the max operator is applied.

Table 9. Fuzzy database

          F(∆A1)                          F(∆A2)
∆t    Hd    D     N     I     Hi      N     I     Hi
∆t1   0     0     0     0     1       0.82  0.19  0
∆t2   1     0     0     0     0       0     0     1
∆t3   0     0.92  0.08  0     0       0.27  0.73  0
∆t4   0     0     0.45  0.55  0       0     0.29  0.71

Table 10. The result

∆t       ∆t1    ∆t2    ∆t3    ∆t4
F(∆A1)   1Hi    1Hd    0.92D  0.55I
F(∆A2)   0.82N  1Hi    0.73I  0.71Hi

Fuzzy Algorithm

The strategy consists of the following steps:

1. Fuzzify all dynamic attributes according to the membership functions using Equations 16 to 20.
2. Prune all dynamic attributes where the counted single-attribute support is below the user-defined threshold.
3. Prune all dynamic attribute pairs where the counted pairwise support is below the user-defined threshold.
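Equations 16 to 20 can be translated line by line into code; the following sketch (illustrative, with the F values for ∆A1 taken from the text) reproduces the Table 9 entry used in the example above:

```python
def memberships(x, F1, F2, F3, F4):
    """Membership degrees for the five linguistic terms (Equations 16-20)."""
    if F2 < x <= 0:
        neu = (F2 - x) / F2
    elif 0 < x <= F3:
        neu = (F3 - x) / F3
    else:
        neu = 0.0
    if 0 < x <= F3:
        inc = x / F3
    elif F3 < x <= F4:
        inc = (F4 - x) / (F4 - F3)
    else:
        inc = 0.0
    hi = (x - F3) / (F4 - F3) if x > F3 else 0.0
    if F2 <= x < 0:
        dec = x / F2
    elif F1 <= x < F2:
        dec = (F1 - x) / (F1 - F2)
    else:
        dec = 0.0
    hd = (x - F2) / (F1 - F2) if x < F2 else 0.0
    return {'Hd': hd, 'D': dec, 'N': neu, 'I': inc, 'Hi': hi}

# dA1 at t3 is -6 (F1 = -7, F2 = -6.5, F3 = 5.5, F4 = 8): 0.92 decrease, 0.08 neutral
m = memberships(-6, F1=-7, F2=-6.5, F3=5.5, F4=8)
print(round(m['D'], 2), round(m['N'], 2))  # 0.92 0.08
```

The same function applied to the value +3 at ∆t4 gives memberships of about 0.55 in increase and 0.45 in neutral, matching the corresponding Table 9 row.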
The mining process uses a separate pruning method instead of brute force; that is, it checks the delay and feedback relationships separately. First, the algorithm checks the delay relationships and prunes away the redundant attribute sets; it then checks the feedback relationships in the remaining sets. This reduces the number of scans in each pass. The algorithm works under the assumption that the support of a subset is always at least the support of any superset, as in DCM. The process contains two parts.

Part 1: Preprocessing. Removal of the least causal data from the database.

Input: the original (numerical) database, and the values of the pruning thresholds for the neutral, sympathetic, and antipathetic supports.

Step 1: Calculate F1, F2, F3, and F4.
Step 2: Initialize a new database with dynamic attributes based on the fuzzy membership functions derived from F1, F2, F3, and F4.
Step 3: Sum the membership degrees for each polarity with respect to each dynamic attribute.
Step 4: Prune away all the dynamic attributes with supports above the input thresholds.
Step 5: Apply the max operator to the fuzzified attributes to identify the single polarity.

Part 2: Mining. Formation of a rule set that covers all training examples with a minimum number of rules.

Input: the preprocessed database, and the values of the pruning thresholds for the supports of the polarity combinations.

Step 1: Initialize a counter for the attribute pairs.
Step 2: Initialize an empty database.
Step 3: For each pair of attributes, summarise the average degree of the polarity combinations for the pair, considering only ∆Am,∆ti → ∆Am,∆ti+1, and store the nonneutral linguistic terms whose polarity combination is above the input threshold in the empty database.
Step 4: For each pair of attributes in the new database, summarise the average degree of the polarity combinations for the pair with feedback, and prune the pairs whose polarity combination is below the input threshold.

The subroutine accepts the database, finds and returns the complete dynamic set of the database, and prunes away all attributes that fail the neutral support threshold test. The algorithm generates a new transformed database from the derived database by specifying a membership function that translates the dynamic attributes into fuzzy form. The single fuzzy attribute based on the membership function is first generated from the derived database. If the sum of the membership degrees is above the user-specified support level, the attribute is kept; otherwise, it is pruned. In this subroutine, the fuzzy database is scanned in the same way as in the section "An Illustrative Example," and the fuzzy support based on the polarity
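Step 5 of the preprocessing part (the max operator selecting the single dominant polarity, as in the step from Table 9 to Table 10) can be sketched as follows (illustrative code; the function name is not from the chapter):

```python
def dominant_polarity(degrees):
    """Pick the linguistic term with the largest membership degree."""
    term = max(degrees, key=degrees.get)
    return degrees[term], term

# F(dA1) at t3 from Table 9: 0.92 decrease against 0.08 neutral
deg, term = dominant_polarity({'Hd': 0, 'D': 0.92, 'N': 0.08, 'I': 0, 'Hi': 0})
print(f"{deg}{term}")  # 0.92D
```

Applying this to every fuzzified record yields the compact single-polarity representation shown in Table 10.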
combination is counted. The dynamic sets with support greater than or equal to minsup are generated following the separate pruning strategy. If Table 5 is presented, then the membership functions can be generated; they are depicted in Figures 4 to 7. Table 11 shows the fuzzified database. The first step of pruning is based on the neutral support; 2 is selected as the neutral support threshold. This means that the attributes whose neutral membership degrees sum to 2 or more will be pruned. Table 12 shows the result after µL(∆A1) is pruned. After the neutral pruning, the max operator is applied to the fuzzy data sets; Table 12 shows the result.

Figure 4. Membership functions for ΔA1 (decrease, neutral, increase, and high increase; breakpoints -7, 0, 5.2, and 9)
Figure 5. Membership functions for ΔA2 (high decrease, decrease, neutral, increase, and high increase; breakpoints -8, -6.3, 0, 5, and 9)
Figure 6. Membership functions for ΔA4 (high decrease, decrease, neutral, increase, and high increase; breakpoints -8, -4.5, 0, 6.5, and 7)
Figure 7. Membership functions for ΔA7
Finally, the support is calculated and the result is shown in Table 13.
Experiment

This section contains two parts. The first part compares the fuzzy DCM algorithm with the crisp DCM algorithm. The second shows in detail how DCM is applied to real-world data.
Algorithm Testing

The algorithm was tested against seven data sets: five taken from the UCI Machine Learning Repository and two taken from the real world (those marked with the ® sign). The data are explained in Appendix A. Table 14 shows the comparison between the numbers of strong rules generated by DCM with a support level and DCM with fuzzy sets. Three support thresholds were selected for the comparison. The result shows
Table 12. Pruned results

∆t    MAX(F(∆A2))   MAX(F(∆A4))   MAX(F(∆A7))
∆t1   0.80D         0.93I         0
∆t2   Hi            0.78N         Hd
∆t3   0.80I         0.89D         0.83D
∆t4   0.89Hi        Hd            Hi
∆t5   0.95D         0.67D         0.91D
∆t6   0             0.75D         0.80N
∆t7   0.60I         0             0.79I
∆t8   Hd            0.88D         Hd
∆t9   0.79D         Hi            0.95I

Table 13. Attribute combination

∆A2, ∆A4, ∆A2: Hi, 0.89D, 0.80I, Hd, 0.95D, 0.75D, 0.60I, 0.88D
that the number of rules generated by the DCM algorithm is greater than that generated by fuzzified DCM (FDCM) in some circumstances and smaller in others. This experiment was performed to compare various numbers of attributes and to identify the relationships between the number of rules mined and the minimum support values. To assess the execution time of the algorithms, both FDCM and DCM with support levels were run on different data sets with different numbers of attributes, and the running times were compared. Table 15 shows the running time of the algorithms on the different data sets. The result shows that the execution time of the fuzzy algorithm is longer than that of DCM with or without support levels, though only by a small margin. It is clear that the number of rules mined increases with the number of attributes for a given minimum support threshold. Execution times also increase with the number of attributes. As shown here, the fuzzy method gives better experimental results than the method with crisp partitions; however, its run time is longer due to the fuzzification.
Table 14. Comparison of algorithms

Data Set: Adult, Bank, Cystine, Market Basket, Mclosom®, ASW®, Weka Base

Real-Life Example

The original data were given as shown in Table 16. The only data of interest are the data with changes, for example, the sales amounts of a product, the time stamp, and so forth. The rest of the static data, such as the weight and the cost of the product, can be removed. After cleaning the data, the dynamic attributes are calculated by finding the difference between the sales amounts in one month and the sales amounts in the previous month. In the next step, the neutral attributes are pruned. The idea of pruning is to remove redundant dynamic attributes so that fewer sets of attributes are required when generating rules. The first pruning is based on the single-attribute support. In this case, the support threshold is set to 0.5, which means an attribute with polarity +, -, or 0 occurring in more than half of the total dynamic time stamps will be pruned. After this step, 429 attributes remain for the rule generation. FDCM is applied to the ASW data set to extract the dynamic causal rules. The fuzzifying process is done in Excel with the following formulas:

Neutral fuzzy membership: IF(AND(x > F2, x <= 0), (F2 - x)/F2, IF(AND(x > 0, x < F3), (F3 - x)/F3, 0))
Decrease fuzzy membership: IF(AND(x >= F2, x < 0), x/F2, IF(AND(x >= F1, x < F2), (F1 - x)/(F1 - F2), 0))
High decrease fuzzy membership: IF(x < F2, (x - F2)/(F1 - F2), 0)

These IF statements are the formulas applied in Excel, corresponding to the membership functions of Equations 16 to 20. The generated rules are shown in Table 17 with respect to their behaviour.
Table 16. The original data sets
Table 17 indicates that a rule such as {A05005008, C15276179} has a 0.90% occurrence over the whole database. This rule indicates that an increase in the production of metal type A05005008 will lead to either an increase or a decrease in the production of C15276179 in the next time stamp, which will in turn lead to a decrease in the production of A05005008. This means A05005008 will stay at a roughly constant level even if its production is increased, due to its causal relation with C15276179. So, instead of looking into what can be done about A05005008, a manager should look into C15276179.
Conclusion and Future Directions

In this chapter, the fuzzy dynamic causal mining algorithm is proposed, which can handle transaction data sets with quantitative values and discover interesting rules and patterns among them. The rules thus mined exhibit quantitative regularity on multiple levels and can be used to provide suggestions to appropriate supervisors. Compared to the DCM methods for quantitative data, the FDCM approach obtains smoother mining results thanks to the fuzzification of the data sets.
This chapter also shows that fuzzy logic has a number of advantages over classical crisp analysis in resolving DCM problems. Fuzzy logic theory makes it evident that the theoretical difficulty of capturing dynamic causality is related to the binary classification of concepts such as binary polarity, which makes it difficult to link the results to empirical behaviour observed in manufacturing and other research areas. The mined rules are expressed in linguistic terms, giving them a more natural form, and the discovered knowledge is represented in a way that is easy for human users to understand. Compared to fuzzy mining methods that take all fuzzy regions into consideration, the method achieves better time complexity since only the most important fuzzy terms are used for each attribute. If all fuzzy terms were considered, the possible combinatorial searches would be too large; therefore, a trade-off exists between rule completeness and time complexity. Although the computation is more complex, the fuzzified DCM algorithm is able to use arrays as its data structure. The model is designed to assist users in making decisions and in determining the impact of certain decisions and characteristics on the system, and it is used to evaluate the performance of the target-system types subject to the domain knowledge.
This chapter has only introduced a novel approach to FDCM with a few simple membership functions. In the future, more membership functions can be applied to test the efficiency and accuracy. It is also the author's hope to expand the usage of FDCM to fields such as chemistry, bioscience, and so forth.
Table 17. The result generated by the FDCM algorithm

References

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (pp. 207-216). Washington, DC: ACM Press.

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast discovery of association rules. In Advances in knowledge discovery and data mining (pp. 307-328). The Association for the Advancement of Artificial Intelligence & the MIT Press.

Brin, S., Motwani, R., Ullman, J. D., & Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data. In Proceedings of the International Conference on Management of Data, Tucson, AZ (pp. 255-264).

Cheung, D., Han, J., Ng, V. T., & Fu, A. W. (1996). Fast distributed algorithm for mining association rules. In International Conference on Parallel and Distributed Information Systems, Tokyo (pp. 31-42).

Coyle, R. G. (1996). System dynamics modelling: A practical approach. London: Chapman and Hall.

Flood, R. (1999). Rethinking the fifth discipline: Learning within the unknowable. London: Routledge.

Forrester, J. W. (1961). Industrial dynamics: A major breakthrough for decision makers. Harvard Business Review, 37-66.

Fu, A., Wong, M. H., Sze, S. C., Wong, W. C., Wong, W. L., & Yu, W. K. (1998, October 14-16). Finding fuzzy sets for the mining of fuzzy association rules for numerical attributes. In First International Symposium on Intelligent Data Engineering and Learning, Hong Kong (pp. 263-268).

Hong, T. P., Kuo, C. S., Chi, S. C., & Wang, S. L. (2000). Mining fuzzy rules from quantitative data based on the AprioriTid algorithm. In Association for Computing Machinery Symposium on Applied Computing 2000, Como, Italy (pp. 490-495).

Kuok, C. M., Fu, A., & Wong, M. H. (1998). Mining fuzzy association rules in databases. In Special Interest Group on Management of Data (Vol. 27, pp. 41-46).

Pham, D. T., Wang, Y., & Dimov, S. (2005). Intelligent manufacturing strategy selection. In Proceedings of First International Virtual Conference on Intelligent Production Machines and Systems (pp. 312-318). Oxford, United Kingdom: Elsevier.

Pham, D. T., Wang, Y., & Dimov, S. (2006). Incorporating delay and feedback in intelligent manufacturing strategy selection. In Proceedings of Second International Virtual Conference on Intelligent Production Machines and Systems (pp. 246-252). Oxford, United Kingdom: Elsevier.

Raju, K. V. S. V. N., & Majumdar, A. K. (1988). Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM Transactions on Database Systems, 13(2), 129-166.

Senge, P., Kleiner, A., Roberts, C., Ross, R., & Smith, B. (1994). The fifth discipline fieldbook. New York: Doubleday.

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. In Proceedings of the International Conference on Management of Data (pp. 1-12).

Sterman, J. D. (1994). Learning in and about complex systems. System Dynamics Review, 10(2-3), 291-330.

Sterman, J. D. (2000). Business dynamics: Systems thinking and modelling for a complex world. Boston: Irwin McGraw-Hill.

Wang, Y. (2007). Integration of data mining with game theory. Journal of International Federation for Information Processing, 207, 275-280.

Zadeh, L. A. (1975a). The concept of linguistic variable and its application to approximate reasoning: Part I. Information Sciences, 8, 199-249.

Zadeh, L. A. (1975b). The concept of linguistic variable and its application to approximate reasoning: Part II. Information Sciences, 8, 301-357.

Zadeh, L. A. (1975c). The concept of linguistic variable and its application to approximate reasoning: Part III. Information Sciences, 9, 43-80.

Zaki, M. J., Parthasarathy, S., Ogihara, M., & Li, W. (1997). New parallel algorithms for fast discovery of association rules. Data Mining and Knowledge Discovery, 1(4), 343-373.
Key Terms

Antipathetic Rule: An antipathetic rule represents an adjustment to achieve a certain goal or objective. It indicates a system attempting to change from its current state to a goal state. This implies that if the current state is above the goal state, the system forces it down; if the current state is below the goal state, the system pushes it up. An antipathetic rule provides useful stability but resists external changes.

Association Mining: The discovery of patterns in data that involve the concepts of transaction, basket, and group. A common example of a transaction is the set of items someone buys during a supermarket trip. Not all data sets have a transaction-based structure; for example, a database of names, ages, and addresses includes no obvious transactions and will not yield association rules (http://www.wikipedia.com).

Dynamic Causal Mining: An iterative and continual process of mining rules, formulating policies, testing, and revising models. The stages are as follows.

• Stage 1: Problem definition. In this phase, the problem is identified and the key variables are established. The time horizon is also defined so that the causes and effects can be identified.
• Stage 2: Data preparation. Data are collected from various sources, and a homogeneous data source is created to eliminate representation and encoding differences.
• Stage 3: Data mining. This stage involves transforming data into rules by applying data mining tools.
• Stage 4: Policy formulation. Policies are groups of the rules extracted by mining techniques. Policies improve the understanding of the system. The interactions of different policies must also be considered, since the impact of combined policies is usually not the sum of their individual impacts; these interactions may reinforce or oppose each other. The policies can be used for behaviour simulation to predict future outcomes.
• Stage 5: Model simulation. This stage tests the accuracy of the policies. The policies predict results for new cases so that managers can alter them to improve the future behaviour of the system. It is necessary to capture the appropriate data and generate a prediction in real time so that a decision can be made directly, quickly, and accurately.

Dynamic Causal Rule: A dynamic causal rule consists of variables connected by arrows denoting the causal influences among the attributes. Two attributes A1 and A2 are linked by a causal arrow. Each causal link is assigned a polarity, and the link indicates the direction of the change. A dynamic causal rule is derived from a frequent dynamic attribute set.

Polarity: The direction of a change of an attribute. There are three types of polarity (+, -, 0): + indicates an increase, - indicates a decrease, and 0 indicates neutrality, that is, no change at all.

Sympathetic Rule: A rule that causes an increase or decrease in the output of a target system. It reinforces a change with more change in the same direction.

System Dynamics: An approach to understanding the behaviour of a target system over time. It deals with internal feedback loops and time delays that affect the behaviour of the entire system. Computer software is used to simulate a system dynamics model of the situation being studied. Running what-if simulations to test certain policies on such a model can greatly aid in understanding how the system changes over time. System dynamics is very similar to systems thinking and constructs the same causal loop diagrams of systems with feedback (http://www.wikipedia.com).

System Thinking: An approach to analysis based on the belief that the component parts of a system will act differently when isolated from their environment or from other parts of the system. Because the whole is greater than the sum of its parts (the relationships between the parts are what should be under observation), any atomistic analysis is considered reductionist. Standing in contrast to Descartes' and others' reductionism, it proposes to view systems in a holistic manner (http://www.wikipedia.com).
Appendix A: Data Sets

ASW: These data consist of real-life data. The data set contains 65,536 attributes of metal manufacturing, with eight records in each attribute.

Bank Data: These data include 600 instances of bank transactions, with 12 attributes in each instance.

Cystine Database: The data here arose from a large study examining EEG (electroencephalograph) correlates of genetic predisposition to alcoholism. It contains measurements from 64 electrodes placed on the scalp, sampled at 256 Hz (3.9-msec epoch) for 1 second.

Market Basket: A classical association data mining data set, used in Weka analysis. It consists of 100 different transactions.

Mclosom: A manufacturing database for logistics, including 72 time stamps and 50 attributes for five different classes.

Weka Base: This data set contains time series sensor readings of the Pioneer-1 mobile robot. The data are multivariate time series; a few are binary coded 0.0 and 1.0. Two categorical variables are included to delineate the trials within the data sets. The data are broken into "experiences" in which the robot takes action for some period of time and experiences a controlled interaction with its environment.
Chapter XXIX
Fuzzy Sequential Patterns for Quantitative Data Mining Céline Fiot University of Montpellier II – CNRS, France
ABSTRACT

The explosive growth of collected and stored data has generated a need for new techniques transforming these large amounts of data into useful comprehensible knowledge. Among these techniques, referred to as data mining, sequential pattern approaches handle sequence databases, extracting frequently occurring patterns related to time. Since most real-world databases consist of historical and quantitative data, some work has been done on mining the quantitative information stored within such sequence databases, uncovering fuzzy sequential patterns. In this chapter, we first introduce the various fuzzy sequential pattern approaches and the general principles they are based on. Then, we focus on a complete framework for mining fuzzy sequential patterns handling different levels of consideration of quantitative information. This framework is then applied to two real-life data sets: Web access logs and a textual database. We conclude with a discussion about future trends in fuzzy pattern mining.
INTRODUCTION

The amount of generated and collected data has been rapidly increasing in the last decades; these huge data and information collections are far outpacing our abilities to analyse, summarize, and extract knowledge. This explosive growth in stored data has generated a need for new techniques that can help in transforming these large quantities of
data into useful comprehensible knowledge. These techniques, referred to as data mining, consist of automatically extracting patterns representing knowledge implicitly contained in large databases. Some of these approaches use principles of the fuzzy set theory; an introduction to such fuzzy data mining methods by Feil and Abonyi is included as a chapter of this book.
Among these data mining techniques, only a few can easily and completely process databases containing sequences of ordered events. The most appropriate one is sequential pattern mining, which consists of searching for frequently occurring patterns related to time or another order between records. Since many real-world databases (demographic phenomena, telecommunication records, medical monitoring, production processes) are highly time-correlated, sequential pattern mining is useful for analysing such databases for targeted marketing, failure prediction, fraud detection, and so on. However, most studies focus on symbolic or qualitative sequential patterns, and the quantitative information also stored within a sequence database is often ignored. Several methods have been proposed to mine sequential patterns within historically stamped quantitative data, most of them based on the fuzzy set theory. Within the context of supermarket basket analysis, such patterns would be, for instance, "60% of people purchase a lot of candies and few video games, and later buy a lot of toothpaste." These patterns are characterized by their frequency, which is by definition the proportion of objects (customers) in the database that have recorded (bought) these sequences (of products). A survey of these fuzzy sequential pattern approaches shows that they are all based on the same principle: first, converting the quantitative data into fuzzy sets and membership degrees, and then using these degrees to extract the frequent sequences. Most of these approaches consider the fuzzy set theory as a tool to process quantitative information without the drawbacks of using crisp intervals. However, the use of fuzzy sets can bring more accurate analysis tools. More precisely, they can help in mining for gradual information while computing the frequency of sequential patterns.
For this reason, these approaches have been generalized into a complete framework that formalizes and extends the fuzzy sequence mining methods. This framework is a complete, efficient, and scalable fuzzy approach for sequential pattern mining that enables the processing of quantitative data, with different
levels of consideration of quantitative information. The end user is thus allowed to choose between the speed of result extraction and the accuracy of the obtained frequent patterns. In this chapter, we present an introduction to the fundamental basis of sequential pattern mining, generally defining the concepts of sequence and frequency. Then we briefly describe the different proposals of fuzzy sequential pattern mining methods. In the following part, we introduce the framework generalizing these approaches. We detail how a sequence frequency can be assessed and how, within the context of the fuzzy set theory, this frequency can be computed in several ways. We also explain how different levels of information can be extracted using the different frequency computations. After having theoretically detailed fuzzy sequential patterns, we show how to use this method for real-life database mining. The first application is the analysis of Web access logs. The second one consists of mining for word composition within a textual database. We finally end this chapter with a short discussion on future trends in fuzzy pattern mining.
BACKGROUND

Agrawal and Srikant (1995) initially introduced sequential patterns as a temporal extension of association rules; several proposals have since adapted them to fuzzy sequential patterns in order to handle quantitative data.
Sequential Pattern Mining

Let DB be a set of object records where each record R consists of three information elements: an object ID, a record time stamp, and a set of binary attributes in the record. An attribute can be either present, meaning its value is 1 or true, or absent, meaning its value is 0 or false. In the context of crisp sequence mining, algorithms only consider the presence of an attribute; that is, each time that
the attribute value is 1, such an attribute will be called an item. An item set is a nonempty set of items, denoted by (i1 i2 ... ik); it is a nonordered representation. A sequence s is a nonempty ordered list of item sets, denoted by <s1 s2 ... sp>. An n-sequence is a sequence of n items (or of size n). Example 1: Let us consider an example of market basket analysis. An object is a customer, and records are the transactions made by this customer. Time stamps are the dates of transactions. The items are the products bought. A customer purchases Products 1, 2, 3, 4, and 5 according to the sequence s = <(1) (2 3) (4) (5)>. All items of the sequence were bought separately, except Products 2 and 3, which were purchased at the same time. In this example, s is a 5-sequence. One sequence <s1 s2 ... sp> is a subsequence of another one <s'1 s'2 ... s'm> if there are integers l1 < l2 < ...
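In the standard formulation (Agrawal & Srikant, 1995), the condition is that there exist integers l1 < l2 < ... < lp such that s1 ⊆ s'l1, ..., sp ⊆ s'lp. A greedy check of this relation can be sketched as follows (illustrative code, not from the chapter):

```python
def is_subsequence(s, s2):
    """Greedy check: each item set of s must be included, in order,
    in a distinct item set of s2."""
    it = iter(s2)
    return all(any(set(a) <= set(b) for b in it) for a in s)

# s from Example 1: Products 1-5, with 2 and 3 bought together
s = [(1,), (2, 3), (4,), (5,)]
print(is_subsequence([(2,), (4,)], s))    # True: (2) is in (2 3), then (4) in (4)
print(is_subsequence([(4,), (2, 3)], s))  # False: the order is violated
```

The shared iterator guarantees that each item set of the candidate is matched against a strictly later item set of the sequence, which is exactly the ordering constraint of the definition.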