Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
6918
Ladjel Bellatreche Filipe Mota Pinto (Eds.)
Model and Data Engineering
First International Conference, MEDI 2011
Óbidos, Portugal, September 28-30, 2011
Proceedings
Volume Editors

Ladjel Bellatreche
Ecole Nationale Supérieure de Mécanique et d'Aérotechnique
Laboratoire d'Informatique Scientifique et Industrielle
Téléport 2 - avenue Clément Ader, 86961 Futuroscope Chasseneuil Cedex, France
E-mail: [email protected]

Filipe Mota Pinto
Instituto Politécnico de Leiria
Escola Superior Tecnologia e Gestão de Leiria
Departamento Engenharia Informática
Rua General Norton de Matos, Leiria 2411-901, Portugal
E-mail: [email protected]

ISSN 0302-9743, e-ISSN 1611-3349
ISBN 978-3-642-24442-1, e-ISBN 978-3-642-24443-8
DOI 10.1007/978-3-642-24443-8
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011936866
CR Subject Classification (1998): H.3, H.4, D.2, D.3, I.2, I.6, F.1, H.5
LNCS Sublibrary: SL 2 – Programming and Software Engineering
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
The First International Conference on Model and Data Engineering (MEDI 2011) was held in Óbidos, Portugal, during September 28–30. MEDI 2011 was a forum for the dissemination of research accomplishments and for promoting the interaction and collaboration between the models and data research communities. MEDI 2011 provided an international platform for the presentation of research on models and data theory, on the development of advanced technologies related to models and data, and on their advanced applications. This international scientific event, initiated by researchers from Euro-Mediterranean countries, also aimed at promoting the creation of north–south scientific networks, projects and faculty/student exchanges. The conference focused on model engineering and data engineering. The scope of the papers covered the most recent and relevant topics in the areas of advanced information systems, Web services, security, mining complex databases, ontology engineering, model engineering, and formal modeling. These proceedings contain the technical papers selected for presentation at the conference. We received more than 67 papers from over 18 countries and the Program Committee finally selected 18 long papers and 8 short papers. The conference program included three invited talks, namely, "Personalization in Web Search and Data Management" by Timos Sellis, Research Center "Athena" and National Technical University of Athens, Greece; "Challenges in the Digital Information Management Space" by Girish Venkatachaliah, IBM India; and "Formal Modelling of Service-Oriented Systems" by Antónia Lopes, Faculty of Sciences, University of Lisbon, Portugal. We would like to thank the MEDI 2011 Organizing Committee for their support and cooperation. Many thanks are due to Selma Khouri for providing a great deal of help and assistance. We are very indebted to all Program Committee members and outside reviewers, who reviewed the papers carefully and in a timely manner. We would also like to thank all the authors who submitted their papers to MEDI 2011; they provided us with an excellent technical program.

September 2011
Ladjel Bellatreche Filipe Mota Pinto
Organization
Program Chairs
Ladjel Bellatreche, LISI-ENSMA, France
Filipe Mota Pinto, Polytechnic Institute of Leiria, Portugal
Program Committee
El Hassan Abdelwahed, Cadi Ayyad University, Morocco
Yamine Aït Ameur, ENSMA, France
Reda Alhajj, Calgary University, Canada
Franck Barbier, Pau University, France
Maurice ter Beek, Istituto di Scienza e Tecnologie dell'Informazione, Italy
Ladjel Bellatreche, LISI-ENSMA, France
Boualem Benattallah, University of New South Wales, Australia
Djamal Benslimane, Claude Bernard University, France
Moh Boughanem, IRIT Toulouse, France
Athman Bouguettaya, CSIRO, Australia
Danielle Boulanger, Lyon-Jean Moulin University, France
Azedine Boulmakoul, FST Mohammedia, Morocco
Omar Boussaid, Eric Lyon 2 University, France
Vassilis Christophides, ICS-FORTH Crete, Greece
Christine Collet, INPG, France
Alain Crolotte, Teradata, USA
Alfredo Cuzzocrea, ICAR-CNR, Italy
Habiba Drias, USTHB, Algeria
Todd Eavis, Concordia University, Canada
Johann Eder, Klagenfurt University, Austria
Mostafa Ezziyyani, University of Abdelmalek Essâdi, Morocco
Jamel Feki, Sfax University, Tunisia
Pedro Furtado, Coimbra University, Portugal
Faiez Gargouri, Sfax University, Tunisia
Ahmad Ghazal, Teradata, USA
Dimitra Giannakopoulou, NASA, USA
Matteo Golfarelli, University of Bologna, Italy
Vivekanand Gopalkrishnan, Nanyang Technological University, Singapore
Amarnath Gupta, University of California San Diego, USA
Mohand-Said Hacid, Claude Bernard University, France
Sachio Hirokawa, Kyushu University, Japan
Eleanna Kafeza, Athens University of Economics and Business, Greece
Anna-Lena Lamprecht, TU Dortmund, Germany
Nhan Le Thanh, Nice University, France
Jens Lechtenborger, Münster University, Germany
Yves Ledru, Grenoble 1 University, France
Li Ma, Chinese Academy of Science, China
Mimoun Malki, Sidi Bel Abbès University, Algeria
Nikos Mamoulis, University of Hong Kong, China
Patrick Marcel, Tours University, France
Tiziana Margaria, Potsdam University, Germany
Brahim Medjahed, University of Michigan - Dearborn, USA
Dominique Mery, LORIA and Université Henri Poincaré Nancy 1, France
Mohamed Mezghiche, Boumerdes University, Algeria
Mukesh Mohania, IBM India
Kazumi Nakamatsu, University of Hyogo, Japan
Paulo Novais, Universidade do Minho, Portugal
Carlos Ordonez, Houston University, USA
Aris Ouksel, Illinois University, USA
Tansel Özyer, TOBB Economics and Technology University, Turkey
Heiko Paulheim, SAP, Germany
Filipe Mota Pinto, Polytechnic Institute of Leiria, Portugal
Li Qing, City University of Hong Kong, China
Chantal Reynaud, LRI INRIA Saclay, France
Bernardete Ribeiro, Coimbra University, Portugal
Manuel Filipe Santos, Universidade do Minho, Portugal
Catarina Silva, Polytechnic Institute of Leiria, Portugal
Alkis Simitsis, HP, USA
Veda C. Storey, Georgia State University, USA
David Taniar, Monash University, Australia
Panos Vassiliadis, University of Ioannina, Greece
Virginie Wiels, ONERA, France
Leandro Krug Wives, Federal University of Rio Grande do Sul, Brazil
Robert Wrembel, Poznan University, Poland
Table of Contents

Keynotes
Personalization in Web Search and Data Management (Timos Sellis) ..... 1
Challenges in the Digital Information Management Space (Girish Venkatachaliah) ..... 2
Formal Modelling of Service-Oriented Systems (Antonia Lopes) ..... 3

Ontology Engineering
Automatic Production of an Operational Information System from a Domain Ontology Enriched with Behavioral Properties (Ana Simonet) ..... 4
Schema, Ontology and Metamodel Matching - Different, But Indeed the Same? (Petko Ivanov and Konrad Voigt) ..... 18
A Framework Proposal for Ontologies Usage in Marketing Databases (Filipe Mota Pinto, Teresa Guarda, and Pedro Gago) ..... 31
Proposed Approach for Evaluating the Quality of Topic Maps (Nebrasse Ellouze, Elisabeth Métais, and Nadira Lammari) ..... 42

Web Services and Security
BH: Behavioral Handling to Enhance Powerfully and Usefully the Dynamic Semantic Web Services Composition (Mansour Mekour and Sidi Mohammed Benslimane) ..... 50
Service Oriented Grid Computing Architecture for Distributed Learning Classifier Systems (Manuel Santos, Wesley Mathew, and Filipe Pinto) ..... 62
Securing Data Warehouses: A Semi-automatic Approach for Inference Prevention at the Design Level (Salah Triki, Hanene Ben-Abdallah, Nouria Harbi, and Omar Boussaid) ..... 71

Advanced Systems
F-RT-ETM: Toward Analysis and Formalizing Real Time Transaction and Data in Real-Time Database (Mourad Kaddes, Majed Abdouli, Laurent Amanton, Mouez Ali, Rafik Bouaziz, and Bruno Sadeg) ..... 85
Characterization of OLTP I/O Workloads for Dimensioning Embedded Write Cache for Flash Memories: A Case Study (Jalil Boukhobza, Ilyes Khetib, and Pierre Olivier) ..... 97
Toward a Version Control System for Aspect Oriented Software (Hanene Cherait and Nora Bounour) ..... 110
AspeCis: An Aspect-Oriented Approach to Develop a Cooperative Information System (Mohamed Amroune, Jean-Michel Inglebert, Nacereddine Zarour, and Pierre-Jean Charrel) ..... 122

Knowledge Management
An Application of Locally Linear Model Tree Algorithm for Predictive Accuracy of Credit Scoring (Mohammad Siami, Mohammad Reza Gholamian, Javad Basiri, and Mohammad Fathian) ..... 133
Predicting Evasion Candidates in Higher Education Institutions (Remis Balaniuk, Hercules Antonio do Prado, Renato da Veiga Guadagnin, Edilson Ferneda, and Paulo Roberto Cobbe) ..... 143
Search and Analysis of Bankruptcy Cause by Classification Network (Sachio Hirokawa, Takahiro Baba, and Tetsuya Nakatoh) ..... 152
Conceptual Distance for Association Rules Post-Processing (Ramdane Maamri and Mohamed Said Hamani) ..... 162
Manufacturing Execution Systems Intellectualization: Oil and Gas Implementation Sample (Stepan Bogdan, Anton Kudinov, and Nikolay Markov) ..... 170
Get Your Jokes Right: Ask the Crowd (Joana Costa, Catarina Silva, Mário Antunes, and Bernardete Ribeiro) ..... 178

Model Specification and Verification
An Evolutionary Approach for Program Model Checking (Nassima Aleb, Zahia Tamen, and Nadjet Kamel) ..... 186
Modelling Information Fission in Output Multi-Modal Interactive Systems Using Event-B (Linda Mohand-Oussaïd, Idir Aït-Sadoune, and Yamine Aït-Ameur) ..... 200
Specification and Verification of Model-Driven Data Migration (Mohammed A. Aboulsamh and Jim Davies) ..... 214

Models Engineering
Towards a Simple Meta-Model for Complex Real-Time and Embedded Systems (Yassine Ouhammou, Emmanuel Grolleau, Michael Richard, and Pascal Richard) ..... 226
Supporting Model Based Design (Rémi Delmas, David Doose, Anthony Fernandes Pires, and Thomas Polacsek) ..... 237
Modeling Approach Using Goal Modeling and Enterprise Architecture for Business IT Alignment (Karim Doumi, Salah Baïna, and Karim Baïna) ..... 249
MDA Compliant Approach for Data Mart Schemas Generation (Hassene Choura and Jamel Feki) ..... 262
A Methodology for Standards-Driven Metamodel Fusion (András Pataricza, László Gönczy, András Kövi, and Zoltán Szatmári) ..... 270
Metamodel Matching Techniques in MDA: Challenge, Issues and Comparison (Lamine Lafi, Slimane Hammoudi, and Jamel Feki) ..... 278

Author Index ..... 287
Personalization in Web Search and Data Management
Timos Sellis
Research Center "Athena" and National Technical University of Athens, Greece
[email protected] (joint work with T. Dalamagas, G. Giannopoulos and A. Arvanitis)
Abstract. We address issues on web search personalization by exploiting users' search histories to train and combine multiple ranking models for result reranking. These methods aim at grouping users' clickthrough data (queries, result lists, clicked results), based either on content or on specific features that characterize the matching between queries and results and that capture implicit user search behaviors. After obtaining clusters of similar clickthrough data, we train multiple ranking functions (using the Ranking SVM model), one for each cluster. Finally, when a new query is posed, we combine the ranking functions that correspond to clusters similar to the query, in order to rerank/personalize its results. We also present how to support personalization in data management systems by providing users with mechanisms for specifying their preferences. In the past, a number of methods have been proposed for ranking tuples according to user-specified preferences. These methods include, for example, top-k, skyline, and top-k dominating queries. However, none of these methods has attempted to push preference evaluation inside the core of a database management system (DBMS). Instead, all ranking algorithms or special indexes are offered on top of a DBMS, hence they are not able to exploit any optimization provided by the query optimizer. In this talk we present a framework for supporting user preferences as a first-class construct inside a DBMS, by extending relational algebra with preference operators and by appropriately modifying query plans based on these preferences.
Challenges in the Digital Information Management Space
Girish Venkatachaliah
IBM, New Delhi, India
Abstract. This keynote will address the challenges in the digital information management space with specific focus on analyzing, securing and harnessing the information, what needs to be done to foster the ecosystem and the challenges/gaps that exist and the progress that is crying to be made in the coming decade.
Formal Modelling of Service-Oriented Systems
Antonia Lopes
Department of Informatics, Faculty of Sciences, University of Lisbon, Portugal
[email protected]
Abstract. In service-oriented systems interactions are no longer based on fixed or programmed exchanges between specific parties but on the provisioning of services by external providers that are procured on the fly, subject to a negotiation of service level agreements (SLAs). This research addresses the challenge raised on software engineering methodology by the need of declaring such requirements as part of the models of service-oriented applications, reflecting the business context in which services and activities are designed. In this talk, we report on the formal approach to service-oriented modelling that we have developed, which aims at providing formal support for modelling service-oriented systems in a way that is independent of the languages in which services are programmed and the platforms over which they run. We discuss the semantic primitives that are being provided in SRML (SENSORIA Reference Modelling Language) for modelling composite services, i.e., services whose business logic involves a number of interactions among more elementary service components as well as the invocation of services provided by external parties. This includes a logic for specifying stateful, conversational interactions, a language and semantic model for the orchestration of such interactions, and an algebraic framework supporting service discovery, selection, and dynamic assembly.
Automatic Production of an Operational Information System from a Domain Ontology Enriched with Behavioral Properties
Ana Simonet
AGIM Laboratory, Faculté de Médecine, 38700 La Tronche, France
[email protected]
Abstract. The use of a domain ontology and the active collaboration between analysts and end-users are among the solutions aiming at the production of an Information System compliant with end-users' expectations. Generally, a domain ontology is used to produce only the database conceptual schema, while other diagrams are designed to represent other aspects of the domain. In order to produce a fully operational Information System, we propose to enrich a domain ontology with behavioral properties deduced from the User Requirements, expressed by the input and output data necessary for the realization of the end-users' business tasks. This approach is implemented by the ISIS (Information System Initial Specification) system. The behavioral properties make it possible to deduce which concepts must be represented by objects, literals or indexes in the generated Information System, where the Graphical User Interface enables the users to validate the expressed needs and refine them if necessary.
Keywords: Ontology, Information System, User Requirements, Database Design.
1 Introduction
The design of Information Systems (IS) is confronted with more and more complex domains and more and more demanding end-users. To enable a better acceptance of the final system, various solutions have been proposed, among which the active collaboration between analysts and end-users [6] and the use of a domain ontology [17]. Such collaboration during the design of an Information System favors a better understanding of the user requirements by analysts and thus limits the risk of rejection of the IS. However, in the short term, such collaboration increases the global cost of the project, as it requires a higher availability of both parties, which may question the very feasibility of the project [4]. It also requires a common language, mastered by both parties, in order to limit the ambiguities in the communication. Ontologies have been proposed to support the communication between the various actors in a given domain: a domain ontology (in short, an ontology) expresses an agreement of the actors of a domain upon its concepts and their relationships. Reusing the knowledge represented in an ontology allows the analyst to define a more
comprehensive and consistent database conceptual schema more quickly [17]. Moreover, as an ontology plays the role of a semantic referential in a given domain, it is easier to make the resulting IS collaborate with other IS, especially when they are based on the same referential. This property is particularly true when the link between the ontology and the IS is explicitly maintained, as in [8]. However, the use of an ontology is generally limited to the sole design of the conceptual schema of the database (class diagram or E-R schema) of the IS. According to the analysis method used (e.g., UML), other diagrams (e.g., use case diagram, state transition diagram, sequence diagram, activity diagram, …) have to be designed to represent other aspects of the domain under study. A common language cannot rely on such methods, because of the number and the complexity of the models end-users have to master in order to collaborate with the analyst [12]. We have chosen a binary relational model, the ISIS data model, as the support for the common language, because such models have a limited number of meta-concepts [1][18], easily mastered by non-computer scientists. Moreover, in order to limit the number of meta-concepts, rather than using several models to represent various aspects of a domain, we chose to use a single model and enrich it with Use Cases modeling the user requirements. The ISIS data model has three meta-concepts: concept, binary relation, and ISA relation. In ISIS a domain ontology is represented by a graph, named Ontological Diagram (OD), where the nodes represent concepts and the arcs represent the relations between concepts (nodes model classes and attribute domains of a class diagram; arcs model attributes and roles). This graph is enriched with constraints (e.g., minimal and maximal cardinalities of relations, relations defining the semantic key of the instances of a concept). The criticity and the modifiability of the relations of an OD are the two main behavioral properties we have identified in order to automatically transform concepts of the ontological level into computer objects of the implementation level. The criticity of a relation expresses that this relation is necessary for at least one of the Use Cases modeling the user requirements. The modifiability of a relation with domain A and range B expresses that there exists at least one instance of A whose image in B changes over time (non-monotonicity). However, deciding on the criticity or the modifiability of the relations of an OD is outside the capabilities of end-users collaborating in the IS design, as their knowledge is centered on the data and rules they need to perform their business tasks. This data, made explicit in the ISIS methodology through the input and output parameters of each Use Case, allows us to infer which relations are critical and/or modifiable. We then deduce the concepts that, for a given set of Use Cases, can be omitted, thus leading to a sub-ontology of the original OD. The concepts that should be represented as objects, values, or indexes in the implementation of the application are then proposed. Following the designer's choices, ISIS proceeds to the automatic generation of the database, the API and a prototype GUI of the IS. These software artifacts enable end-users to verify the adequacy of the IS to their needs and refine them if necessary. The paper is organized as follows. We first present some notions of the ontological and the implementation levels, then the ISIS project, its model and some properties necessary for the production of an operational system. Finally, we present the ISIS methodology and platform through an example.
2 From Ontological Level to Implementation Level
The classical design of an IS entails the conceptualization of the domain under study and the representation in several models of the analyst's perception of the static and dynamic phenomena [3]. For example, in UML, these models are: 1) an object model that supports the representation of classes, attributes, relationships, … of the entities of the domain under study; 2) dynamic models that support the description of valid object life cycles and object interactions; and 3) a functional model that captures the semantics of the changes of the object state as a consequence of a service occurrence [10]. These representations allow the analyst to increase his understanding of the domain, help communication between developers and users, and facilitate the implementation of the information system. Domain ontologies represent agreed domain semantics and their fundamental asset is their independence of particular applications: an ontology consists of relatively generic knowledge that can be reused by different kinds of application [7][16]. Unlike ontologies, the design of an object diagram (also called a data model) takes into account the application under consideration. In order to support the production of the target system, the database designer has to choose the entities of the domain which must be represented as (computer) objects and those which must be represented by literals [5]. In the ISIS project, we attempted to establish the behavioral properties that support automatic transformations leading from a domain ontology to an operational IS. To identify these properties we relied on the ODMG norm [5]. This norm specifies the criteria to distinguish between two categories of entities: objects and literals (also named values); objects as well as literals are entities whose values can be atomic, structured, or collections. Contrary to an object, whose value can change during its lifetime ("mutable" is the term used by the ODMG group to qualify values that change over time; in the following, we use the term "modifiable" as a synonym of "mutable"), the value of a literal is immutable. In order to automatically choose the entities that must be represented by objects, the computer system must know how to decide whether the value of an entity is or is not modifiable. This raises an issue that is rarely considered as such: «what does the value of an entity mean?» In a binary relational model, the value of an instance of a concept A is an element of the Cartesian product of the concepts which are the ranges of all the binary relations having A as domain. In short, the value of an instance of A is given by the set of binary relations with domain A. Consequently, the value of an instance of A is modifiable iff at least one of the binary relations with domain A is modifiable. Thus, the problems we have to solve are:
1. Among all the binary relations "inherited" from a domain ontology or defined for the application, which ones are actually necessary to model the user requirements? Such relations are called critical.
2. Which binary relations are modifiable?
An Ontological Diagram enriched with the critical and modifiable properties enables the ISIS system to deduce which concepts should be represented as values, as objects, and which ones are potential indexes. However, to produce the prototype GUI we also need to consider the sub-graphs proper to each Use Case.
3 The ISIS Project
ISIS is the acronym for Information Systems Initial Specification. It has two models, a data model and a Use Case model. It offers a methodology and a tool for the design of an Information System (database, API, and prototype GUI) from a Domain Ontology and a set of Use Cases modeling the user requirements. All the IS design methodologies use a central model (a class diagram in object methods, a conceptual schema in E-R models, a logical schema in relational databases). In this article, we will refer to such a model as a conceptual model. Contrary to other approaches that use different models of a method to represent different aspects of a domain, we use a unique model, the so-called conceptual model, to express the static properties of the entities of the domain as well as their dynamic (behavioral) properties. Naming the concepts (e.g., vendor, customer, product, price, quantity, address, …) of the domain and their interrelationships is the first step in the design of an IS following the ISIS approach. This is ontological work, and the input to ISIS can be an existing domain ontology or a micro-ontology of the considered application domain that is built at the time of the design. In both situations, the ontological structure that constitutes the input to the design process is called an Ontological Diagram (OD). The behavioral properties are deduced from the user requirements, expressed by the Use Cases needed for the users' business tasks. We have classified the Use Cases into two categories: those whose objective is to consult existing enterprise data (e.g., for a patient, the dates of his consultations and the name of the doctor), and those in which objects can be created, modified, or suppressed (e.g., create a new patient). We call the former Sel-UC (for Selection Use Cases) and the latter Up-UC (for Update Use Cases). Usually, a Use Case can be modeled by a single query. Each query has input and output data, represented by concepts of the OD. For example, the above Sel-UC is interpreted as (input: patient; output: consultation date, name of doctor). In the context of an OD, the set of Sel-UC enables ISIS to determine the subgraph of the OD that is actually needed to produce the physical database schema and the API of the functional kernel of the application. The whole set of Use Cases enables ISIS to produce the prototype GUI and the operational system.
3.1 The ISIS Data Model
The ISIS model belongs to the family of binary relational models [1][18]. Its meta-concepts are concept, binary relation (or simply relation), and subsumption (or specialization) relation. The concepts of a given domain and their relationships are represented through the OD graph.
Definitions
- A concept is an intensional view of a notion whose extensional view is a set of instances.
- A binary relation R between two concepts A (domain) and B (range), noted R(A,B), is considered in its mathematical sense: a set of pairs (a, b) with a ∈ A and b ∈ B.
- The image of x through R, noted R(x), is the set of y such that R(x,y); R(x)_t is the image of x through R at time t.
- An association is a pair of binary relations, reverse of one another.
- A subsumption relation holds between two concepts A and B (A subsumes B) iff B is a subset of A.
In an OD, static properties (or constraints) of concepts and relations are given. Among these, only the minimal (generally 0 or 1) and maximal (generally 1, * or n) cardinalities of relations and unicity constraints are mandatory. Other constraints, such as Domain Constraints and Inter-Attribute Dependencies, may be considered for the production of Intelligent Information Systems under knowledge-based models such as Description Logics [13][14]. In ISIS we consider three categories of concepts: predefined concepts, primary concepts and secondary concepts. Predefined concepts correspond to predefined types in programming languages, e.g., string, real, integer. Primary and secondary concepts are built to represent the concepts specific to a domain. Primary concepts correspond to those concepts whose instances are usually considered as atomic; a secondary concept corresponds to concepts whose instances are «structured». We name valC the relation whose domain is a primary concept C and whose range is a predefined concept (e.g., valAge, valName). Fig. 1 represents the ISIS complete diagram designed to model persons with a name and an age.
Fig. 1. ISIS complete OD modeling person name and age (the secondary concept PERSON is linked to the primary concepts NAME and AGE through the relations nameOf and ageOf, with reverses PersonWithName and PersonWithAge; NAME and AGE are linked to the predefined concepts STRING and INTEGER through valName and valAge; the relations carry 1..1 and 1..* cardinalities)
Representing the predefined concepts of an OD increases its complexity. Thus predefined concepts and their relations with primary concepts are masked in the external representation of the OD (see the ISIS diagrams that follow). The static constraints govern the production of the logical database schema. Behavioral constraints are necessary to automatically produce the physical database schema and the associated software. In ISIS, the main behavioral properties that are considered are the criticity and the modifiability of a relation [15] (in an object model the behavioral properties are expressed as class methods; a modifiable relation is a non-monotonous relation).
3.2 Critical Relations
The ultimate purpose of an IS is expressed through its selection queries, hence our choice of these queries to decide which relations are critical. Our main criterion in selecting the critical relations is to consider the relations participating in at least one selection query of Sel-UC (critical query). However, the designer can decide to make any relation critical, independently of critical queries. The update queries are needed to ensure that, at every moment, the data in the IS comply with the data in the real world; they are not considered in the determination of the critical relations.
Definitions
- A selection query is critical iff it is part of Sel-UC.
- A selection query Q is defined by a triple (I, O, P) where I is the set of input concepts of Q, O is the set of output concepts of Q, and P is a set of paths in the OD graph. The triple (I, O, P) defines a subgraph of the OD.
- A path p(i, o) in a query (I, O, P) is an ordered set of relations connecting i ∈ I to o ∈ O.
- A binary relation is critical iff it belongs to at least one critical selection query or if it has been explicitly made critical by the designer.
- Given a concept CC, domain of the critical relations r1, r2, …, rn, with C1, C2, …, Cn the range concepts of r1, r2, …, rn: the value of an instance cck ∈ CC is an element of the Cartesian product C′1 × C′2 × … × C′n, where C′i = Ci if ri is monovalued (max. card. = 1) and C′i = P(Ci) if ri is multivalued (max. card. > 1), P(Ci) denoting the set of parts (powerset) of Ci.
- An association in an OD is critical iff at least one of its relations is critical.
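The criticality deduction lends itself to a very small implementation. The following Python sketch is our own illustration of the rule, not the actual ISIS code; all class, function and relation names are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    name: str
    domain: str          # name of the domain concept
    range_: str          # name of the range concept
    min_card: int = 1    # 0 marks a partial relation
    max_card: str = "1"  # "1" = monovalued, "*" = multivalued

@dataclass
class SelectionQuery:    # a Sel-UC query as a triple (I, O, P)
    inputs: set          # I: input concepts
    outputs: set         # O: output concepts
    paths: list          # P: each path is an ordered list of Relations

def critical_relations(sel_uc, forced=frozenset()):
    """A relation is critical iff it lies on a path of some Sel-UC query,
    or has been explicitly made critical by the designer ('forced')."""
    critical = set(forced)
    for query in sel_uc:
        for path in query.paths:
            critical.update(rel.name for rel in path)
    return critical

# Fragment of UC1 from the example in Sect. 4:
# name(group) -> group -> concert -> date.
groupOfName    = Relation("groupOfName", "name(group)", "group")
concertOfGroup = Relation("concertOfGroup", "group", "concert", max_card="*")
dateOfConcert  = Relation("dateOfConcert", "concert", "date")
uc1 = SelectionQuery({"name(group)"}, {"date"},
                     [[groupOfName, concertOfGroup, dateOfConcert]])
assert "concertOfGroup" in critical_relations([uc1])
```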
3.3 Modifiable Relations and Concepts
Definition
- Given a relation R(A, B), R is modifiable iff there exists a ∈ A such that R(a)_t is different from R(a)_t+1.
To express the modifiability property we had to extend the classical binary relational model, which has only two categories of nodes (e.g., concrete (structured) and abstract (atomic) sets in Z0; lexical and non-lexical objects in NIAM [18]), to a model with three types of nodes (§3.1). This difference is illustrated by Fig. 1 and Fig. 2. Fig. 2, extracted from [1], represents the Z0 schema of a set person with two access functions, ageOf and nameOf, where the notion of access function ("a function which maps one category into the powerset of another") is derived from that of relation. This representation induces a representation of ageOf and nameOf as attributes of a class/entity (in the object/E-R model) or of a table (in the relational model) person.
Fig. 2. Z0 schema modeling persons with name and age [1] (PERSON with access functions nameOf, to STRING, and ageOf, to INTEGER, and reverses Person-with-name and Person-with-age)
In ISIS (Fig. 1), person, age and name are concepts, and so are string and integer. Thanks to the behavioral properties, ISIS will propose that a given concept be represented as an object or as a literal, according to the ODMG classification [5], and, among the objects, propose those that are candidates to become database indexes.
A. Simonet
Let us consider the concept person represented in Fig. 1 and Fig. 2, and the Use Case change the age of a person in the context of Korea and other Asian countries, where a person changes his age on Jan. 1st at 0h [11]. In common design situations, representing age by a literal or by an object is (manually) decided by the designer. If he is conscious that an age update on Jan. 1st will concern millions or billions of persons he will choice an object representation for age, which leads to at most 140 updates (if the age ranges from 0 to 140) instead of millions or billions with a representation as an attribute. This «best» solution cannot be automatically produced from the diagram of Fig. 2 where the only relation10 that may be modifiable is ageOf. In the ISIS representation (Fig. 1) two relations are potentially modifiable: ageOf and valAge. Making ageOf modifiable models the update of one person, whereas making valAge modifiable models the update of all the persons with a given age. The best modeling for Korea is then to consider valAge as a modifiable relation, but the best modeling for Europe is to consider ageOf as the modifiable relation. Distinguishing primary concepts and predefined concepts is necessary to differentiate these two ways of modeling the update of the age of a person. The same reasoning applies to primary concepts such as salary: either change the salary of one person (who has been promoted) or change the salary of all the persons belonging to a given category. Definition -
A concept is said to be a concept with instances with modifiable values, or simply modifiable concept11, iff it is the domain of at least a binary relation that is both critical and modifiable. A primary concept CP has a semantic identifier iff its relation valCp is not modifiable.
Considering the example of Fig. 1, we can model person as a modifiable concept for European countries, whereas in the Korea case the modifiable concept is age. In a European context, age has a semantic identifier, whereas in the Korean one it has not. 3.4 Object/Literal Deduction Conceptually, a computer object is represented by a pair , which provides a unique representation of the value of an object, as the oid is non modifiable and is used to reference the object wherever it is used. We present two of the rules used to infer which concepts should be represented by computer objects in an application. Rule 1: A modifiable concept is an object concept. Considering the object/value duality, only an object has an autonomous existence. A value does not exist by itself but through the objects that « contain » it, hence our second proposal. Rule 2: A concept t domain of a partial12 relation rel, which belongs to a critical association, is an object concept. A concept that is not an object concept is a value concept. 10
Called access function in Z0. Note that it is not the concept itself that is mutable but its instances. 12 Minimal cardinality equals 0 11
Automatic Production of an Operational Information System
11
3.5 Deduction of Potential Indexes An index is usually perceived as an auxiliary structure designed to speed up the evaluation of queries. Although its internal structure can be complex (B-tree, Bitmap, bang file, UB-tree … [9]) an index can be logically seen as a table with two entries: the indexing key, i.e., the attribute(s) used to index a collection of objects, and the address of the indexed object (tuple, record …). However, a deeper examination of indexing structures reveals a more complex situation. The component that manages an index: 1. is a generic component instantiated every time a new indexing is required. 2. contains procedures to create, modify, and delete objects from an index, i.e., objects of the indexing structure. These procedures are implicitly called by the procedures and functions that create, modify or delete objects from the class/table being indexed. 3. contains procedures to retrieve objects of the indexed class/table in an efficient manner. Again, a programmer does not call explicitly such procedures and functions. From the moment at which the programmer asks for the creation of an index, for example on a table, the computer system fully manages this auxiliary internal structure. Therefore an index can be seen as a generic class of objects transparently managed by the computer system. The generic class index disposes of two attributes: the first one, indexValue: indexingValue, provides its identifier, and the second, indexedElement: SetOf (indexedConcept), gives the address of the indexed objet (or the indexed objects if doubles are accepted). In order to allow the automatic management of the generic class « index », its identifier must be a semantic identifier. When the index models a simple index, indexingValue is a primary concept and the (implicit) relation valIndexigValue is not modifiable. For example, if one wants to index the table person by age, valAge, with domain age and range integer, must not be modifiable, in order to play the role of semantic key of indexingValue, and, transitively, of index. When the index is a complex one, indexingValue represents an implicit secondary concept that is domain of the relations r1, r2… rn whose ranges are the primary concepts that define the value of the indexing key. Each of these primary concepts must have a semantic identifier. The Cartesian product of these semantics identifiers constitutes the semantic key of the index concept. Definition A concept c is a Potential Index concept iff 1) it has a semantic identifier and 2) it is the domain of one and only one critical relation (apart from the relation defining its semantic identifier). Following this definition, the primary concept indexingValue whose relation valIndexingValue is non modifiable, satisfies the first condition. They are Potential Index concepts if the second condition of the definition is also satisfied. Considering Fig. 1 and a European application, age may be a Potential Index because the relation valAge is a non modifiable relation. It is effectively a Potential Index iff the relation personWithAge is a critical relation. In a Korean IS [11], where valAge is a modifiable relation, age cannot be a Potential Index concept because it does not have a semantic identifier.
12
A. Simonet
4 ISIS Methodology and Tool through an Example The first step in the ISIS approach is the design or the import of an OD. The OD can be checked for well-formedness (absence of cycle, no relation between two primary concepts …). Fig. 3 shows a simplified OD of a concert management application.
Fig. 3. OD of a concert management application
On the OD of Fig. 3 are represented: 1) primary concepts (e.g., style, name, date …); 2) secondary concepts (e.g., group, contact, concert …); 3) ISA relations (pastConcert is a subconcept of concert); and 4) binary relations (e.g., 13
concertOfGroup , styleOfGroup).
Pairs such as (1,1) or (0,*) represent respectively the minimal and maximal cardinalities of a binary relation. concertOfGroup and groupOfConcert is a binary association. The relation defining the semantic identifier (key) of a secondary concept is represented by an arc with black borders and a key symbol attached to the concept. Representation of Use Cases To enable ISIS to deduce the behavioral properties one annotates the OD with the input and output parameters of each Use Case (UC). Let us consider: UC1 – concerts given by a group: given a group (identified by its name), find the concerts he
has given; for each concert, display its date, its benefit, its number of spectators, the name of the concert place, of the town and of the country. UC2 – planned concerts of a group: given a group (name), display its style, the date and the price of each concert, the name of the city and its access, the name of the country and its currency. UC3 – information about a concert place: for a concert place, display its max number of spectators, location price, phone and email, city and country, and the concerts (date, name and style of group). UC4 – groups of a style: for a given style display the name of the groups and their contact. UC5 - new group: create a new group. UC6 - new concert: create a new concert. 13
The name aaOfBb is automatically produced by ISIS for a relation with domain aa and range bb. This name can be changed. The name of a relation is optionally shown on the OD.
Automatic Production of an Operational Information System
13
UC7 - new concert place: create a new concert place. UC8 - update concert: update the benefit and the number of spectators of a concert.
The first four are Sel-UC and the last four are Up-UC. For each UC the designer identifies the concepts it concerns and annotates them as input or output14 concepts. UC1: in {name(group)}, out {date, benefit, nbSpectators, name(concertPlace), name(city), name(country)}; UC2: in { name(group)}, out {style, date, ticketPrice, name(city), name(country), currency}; UC3: in {name(concertPlace)},out {maxSpectators, locationPrice, phone, email, name(city), name(country), date, name(group), style}; UC4: in {style}, out {name(group), name(contact)}; UC5: in {group}; UC6: in {concert}; UC7: in {concertPlace}; UC8: in {concert} out {benefit, nbSpectators}.
Fig. 4 illustrates UC1. The designer annotates the concept name(group) as input concept (downward arrow ) and the concepts date, benefit, nbSpectators, name(place) and name(city) as output concepts (upward arrow ). ISIS calculates and presents the paths between the input and the output concepts. The intermediate concepts, such as concert, are automatically annotated with a flag ( ). All the relations of a path in a query of Sel-UC are automatically annotated by ISIS as critical. In Fig. 4, groupOfName, concertOfGroup, dateOfConcert, … are critical. When different paths are possible between an input concept and an output concept, in order to ensure the semantics of the query, the designer must choose the intermediate concepts by moving the flag(s). When there are several relations between two concepts, the designer must select one of them.
Fig. 4. Subgraph of the Use Case UC1
Sub-Ontology Extraction From the annotations of all the queries in Sel-UC ISIS deduces a diagram that is the smallest subgraph that contains the subgraphs of all the queries of Sel-UC and proposes to suppress the concepts that are not needed. For example, if the only query of the IS is the one presented in Fig. 4, the relations of the associations between group-contact or between group-style are not critical. Thus, the designer must decide 14
Input and output concepts correspond to input and output concepts of the procedures of the functional kernel of the IS.
14
A. Simonet
either to suppress these associations or to make critical one of their relations. When a concept becomes isolated from the other concepts of the OD, it is suppressed. This constitutes the first phase of the simplification process where the objective is to determine the sub-ontology of an application. Diagram Simplification Before generating the IS, ISIS proceeds to a second phase of simplification, by proposing to eliminate the concepts that do not bear information significant for the business process. For example, considering again the query of Fig. 4 as the only query of the IS, the concept region can be eliminated, as it acts only as an intermediate concept linking city to country. Contrary to the first simplification phase, the result of the second one may depend on the order of the choices. Fig. 5 shows the simplified sub-ontology obtained by taking into account the whole set of Sel-UC. The concepts contact, region, address, name(region) … have been suppressed and will not appear in the generated application, because they are not used in any of the four Sel-UC of this example and the designer has agreed with their suppression. Object, Value and Index Deduction From the update queries of Up-UC, ISIS deduces which relations are modifiable (cf. § 3.3). For example, UC6 (new concert) enables ISIS to deduce that the relations concertOfGroup and concertOfConcertPlace are modifiable. Considering the critical modifiable relations, ISIS deduces which concepts should be represented as values or as objects, and among the latter which ones are proposed to become indexes of the generated database. Again, the designer may decide to make other choices. Fig. 5 also shows the object-value-index deductions on the example.
Fig. 5. Simplified OD with Object-Value-Index deductions for the concert example
• group, concert, pastConcert, concertPlace, city and are object (secondary) concepts; access is an object (primary) concept. • name(group), style, date, name(concertPlace) and name(country) are potential indexes. • The other concepts are value concepts.
Automatic Production of an Operational Information System
15
As country is not the domain of any modifiable critical relation, ISIS does not propose to implement it as an object but the designer can decide to make a different choice. If the ISIS proposal is accepted, name(country) becomes a potential index of city. Generation of Software Artifacts In the last step ISIS generates the application, i.e., the database, the API (i.e., the code of the queries of the Use Cases) and a prototype GUI. Fig. 6 shows the GUI corresponding to UC1 (concerts given by a group) in the PHP-MySQL application that is automatically generated.
Fig. 6. Prototype GUI: screen copy of the window generated for UC1
The prototype GUI has Spartan ergonomics: first the monovalued attributes are presented in alphabetical order, then the multivalued attributes if any. In spite of these basic ergonomics, it enables users to verify the items and their type. They can also check whether the dynamics of windows corresponds to the needs of their business process.
5 Conclusion and Perspectives The reuse of a domain ontology and the collaboration between analysts and end-users during the design phase of the IS are two of the solutions proposed to favor a better acceptation of the final system. Generally the domain ontology is only used to support the design of the conceptual schema of the IS database [8][17]. From our experience, a conceptual database schema (e.g., UML class diagram or E-R schema) concerns analysts rather than end-users, whose knowledge is not sufficient to master the metaconcepts that are used and, consequently, are only able to validate the terms used. Moreover, as they interpret them in their own cultural context, two users validating the same schema may actually expect different systems. An active collaboration between designers and end-users necessitates a common language, mastered by both parties, in order to enable them to quickly identify possible misunderstandings [6]. It also requires a high degree of availability of both parties in order that user requirements and business rules be understood by the designer [4]. To avoid increasing the cost of the project, we propose a common
16
A. Simonet
language based on a single model and at the automatic production of an operational IS that can be immediately tested by end-users. The common language is based on a binary relational model, which has a limited number of meta-concepts. Contrary to other methods that propose several models to represent the static and the dynamic properties of the entities of a domain, in ISIS we chose to enrich the ontological diagram with the Use Cases representing the user requirements. This enrichment allows deducing the subgraph proper to each functionality of the IS. It also allows the deduction of the behavioral properties of the concepts of a domain, properties which, in an object model, are expressed by the methods of the business classes. The main two behavioral properties we have identified are the criticity and the modifiability of a relation [15]. However, deciding which relations are critical or modifiable is outside the capabilities of end-users, whereas they know the data they use for their business tasks. This data is made explicit in ISIS through the input and output parameters of the queries of the Use Cases. From these parameters ISIS infers which relations are critical and/or modifiable. ISIS then deduces and proposes the concepts that should be omitted. For the concepts belonging to the sub-ontology of the application, ISIS proposes the concepts that should be represented as values, objects, or indexes at the implementation level. The designer can accept or refuse these proposals. ISIS then proceeds to the automatic generation of the database, the API and a prototype GUI of the IS. This approach leads to a reduction of the cycle « expression-refinement of needs / production of target system / validation », during the analysis process. Consequently, the number of these cycles can grow without increasing the global cost of the project and the final result can be close to the real needs of the users. The current ISIS tool has been developed in Java with a dynamic web interface. It generates a PHP-MySQL application. A console also enables the programmer to write SQL code, which makes it possible to write more complex queries. ISIS is currently being used for the design of an ontological diagram of « quality » in computerassisted surgery [2]. It will support the design of an IS to study the « quality » of an augmented surgery device. Future work encompasses the introduction of constraints as pre-conditions of a query in order to model the relationship between a Use Case and the state of the objects it uses, the generation of UML and E-R diagrams [15], and the use of the ISIS methodology for the integration of heterogeneous databases. Integrating linguistic tools to help the designer select the input and output concepts necessary for the Use Cases is also a future step of the ISIS project. Acknowledgments. The author wants to thank Michel Simonet, who played a central role in the gestation and the development of the ISIS project. She also thanks Eric Céret, who designed the current web version of ISIS, and Loïc Cellier who continues its development. She is grateful to Cyr-Gabin Bassolet, who designed and implemented the early prototypes and participated actively in the first phases of the project.
References 1. Abrial, J.R.: Data Semantics. In: Klumbie, J.W., Koffeman, K.I. (eds.) Database Management, pp. 1–59. North-Holland, Amsterdam (1974)
Automatic Production of an Operational Information System
17
2. Banihachemi, J.-J., Moreau-Gaudry, A., Simonet, A., Saragaglia, D., Merloz, P., Cinquin, P., Simonet, M.: Vers une structuration du domaine des connaissances de la Chirurgie Augmentée par une approche ontologique. In: Journées Francophones sur les Ontologies, JFO 2008, Lyon (2008) 3. Burton-Jones, A., Meso, P.: Conceptualizing Systems for Understanding: An Empirical Test of Decomposition Principles in Object-Oriented Analysis. Information Systems Research 17(1), 38–60 (2006) 4. Butler, B., Fitzgerald, A.: A case study of user participation in information systems development process. In: 8th Int. Conf. on Information Systems, Atlanta, pp. 411–426 (1997) 5. Cattell, R.G.G., Atwood, T., Duhl, J., Ferran, G., Loomis, M., Wade, D.: Object Database Standard: ODMG 1993. Morgan Kaufmann Publishers, San Francisco (1994) 6. Cavaye, A.: User Participation in System Development Revisited. Information and Management (28), 311–323 (1995) 7. Dillon, T., Chang, E., Hadzic, M., Wongthongtham, P.: Differentiating Conceptual Modelling from Data Modelling, Knowledge Modelling and Ontology Modelling and a Notation for Ontology Modelling. In: Proc. 5th Asia-Pascific Conf. on Conceptual Modelling (2008) 8. Fankam, C., Bellatreche, L., Dehainsala, H., Ait Ameur, Y., Pierra, G.: SISRO: Conception de bases de sonnées à partir d’ontologies de domaine. Revue TSI 28, 1–29 (2009) 9. Housseno, S., Simonet, A., Simonet, M.: UB-tree Indexing For Semantic Query Optimization of Range Queries. In: International Conference on Computer, Electrical, and Systems Science, and Engineering, CESSE 2009, Bali, Indonesia (2009) 10. Isfran, I., Pastor, O., Wieringa, R.: Requirements Engineering-Based Conceptual Modelling. Requirements Engineering 7, 61–72 (2002) 11. Park, J., Ram, S.: Information Systems: What Lies Beneath. ACM Transactions on Information Systems 22(4), 595–632 (2004) 12. Pastor, O., Gomez, J., Insfran, E., Pelechano, E.: The OO-Method for information system modeling: from object-oriented conceptual modeling to automated programming. Information Systems 26, 507–534 (2001) 13. Roger, M., Simonet, A., Simonet, M.: A Description Logic-like Model for a Knowledge and Data Management System. In: Ibrahim, M., Küng, J., Revell, N. (eds.) DEXA 2000. LNCS, vol. 1873, p. 563. Springer, Heidelberg (2000) 14. Simonet, A., Simonet, M.: Objects with Views and Constraints: from Databases to Knowledge Bases. In: Patel, D., Sun, Y., Patel, S. (eds.) Object-Oriented Information Systems, OOIS 1994, pp. 182–197. Springer, London (1994) 15. Simonet, A.: Conception, Modélisation et Implantation de Systèmes d’Information. Habilitation à Diriger des Recherches. Université de Grenoble (2010) 16. Spyns, P., Meersman, R., Jarrar, M.: Data modeling versus Ontology engineering. SIGMOD Record 31(4), 12–17 (2002) 17. Sugumaran, V., Storey, V.C.: The role of domain ontologies in database design: An ontology management and conceptual modeling environment. ACM Trans. Database Syst. 31, 1064–1094 (2006) 18. Weber, R.: Are Attributes Entities? A Study of Database Designers’ Memory Structures. Information Systems Research 7(2), 137–162 (1996)
Schema, Ontology and Metamodel Matching – Different, But Indeed the Same?
Petko Ivanov and Konrad Voigt
SAP Research Center Dresden, Chemnitzer Strasse 48, 01187 Dresden, Germany
{p.ivanov,konrad.voigt}@sap.com
Abstract. During the last decades data integration has been a challenge for applications processing multiple heterogeneous data sources. It has been faced across the domains of schemas, ontologies, and metamodels, inevitably imposing the need for mapping specifications. Support for the development of such mappings has been researched intensively, producing matching systems that automatically propose mapping suggestions. Since an overall relation between these systems is missing, we present a comparison and overview of 15 systems for schema, ontology, and metamodel matching. Thereby, we pursue a structured analysis of applied state-of-the-art matching techniques and the internal models of matching systems. The result is a comparison of matching systems, highlighting their commonalities and differences in terms of matching techniques and the information used for matching, demonstrating significant similarities between the systems. Based on this, our work also identifies possible knowledge sharing between the domains, e.g., by describing techniques adoptable from another domain.
1 Introduction
For the last decades data integration has been a well-known challenge for applications processing multiple heterogeneous data sources [1]. The fundamental problem concerns the exchange of data and interoperability between two systems developed independently of each other. Usually, each system uses its own data format for processing. To avoid a reimplementation of a system, a mapping between the different system formats is needed. The specification of such a mapping is the task of matching, i.e., the specification of semantic correspondences between the formats' elements. This task is tedious, repetitive, and error-prone if performed manually; therefore, support through the semi-automatic calculation of such correspondences has been proposed [2].

Several systems have been developed to support the task of schema, ontology, and metamodel matching by the calculation of correspondences. Although all systems tackle the problem of metadata matching, they were and are researched in a relatively independent manner. Therefore, we want to provide an overview of matching systems from all three domains. This overview facilitates the choice of a matching system for a given matching problem.
Furthermore, it identifies how knowledge from other domains can be reused in order to improve a matching system.

Prior to studying matching systems, one needs to clarify the relation of the domains of schemas, ontologies, and metamodels. In this work, we adopt the perspective of Aßmann et al. [3], who studied the relation of ontologies and metamodels. We extended the perspective by including XML schemas. Schemas, ontologies, and metamodels provide vocabulary for a language and define validity rules for the elements of the language. The difference lies in the nature of the language: it is either prescriptive or descriptive. Schemas and metamodels are restrictive specifications, i.e., they specify and restrict a domain in a data model or systems specification; hence they are prescriptive. As a complement, ontologies are descriptive specifications and as such focus on the description of the environment. Therefore, while using a similar vocabulary made of linguistic and structural information, the three domains differ in their purpose.

Having a different purpose, matching systems for the three domains of schema, ontology, and metamodel matching have been developed independently. First, (1) schema matching systems have been developed mainly to support business and data integration, and schema evolution [2,4,5,6,7]. Thereby, the schema matching systems take advantage of the explicitly defined tree structure. Second, with the advent of the Semantic Web, (2) ontology matching systems are dedicated to ontology evolution and merging, as well as semantic web service composition and matchmaking [8,9,10,11,12,13]. They are especially of use in the biological domain for aligning large taxonomies as well as for classification tasks. Finally, (3) in the context of MDA, an area in which refinement and model transformation are required, metamodel matching systems are concerned with model transformation development, with the purpose of data integration as well as metamodel evolution [14,15,16,17,18].

In this paper, we investigate 15 matching systems from the three domains of schema, ontology, and metamodel matching, showing their commonalities. We take a closer look at the matching techniques of each system, arranging the systems in an adopted classification, and present an analysis of the matching systems' data models to answer questions about their similarities and differences. This allows us to compare the matching systems and analyze transferable matching techniques, i.e., techniques from one domain that may be adopted by another. Moreover, from an overview of the state-of-the-art systems' internal data models we derive commonalities in these models and conclude with the transferability of matching techniques.

We organize our paper as follows: in Sects. 2 and 3 we introduce our approach to selecting and comparing the matching systems. In the subsequent Sect. 4 we present the classification of matching techniques and internal models and arrange the matching systems accordingly. Thereby, we highlight cross-domain matching techniques as well as the overall matching technique distribution. We conclude our paper in Sect. 5 by giving a summary and an outlook on open questions as well as future work.
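To make the notion of automatically calculated correspondences concrete, the following minimal sketch (our illustration, not the algorithm of any surveyed system) proposes correspondences between two lists of element names using plain string similarity, the simplest of the string-based techniques discussed later:

from difflib import SequenceMatcher

def propose_correspondences(left, right, threshold=0.7):
    """Propose (left, right, score) pairs whose names are similar."""
    pairs = [(a, b, SequenceMatcher(None, a.lower(), b.lower()).ratio())
             for a in left for b in right]
    return [(a, b, round(s, 2)) for a, b, s in pairs if s >= threshold]

# "CustomerName" vs. "customer_name" scores high and is proposed;
# "ZipCode" vs. "postalCode" falls below the threshold and is not.
print(propose_correspondences(["CustomerName", "ZipCode"],
                              ["customer_name", "postalCode"]))

A real matcher combines many such basic techniques and lets the user confirm or reject each suggested pair.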
2 Analysis Approach
Numerous matching systems that try to deal with the matching problem in different domains have evolved. The systems apply various matching strategies, use different internal representations of the data being matched, and apply different strategies to aggregate and to select final results. This variety of matching systems confronts the user with a difficult choice as to which system to use in which case. Aiming at an outline of commonalities and differences between the systems, we performed a systematic comparison of the applied matching techniques and of the internal data models used by the matching systems. The comparison consists of several steps:

1. Selection of matching systems. In the schema and the ontology domain alone there are more than 50 different matching systems. Therefore, we base our selection of matching systems on the following criteria:
– Matching domain. The selected systems include representatives from all three domains where the matching problem occurs, namely the schema, ontology, and metamodel domain. We group the systems according to their main domain of application. It has to be noted that several systems can be applied in more than one domain, which is addressed in Sect. 4.
– Availability/Actuality. The selected systems from all domains are systems that were either developed after 2007 or are still being actively worked on.
– Quality. The selection of the systems is based on their provided matching quality, if available. For example, in the ontology domain, where evaluation competitions exist, only the systems that give the best results were selected.
– Novel approaches. Additionally, we selected systems that represent approaches to the matching problem different from the classical one.

2. Applied matching techniques. To cover the functionality of the matching systems we also studied the matching techniques that they apply. For this purpose we adopted an existing classification of matching techniques by Euzenat and Shvaiko [19]. The classification is based on and extends the classification of automated schema matching approaches by Rahm and Bernstein, presented in [20]. It considers different aspects of the matching techniques and defines the basic groups of techniques that exist nowadays. We arrange the selected systems according to the classification, additionally pointing out the domain of application, thus showing not only the applicability of the systems in the different domains but also the main groups of matching techniques that are shared between the systems. More details about this step are presented in Sect. 4.1.

3. Classification of data models. The internal data representation, also called the internal data model or shortly the data model, of a matching system influences what kind of matching techniques can be applied by the matching system, depending on the information that the model represents.
To examine the similarities and differences of the models, we extracted the information that an internal model can provide for matching and arranged the data models of the selected systems accordingly. More information about this step is given in Sect. 4.2.
3 Selection of Matching Systems
This section describes the selection of fifteen matching systems from three domains, based on the criteria described in Sect. 2.

Schema Matching Systems. The selection in this domain is based on a survey of schema-based matching approaches [21], from which three systems applicable in the schema matching domain were selected as representatives of the group. The selected systems are COMA++ [5], Cupid [6], and the similarity flooding algorithm [4]. Another system whose main application domain is the schema domain, and which has been actively developed during the last years, is GeRoMe [7]. It applies a novel approach by using a role-based model, which was another reason to include it in this work. It has to be noted that COMA++ was extended to be applicable in the ontology domain as well. GeRoMe has a generic role model and is also applied in the ontology domain. Similarity flooding was first implemented for schema matching, but nowadays is adapted and used in systems in other domains, e.g. [10,7].

Ontology Matching Systems. The selection of matching systems that are primarily used in the ontology domain is based on their performance in the Ontology Alignment Evaluation Initiative (OAEI) contest (http://oaei.ontologymatching.org/). The goals of the OAEI are to assess the strengths and weaknesses of matching systems. The contest includes a variety of tests, ranging from a series of benchmark and anatomy tests to tests over the matching of large web directories, libraries, and very large cross-lingual resources. Since the creation of the OAEI in 2004, more than 30 different systems have taken part. For the purpose of the presented work, six systems were chosen that performed best in the benchmark tests of the 2007, 2008, 2009, and 2010 contests, namely Anchor-Flood [11], AgreementMaker [12], ASMOV [8], Lily [9], OLA2 [13], and RiMOM [10].

Metamodel Matching Systems. Although the research area of metamodel matching is developing, it does not yet have as many matching systems as the ontology and schema matching domains. It has to be noted that we consider only metamodel matching systems, and no model differencing tools as described in [22]. Five metamodel matching systems were included in the study, namely the Atlas Model Weaver [15] (extended by AML [23]), GUMM [17], MatchBox [18], ModelCVS [16] (implicitly applying COMA++), and SAMT4MDE [14].
4 Comparing Matching Techniques and Internal Models of Selected Systems
In this section we show the commonalities that matching systems from different domains share, based on the matching techniques that they apply as well as on the information that their internal models represent. We adopt an existing classification [19] of matching techniques from the ontology domain and show that systems from other domains apply, to a large extent, similar techniques. Furthermore, we examine the internal models of the matching systems and classify the information that they provide for matching. Based on this classification, we point out the similarity of the information provided by the internal models of different matching systems.
4.1 Matching Techniques in the Systems
The classification of matching techniques that we adopt [19] is taken from the ontology domain. It is based on and extends the classification of automated schema matching approaches by Rahm and Bernstein [20], further indicating the commonality of matching techniques in both domains. More details about the classification can be found in [19]. For convenience, we show a graphical representation of the classification, with some naming adjustments, in Fig. 1. Here, we use the term mapping reuse instead of alignment reuse, as we consider the term "mapping" more general than the term "alignment", which is specific to the ontology domain. We change upper-level domain-specific ontologies to the more general domain specific information, and we call model-based techniques semantically grounded techniques, since "model" can be an ambiguous term in the different domains. The classification gives a detailed overview of the different matching techniques, the information used by them, and the way it is interpreted. We arranged the selected matching systems according to the basic matching techniques, as can be seen in Tab. 1.
[Figure 1: a classification tree of meta data matching techniques. Read from the top, techniques divide into element-level and structure-level, each further divided into syntactic, external, and semantic; read from the bottom, they divide into terminological (linguistic, internal), extensional, structural (internal, relational), and semantic. The basic techniques in the middle layer are: string-based, language-based, constraint-based, linguistic resources, mapping reuse, domain specific information, data analysis and statistics, graph-based, taxonomy-based, repository of structures, and semantically grounded.]

Fig. 1. Classification of matching techniques adopted from [19]
Furthermore, the table denotes the applicability of each system in the different existing matching domains. The systems are organized in groups, depending on the primary domain in which they are applied. In each group the systems are arranged alphabetically. The upper part of Tab. 1 shows the basic matching techniques used by the selected matching systems from the schema domain. All schema matching systems [4,5,6,7] apply string-, constraint-, and taxonomy-based matching techniques. Half of the systems apply language-based techniques [5,6]. Three out of four use graph-based techniques [5,4,7]. The majority [5,6,7] also applies external linguistic resources, such as WordNet, to derive mapping results. Only one system [5] applies mapping reuse techniques.
Table 1. Basic matching techniques applied by analyzed systems in different domains of application – ontology, schema, and metamodel matching domains

[Table 1: rows 1–15 list the systems COMA++, Cupid, GeRoMe, and Similarity flooding (schema domain); Aflood, AgrMaker, ASMOV, Lily, OLA2, and RiMOM (ontology domain); and AMW, GUMM, MatchBox, ModelCVS, and SAMT4MDE (metamodel domain). Columns mark each system's use of string-based, language-based, linguistic resources, constraint-based, mapping reuse, domain specific information, data analysis and statistics, graph-based, taxonomy-based, repository of structures, and semantically grounded techniques.]
The middle part of Tab. 1 represents the classification of the selected ontology matching systems. It can be seen that all of the ontology matching systems [8,9,10,11,12,13] (i.e., those whose primary domain is the ontology domain) exploit string-based, language-based, linguistic resources, and taxonomy-based matching techniques. Almost all also apply constraint-based and graph-based techniques.
One system uses domain specific information and mapping reuse techniques to produce mappings [8]. Only two systems apply semantically grounded techniques [9,10]. The lower part of Tab. 1 shows the classification of the basic matching techniques used by matching systems in the metamodel domain. All metamodel matching systems [14,15,16,17,18] apply constraint-, graph-, and taxonomy-based techniques. The majority [15,16,17,18] also applies string-based techniques. Only one system [16] applies mapping-reuse techniques.

If we take a look at the applied matching techniques, there are several things to point out. As a direct consequence of the classification being taken from the ontology domain, there are several techniques that are used only by ontology matching systems, such as semantically grounded and data analysis and statistics techniques, as can be seen in Tab. 1. Semantically grounded techniques usually produce results based on reasoners, so these techniques are specific to the ontology domain. Using domain specific information in the form of upper-level ontologies is also applied only in the ontology domain. None of the systems apply data analysis and statistical approaches, due to a lack of appropriate object samples. Nevertheless, if such input were available, this type of matching technique could be applied in every domain. Very few systems make use of mapping reuse [5,8,16] and repository of structures [5,16,14] techniques, as these approaches are relatively recent.
Fig. 2. Portion of systems applying a matching technique in the system’s domain
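The distribution in Fig. 2 can be derived directly from the counts behind Table 1. The sketch below is our illustration and hard-codes only the schema-domain counts reported in the text above, for demonstration:

# Number of analyzed systems per domain, plus a few schema-domain
# technique counts taken from the discussion of Tab. 1 above.
systems_per_domain = {"schema": 4, "ontology": 6, "metamodel": 5}
applied = {
    ("schema", "string-based"): 4,   # COMA++, Cupid, GeRoMe, Sim. flooding
    ("schema", "graph-based"): 3,
    ("schema", "language-based"): 2,
    ("schema", "mapping reuse"): 1,  # COMA++ only
}

for (domain, technique), count in applied.items():
    portion = count / systems_per_domain[domain]
    print(f"{domain:9} {technique:15} {portion:.0%}")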
Some matching techniques are rarely used due to their recentness, a lack of appropriate input, or their specificity. These techniques show high potential for further investigation, to see how they could be adapted and reused in other domains. It is to be noted that the majority of matching techniques are applied across all three domains. Fig. 2 shows the portion of systems in each domain that apply a certain basic matching technique. As can be seen from the distribution, several matching techniques are applied by most of the systems: string-based, language-based, linguistic resources, constraint-based, taxonomy-, and graph-based techniques. The logical question that follows is: "Why exactly are these techniques common across the different domains?" We found the answer in the internal model representations of the systems: the information they expose for matching is the same across the domains.
In the following subsection we examine the information provided by an internal data model for matching and classify this information. Based on this we show the commonalities of the internal models of matching systems from different domains.
4.2 Data Models for Matching Systems
The internal data model of a matching system affects the overall capabilities of the system, as it may provide only specific information for matching and thus may influence the applicability of certain matching techniques. In order to extract the features of the information provided by an internal model to the different matching techniques, it is helpful to consider the existing classification of matching approaches. Only those basic matching techniques need to be considered that use information provided by the data model itself. All matching techniques classified as external are excluded, because they use information coming not from the internal representation but from external resources. For that reason, matching techniques from the groups of mapping reuse, linguistic resources, domain specific information, and repository of structures are not considered. Analyzing the remaining matching techniques results in a classification of the information provided by the internal models of matching systems. The classification is shown in Fig. 3. The information that internal models provide can be divided into two main groups:
[Figure 3: data model information divides into entity information (label, annotation, value) and structural information; structural information divides into internal structure (data type, cardinality, ID/key) and relational structure (inheritance, containment, association, attribute definition).]

Fig. 3. Classification of information provided by internal data model
– Entity information. This is the information that the entities of an internal data model provide to matching techniques. An entity is any class, attribute, property, relationship, or instance (individual) that is part of a model. Entities may provide textual information through their names or labels, and optionally through annotations, if available. Annotations can be considered additional documentation or meta information attached to an entity. Instance entities provide information about their values. Information coming from entities is usually exploited by terminological and extensional matching techniques, such as string-based, language-based, and data analysis and statistics matching techniques.
– Structural information. This is the information provided by the structure of an internal data model. The structural information can be divided into internal and relational structure. Information provided by the internal structure includes the data type properties of attributes, cardinality constraints, or identifiers and keys. Internal structure information is provided by the structure of the entities themselves, not considering any relation with other entities. In contrast, relational structure information is information that considers the different types of relationships between entities. This can be an inheritance, containment, or association relationship, as well as the relationship between an attribute and its containing class. Although the relationship between a class and an attribute can be considered a containment relationship in some domains, in others, such as the ontology domain, these entities are decoupled, which is why the attribute-definition type of relationship is also introduced. To exploit internal structure information, constraint-based techniques are applied, while graph- and taxonomy-based matching techniques are used to utilize relational structure information.
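As a compact summary of the two groups, the following sketch (our illustration; all field and relation names are hypothetical, not taken from any surveyed system) shows how a single element of an internal matching model might carry entity information, internal structure, and relational structure:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelElement:
    """One entity of a hypothetical internal matching model."""
    # Entity information
    label: str
    annotation: Optional[str] = None            # attached documentation
    values: list = field(default_factory=list)  # instance values, if any
    # Internal structure
    data_type: Optional[str] = None
    cardinality: Optional[str] = None           # e.g. "0..1"
    is_key: bool = False
    # Relational structure: (kind, target-label) pairs, where kind is
    # "inheritance", "containment", "association", or "attribute".
    relations: list = field(default_factory=list)

order = ModelElement(label="Order",
                     relations=[("containment", "OrderLine")])

Terminological and extensional matchers would read the first group of fields, constraint-based matchers the second, and graph- and taxonomy-based matchers the third.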
Table 2. Classification of the information, provided by internal models, used in the studied matching systems

[Table 2: rows list the internal models DLG [4-6,16,17], Ecore [14,15], Genie [24], OL-Graph [13], OWL [8-12], and Role-based model [7]; columns indicate which of the following each model provides: entity information (name/label, annotation, value), internal structure (data type, cardinality, ID/key), and relational structure (inheritance relationship, containment relationship, association relationship, definition of properties/attributes).]
Table 2 shows the different internal models that have been used in the selected matching systems and classifies them according to the information that is actually provided by each model to the matching techniques. The models are arranged alphabetically in the table. A short summary of which model is used by which system is given below.
Grey fields denote that a certain model could support this type of information, but that due to the main application domain or the applied matching techniques this information is not represented.

Directed Graphs – GUMM and the Similarity Flooding approach are the systems that use Directed Labeled Graphs (DLGs) as internal models. GUMM relies mainly on the similarity flooding approach, reusing it for metamodel matching. COMA++ (and ModelCVS, as it implicitly applies COMA++) uses a variation of DLGs, namely Directed Acyclic Graphs (DAGs), by imposing the constraint that there be no cycles within the built graph. Cupid uses a simplification of DAGs, internally representing the input as trees. The concept of a DLG is a very generic representation and can be reused with different types of data, which is why the different matching systems have their own graph representations as internal models. The minimal set of information presented by the different systems is marked with check marks in Table 2; grey fields show that the whole set of information can actually be represented by a DLG.

Ecore – SAMT4MDE and AMW are the two metamodel matching systems that do not apply specific internal data models of their own, but directly use Ecore. Applied in the area of metamodel matching, these systems operate directly over the input format of the data, namely Ecore.

Genie – MatchBox introduces its own data model, Genie (GENeric Internal modEl). The model was designed to be a generic model that covers the whole set of information for matching. For further details, see [24].

OL-Graph – OLA 2 introduces the OL-Graph model and uses it as its internal data representation. OL-Graphs are similar to directed labeled graphs. In OL-Graphs, the nodes represent different categories of ontology entities, as well as additional ones like tokens or cardinality. The OL-Graph is specifically designed to serve ontology matching.

OWL – all of the ontology matching systems except OLA 2 directly use OWL as their internal model. Anchor-Flood claims to have its own memory model, in which the lexical description of entities is normalized, but no further details are available [11]. ASMOV also claims to have a model of its own, namely a streamlined model of OWL [8], but again no further information about the model is published.

Role-based model – GeRoMe takes the approach that entities within a model do not actually have an identity on their own, but play different roles that form their identity. Thus, GeRoMe uses a role-based model as its internal data model. More information can be found in [25].

It can be seen from the analysis that most of the models cover the information used by matching techniques to a high degree. Additionally, all models allow for extension to cover the whole set of information. This shows that all models cover the same set of information that can be used for matching, independently of the application domain.
OWL and Genie are the two models that cover the full spectrum of information. Ecore provides all structural and almost all entity information, except the values of instance entities, as the Ecore model does not deal with instances. GeRoMe's role-based model also covers almost all information, except annotations. OL-Graphs cover neither annotation information nor the internal structural information about identifiers. DLGs, as applied in the different systems, do not cover the whole range of information that could possibly be provided, but it has to be noted that the concept of presenting models as graphs is very generic, and thus it is theoretically possible to represent all information from the classification.
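As an illustration of this genericity, a directed labeled graph reduces to a set of (source, label, target) edges; the toy schema below (our invented example, not taken from any surveyed system) shows how both internal and relational structure fit into this single representation:

# A toy purchase-order schema encoded as labeled edges.
edges = {
    ("PurchaseOrder", "contains", "OrderLine"),
    ("PurchaseOrder", "attribute", "orderDate"),
    ("OrderLine", "attribute", "quantity"),
    ("quantity", "hasType", "int"),
}

def neighbors(graph, node):
    """All (label, target) pairs leaving a node."""
    return sorted((lbl, dst) for (src, lbl, dst) in graph if src == node)

print(neighbors(edges, "PurchaseOrder"))
# [('attribute', 'orderDate'), ('contains', 'OrderLine')]

Structural matchers such as similarity flooding propagate similarity along exactly these labeled edges.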
5 Conclusion and Further Work
This paper presents an overview of the applied matching techniques and the internal models of fifteen state-of-the-art matching systems from three different domains of matching. The overview pointed out a set of matching techniques that are shared among systems, independently of the domain in which they are applied. This standard set of matching techniques includes string-based, language-based, linguistic resources, constraint-based, taxonomy-, and graph-based techniques. We conclude that the three analyzed domains share a lot of commonalities in the applied matching techniques. Looking into the reasons why most techniques are shared to such a large extent among the systems from the different domains, we analyzed what information the systems' internal data models provide for matching. We classified this information and pointed out that the models have a lot in common, which indicates that, although the systems were developed for different domains, the core information used for matching is the same, and the systems can benefit from knowledge sharing across the domains. A second issue we identified during our comparison is that further studies w.r.t. result quality, level of matching, and architecture across the domains are missing. Consequently, we see the following further work to be done:

1. Knowledge sharing.
(a) Transfer of matching techniques. Matching techniques such as semantically grounded techniques or a repository of structures are not applied in every domain and are thus worth investigating for transfer from one domain to another. Additionally, it is also of interest to apply a promising system of one domain in another, to see which improvements its techniques may yield.
(b) Research of matching techniques. Some techniques, e.g., mapping reuse and statistics, are not very common and thus show a lot of potential for promising future work. It would be interesting to examine these techniques more deeply.
2. Further studies.
(a) Result quality. In this work, we did not examine whether the same techniques perform with similar results, in terms of quality, in the different domains. As a direct consequence, it is necessary to develop a common platform and common test cases to assess the quality of the matching results. A similar initiative has already been started in the ontology domain, but it needs to be extended to cover all three domains. As we have shown that the models indeed cover the same information, and that the systems use the same techniques to a large extent, such an extension of the test cases should be possible.
(b) Level of matching. Furthermore, we point out that our approach is limited to meta data matching and does not consider the area of object matching. Therefore, it is worth providing an overview of this area as well.
(c) Matching system architecture and properties. In this work, we examined the internal models and the matching techniques, but we did not focus on other architectural features of the matching systems, namely how the results from different matchers are combined within a system. It would be interesting to see how different systems perform this task and whether the same similarities between the domains can be revealed in this aspect as well.

Matching in general is a very active area in all three domains of schema, ontology, and metamodel matching; thus, cooperation and the adoption of insights between domains are quite beneficial.
References
1. Halevy, A., Rajaraman, A., Ordille, J.: Data integration: The teenage years. In: VLDB 2006: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB Endowment, pp. 9–16 (2006)
2. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. The VLDB Journal 10(4), 334–350 (2001)
3. Aßmann, U., Zschaler, S., Wagner, G.: Ontologies, Meta-models, and the Model-Driven Paradigm, pp. 249–273 (2006)
4. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In: ICDE 2002: Proceedings of the 18th International Conference on Data Engineering (2002)
5. Do, H.H., Rahm, E.: COMA – a system for flexible combination of schema matching approaches. In: VLDB 2002: Proceedings of the 28th International Conference on Very Large Data Bases, VLDB Endowment, pp. 610–621 (2002)
6. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with Cupid. The VLDB Journal, 49–58 (2001)
7. Kensche, D., Quix, C., Li, X., Li, Y.: GeRoMeSuite: a system for holistic generic model management. In: VLDB 2007: Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB Endowment, pp. 1322–1325 (2007)
8. Jean-Mary, Y.R., Kabuka, M.R.: ASMOV: Results for OAEI 2010. In: OM 2010: Proceedings of the 5th International Workshop on Ontology Matching (2010)
9. Wang, P., Xu, B.: Lily: Ontology alignment results for OAEI 2009. In: OM 2009: Proceedings of the 4th International Workshop on Ontology Matching (2009)
10. Zhang, X., Zhong, Q., Li, J., Tang, J.: RiMOM results for OAEI 2010. In: OM 2010: Proceedings of the 5th International Workshop on Ontology Matching (2010)
11. Hanif, M.S., Aono, M.: Anchor-Flood: Results for OAEI 2009. In: OM 2009: Proceedings of the 4th International Workshop on Ontology Matching (2009)
12. Cruz, I.F., Antonelli, F.P., Stroe, C., Keles, U.C., Maduko, A.: Using AgreementMaker to align ontologies for OAEI 2010. In: OM 2010: Proceedings of the 5th International Workshop on Ontology Matching (2010)
13. Kengue, J.F.D., Euzenat, J., Valtchev, P.: OLA in the OAEI 2007 evaluation contest. In: OM 2007: Proceedings of the 2nd International Workshop on Ontology Matching (2007)
14. de Sousa Jr., J., Lopes, D., Claro, D.B., Abdelouahab, Z.: A step forward in semi-automatic metamodel matching: Algorithms and tool. In: Filipe, J., Cordeiro, J. (eds.) ICEIS 2009. LNBIP, vol. 24, pp. 137–148. Springer, Heidelberg (2009)
15. Fabro, M.D.D., Valduriez, P.: Semi-automatic model integration using matching transformations and weaving models. In: SAC 2007: Proceedings of the 2007 ACM Symposium on Applied Computing, pp. 963–970 (2007)
16. Kappel, G., Kargl, H., Kramler, G., Schauerhuber, A., Seidl, M., Strommer, M., Wimmer, M.: Matching metamodels with semantic systems – an experience report. In: BTW 2007: Proceedings of Datenbanksysteme in Business, Technologie und Web (March 2007)
17. Falleri, J.R., Huchard, M., Lafourcade, M., Nebut, C.: Metamodel matching for automatic model transformation generation. In: Busch, C., Ober, I., Bruel, J.-M., Uhl, A., Völter, M. (eds.) MODELS 2008. LNCS, vol. 5301, pp. 326–340. Springer, Heidelberg (2008)
18. Voigt, K., Ivanov, P., Rummler, A.: MatchBox: Combined meta-model matching for semi-automatic mapping generation. In: SAC 2010: Proceedings of the 2010 ACM Symposium on Applied Computing (2010)
19. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg (2007)
20. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. The VLDB Journal 10, 334–350 (2001)
21. Shvaiko, P., Euzenat, J.: A survey of schema-based matching approaches. Journal on Data Semantics 4, 146–171 (2005)
22. Kolovos, D.S., Ruscio, D.D., Pierantonio, A., Paige, R.F.: Different models for model matching: An analysis of approaches to support model differencing. In: CVSM 2009: Proceedings of the 2009 ICSE Workshop on Comparison and Versioning of Software Models, pp. 1–6 (2009)
23. Garcés, K., Jouault, F., Cointe, P., Bézivin, J.: Managing model adaptation by precise detection of metamodel changes. In: ECMDA 2009: Fifth European Conference on Model-Driven Architecture Foundations and Applications (2009)
24. Voigt, K., Heinze, T.: Meta-model matching based on planar graph edit distance. In: Tratt, L., Gogolla, M. (eds.) ICMT 2010. LNCS, vol. 6142, pp. 245–259. Springer, Heidelberg (2010)
25. Kensche, D., Quix, C., Chatti, M.A., Jarke, M.: GeRoMe: A generic role based metamodel for model management. Journal on Data Semantics 82 (2005)
A Framework Proposal for Ontologies Usage in Marketing Databases
Filipe Mota Pinto 1, Teresa Guarda 2, and Pedro Gago 1
1 Computer Science Department of Polytechnic Institute of Leiria, Leiria, Portugal
{fpinto,pgago}@ipleiria.pt
2 Superior Institute of Languages and Administration of Leiria, Leiria, Portugal
[email protected]

Abstract. Knowledge extraction in databases is known to be a long-term and interactive project. Nevertheless, the complexity of, and the different options for, knowledge achievement present a research opportunity that can be explored through the support of ontologies. This support may be used for knowledge sharing and reuse. This work describes research into an ontological approach for leveraging the semantic content of ontologies to improve knowledge discovery in marketing databases. Here we analyze how ontologies and the knowledge discovery process may interoperate and present our efforts to propose a possible framework for a formal integration.

Keywords: Ontologies, Marketing, Databases, Data Mining.
1 Introduction

In artificial intelligence, ontology is defined as a specification of a conceptualization [14]. Ontology specifies, at a higher level, the classes of concepts that are relevant to the domain and the relations that exist between these classes. Indeed, ontology captures the intrinsic conceptual structure of a domain. For any given domain, its ontology forms the heart of the knowledge representation.

In spite of the development and maturity of ontology-engineering tools, ontology integration in knowledge discovery projects remains rare. The Knowledge Discovery in Databases (KDD) process is comprised of different phases, such as data selection, preparation, transformation, or modeling. Each of these phases in the life cycle might benefit from an ontology-driven approach which leverages the semantic power of ontologies in order to improve the entire process [13]. Our challenge is to combine ontological engineering and the KDD process in order to improve the latter. One of the promising uses of ontologies in KDD assistance is process guidance. This research objective seems much more realistic now that semantic web advances have given rise to common standards and technologies for expressing and sharing ontologies [3].
Three main operations of KDD can take advantage of the domain knowledge embedded in ontologies. At the data understanding and data preparation phases, ontologies can facilitate the integration of heterogeneous data and guide the selection of relevant data to be mined, with regard to the domain objectives. During the modeling phase, domain knowledge allows the specification of constraints (e.g., parameter settings) for guiding data mining algorithms, e.g., by narrowing the search space. Finally, at the interpretation and evaluation phase, domain knowledge helps experts to visualize and validate extracted units.

The KDD process is usually performed by experts. They use their own knowledge for selecting the most relevant data in order to achieve the domain objectives [13]. Here we explore how an ontology and its associated knowledge base can assist the expert in the KDD process. Therefore, this document describes a research approach to leveraging the semantic content of ontologies to improve KDD.

This paper is organized as follows: after this introductory part we present related background concepts. Then we present related work in this area, followed by the presentation and discussion of the ontological assistance. The main contribution is presented in terms of ontological work, experiments, and deployment. Finally, we draw some conclusions and address further research based on this work for future KDD data environment projects.
2 Background

2.1 Predictive Model Markup Language

The Predictive Model Markup Language (PMML) is an XML-based language that provides a way for applications to define statistical and data mining models and to share these models between PMML-compliant applications (Data Mining Group). Furthermore, the language can describe some of the operations required for cleaning and transforming input data prior to modeling. Since PMML is an XML-based standard, its specification comes in the form of an XML schema that defines the following language primitives [5]: data dictionary; mining schema; transformations (normalization, categorization, value conversion, aggregation, functions); model statistics; and data mining model. A minimal skeleton illustrating these primitives is sketched at the end of this section.

2.2 Ontology Web Language

Ontologies are used to capture knowledge about some domain of interest. An ontology describes the concepts in the domain and also the relationships that hold between those concepts. Different ontology languages provide different facilities. The Ontology Web Language (OWL) is a standard ontology language from the World Wide Web Consortium (W3C). An OWL ontology consists of: individuals (representing domain objects); properties (binary relations on individuals, i.e., properties link two individuals together); and classes (interpreted as sets that contain individuals). Moreover, OWL enables the inclusion of expressions to represent logical formulas in the Semantic Web Rule Language (SWRL) [16]. SWRL is a rule language that combines OWL with the rule markup language, providing a rule language compatible with OWL.
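As an illustration of the PMML primitives of Sect. 2.1, the following sketch (ours; the field and model names are invented, and a schema-valid PMML document would additionally require the model body, e.g., the tree nodes) builds a minimal PMML skeleton with Python's standard library:

import xml.etree.ElementTree as ET

pmml = ET.Element("PMML", version="4.1",
                  xmlns="http://www.dmg.org/PMML-4_1")
# Data dictionary: the fields available to the model.
dd = ET.SubElement(pmml, "DataDictionary", numberOfFields="2")
ET.SubElement(dd, "DataField", name="age", optype="continuous",
              dataType="double")
ET.SubElement(dd, "DataField", name="response", optype="categorical",
              dataType="string")
# A model stub with its mining schema (which fields it uses and how).
model = ET.SubElement(pmml, "TreeModel", modelName="campaign",
                      functionName="classification")
schema = ET.SubElement(model, "MiningSchema")
ET.SubElement(schema, "MiningField", name="age")
ET.SubElement(schema, "MiningField", name="response",
              usageType="predicted")
print(ET.tostring(pmml, encoding="unicode"))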
2.3 Semantic Web Rule Language To the best of our knowledge there is no standard OWL-based query language. Several RDF-based query languages exist, but they do not capture the full semantic richness of OWL. To tackle this problem, a set of built-in libraries was developed for the Semantic Web Rule Language (SWRL) that allows it to be used as a query language. OWL is a very useful means for capturing the basic classes and properties relevant to a domain. These domain ontologies establish a language of discourse for eliciting more complex domain knowledge from subject specialists. However, due to the nature of OWL, these more complex knowledge structures are either not easily represented in OWL or, in many cases, are not representable in OWL at all. The classic example of such a case is the relationship uncleOf(X,Y). This relation, and many others like it, requires the ability to constrain the value of a property (brotherOf) of one term (X) to be the value of a property (childOf) of the other term (Y); in other words, the brotherOf property applied to X (i.e., brotherOf(X,Z)) must produce a result Z that is also a value of the childOf property when applied to Y (i.e., childOf(Y,Z)). This "joining" of relations is outside the representation power of OWL. One way to represent knowledge requiring joins of this sort is through the use of the implication (→) and conjunction (AND) operators found in rule-based languages (e.g., SWRL). The rule for the uncleOf relationship appears as follows: brotherOf(X,Z) AND childOf(Y,Z) → uncleOf(X,Y)
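To make the "join" concrete, the following Python sketch evaluates the rule over a small fact base; the facts and the naive join are purely illustrative of what a SWRL engine does internally.

# Evaluating brotherOf(X,Z) AND childOf(Y,Z) -> uncleOf(X,Y) over made-up facts.
# The shared variable Z is exactly the join that plain OWL cannot express.
brother_of = {("bob", "ann"), ("tim", "joe")}   # (X, Z): X is a brother of Z
child_of = {("eva", "ann"), ("sam", "joe")}     # (Y, Z): Y is a child of Z

uncle_of = {(x, y)
            for (x, z1) in brother_of
            for (y, z2) in child_of
            if z1 == z2}                        # join on Z

print(uncle_of)   # {('bob', 'eva'), ('tim', 'sam')}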
3 Related Work A KDD assistance through ontologies should provide users with nontrivial, personalized "catalogs" of valid KDD processes, tailored to their task at hand, and help them choose among these processes in order to analyze their data. In spite of the increasing research on integrating domain knowledge, by means of ontologies, into KDD, most approaches focus mainly on the DM phase of the KDD process [2] [3] [8], while the role of ontologies in the other phases of KDD has apparently been neglected. Currently, other approaches to ontology and KDD integration are being investigated, such as ONTO4KDD [13] or AXIS [25]. The literature contains several knowledge discovery life cycles, which mostly reflect the background of their proponents' community, such as databases, artificial intelligence, decision support, or information systems [12]. Although the scientific community is addressing the improvement of KDD through ontologies, to the best of our knowledge there is at present no fully successful integration of the two. This research adopts an overall perspective, from business understanding to knowledge acquisition and evaluation. Moreover, it focuses on the selection, supported by the ontology, of the best-fitting modeling strategy within the KDD process.
4 Ontological Work This research work is part of a much larger project: database marketing intelligence supported by ontologies and knowledge discovery in databases. Since this paper focuses on the ontological assistance to the KDD process, we mainly address this research domain. In order to develop our data preparation ontology we used the METHONTOLOGY methodology [12][10][4]. This methodology best fits our project approach, since it proposes an evolving prototyping life cycle composed of development-oriented activities. 4.1 Ontology Construction Through an exhaustive literature review we obtained a set of domain concepts, and relations between them, that describe the KDD process. Following METHONTOLOGY, we constructed our ontology in terms of its process assistance role. Domain concepts and relations were introduced according to literature directives [4][24]. Moreover, in order to formalize all the related knowledge we used relevant published work on KDD [1] [21] and on ontologies [17] [18]. Whenever some vocabulary is missing, it is possible to apply a research method (e.g., the Delphi method [7] [6] [19] [20]) in order to obtain such a domain knowledge thesaurus. At the end of the first step of METHONTOLOGY we identified the main classes shown in Figure 1. Our KDD ontology has three major classes: 1. Resource relates all resources needed to carry out the extraction process, namely algorithms and data. 2. ProcessPhase is the central class, which uses resources (Resource class) and produces results (ResultModel class). 3. ResultModel is in charge of relating each KDD process instance, describing all resources used, all tasks performed and the results achieved in terms of model evaluation and domain evaluation. Analysing the entire KDD process, we considered four main concepts below the ProcessPhase concept (OWL class): Data Understanding, Data Preprocessing, Modeling and Evaluation. In order to optimize effort we introduced some tested concepts from another data mining ontology (DMO) [17], which has a similar knowledge base taxonomy. Here we take advantage of an explicit ontology of data mining and standards, using OWL concepts to describe an abstract semantic service for DM and its main operations. In the DMO, for simplicity reasons, there are two defined types of DM elements, settings and results, which in our case correspond to the Algorithm and Data classes. The settings represent inputs for the DM tasks, while the results represent the outputs produced by these tasks.
Fig. 1. KDD ontology class taxonomy (partial view)
There is no difference between inputs and outputs, because an output of one process can be used, at the same time, as an input of another process. Thus, we represented the above concept hierarchy in OWL, using the Protégé OWL tool. …
Following METHONTOLOGY, the next step is to create the domain-specific core ontology, focusing on knowledge acquisition. To this end we performed several data processing tasks and data mining operations, and also evaluated some models. Each class belongs to a hierarchy, and each class may have relations to other classes (e.g., PersonalType is a subclass of InformationType). In order to formalize this schema we defined OWL properties for the class relationships, generally represented as Modeling ∧ hasAlgorithm(algorithm) and encoded in OWL. Whenever a new attribute is presented to the ontology, it is evaluated in terms of the attribute class hierarchy and of the related properties that act upon it. In our ontology, Attribute is defined by a set of three descriptive items: InformationType, StructureType and the allocated Source. It is therefore possible to infer that Attribute is a subclass of Thing, described as a union of InformationType, StructureType and Source. For instance: StructureType(Date) → hasMissingValueTask ∧ hasOutliersTask ∧ hasAttributeDerive
Attribute InformationType(Personal) ∧ Attribute PersonalType(Demographics) → hasCheckConsistency
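A minimal Python sketch of this attribute-driven inference; the rule tables below only mirror the two example rules above, and the representation is an assumption rather than the paper's actual OWL encoding.

# A new attribute's descriptors (InformationType, StructureType) select the
# preprocessing tasks the ontology suggests for it.
TASKS_BY_STRUCTURE = {
    "Date": ["MissingValueTask", "OutliersTask", "AttributeDerive"],
}
TASKS_BY_INFORMATION = {
    ("Personal", "Demographics"): ["CheckConsistency"],
}

def suggested_tasks(info_type, subtype, structure_type):
    tasks = list(TASKS_BY_STRUCTURE.get(structure_type, []))
    tasks += TASKS_BY_INFORMATION.get((info_type, subtype), [])
    return tasks

print(suggested_tasks("Personal", "Demographics", "Date"))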
5 Proposed Framework One of the promising interests of ontologies is the common understanding they provide for sharing and reuse. We therefore exploited this characteristic to effectively assist the KDD process. This research provides KDD assistance at two levels: overall process assistance, based on the ResultModel class, and KDD phase assistance. Since our ontology has a formal structure related to the KDD process, it is able to infer results at each phase.
Fig. 2. KDD ontological assistance sequence diagram
To this end, the user needs to invoke the system rule engine (reasoner) with the relevant information, e.g., for the data preprocessing task: swrl:query hasDataPreprocessingTask(?dpp, "ds"), where hasDataPreprocessingTask is an OWL property that infers from the ontology all the preprocessing tasks (dpp) assigned to each attribute type within the data set "ds". Moreover, the user is also assisted in terms of the ontology capability index, through the precision, recall and PRI metrics. Once a set of executed KDD processes is registered in the knowledge base, whenever a new KDD process starts the ontology may support the user at the different KDD phases. As an example, for a new classification process the user's interaction with the ontology follows the framework depicted in Figure 2. The ontology leads the user's efforts towards the knowledge extraction by suggesting according to context; that is, the ontology acts according to the user's question, e.g., given a domain objective definition (presented by the user), the ontology infers which types of objectives it knows. All inference work depends on previously loaded knowledge. Hence there is an ontology limitation: it can only assist KDD processes that share some characteristics with processes already registered.
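The following hedged Python sketch mimics this interaction: the knowledge-base layout and task names are invented, and the two functions only stand in for what the swrl:query above retrieves and for the precision/recall capability index.

# Plain-Python stand-in for the reasoner query and the capability index.
KB = {
    ("ds", "Date"): ["MissingValueTask", "OutliersTask"],
    ("ds", "Demographics"): ["CheckConsistency"],
}

def has_data_preprocessing_task(dataset):
    # Mimics swrl:query hasDataPreprocessingTask(?dpp, dataset).
    return {attr: tasks for (ds, attr), tasks in KB.items() if ds == dataset}

def precision_recall(suggested, performed):
    # Capability index of the ontology against what the expert actually did.
    s, p = set(suggested), set(performed)
    precision = len(s & p) / len(s) if s else 0.0
    recall = len(s & p) / len(p) if p else 0.0
    return precision, recall

print(has_data_preprocessing_task("ds"))
print(precision_recall(["MissingValueTask", "OutliersTask"], ["OutliersTask"]))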
6 Experiments Our system prototype follows the general KDD framework [9] and uses the ontology to assist each user interaction, as depicted in Figure 2. Our experimentation was developed over a real oil company fidelity-card marketing database, whose main tables are card owner, card transactions and fuel station. To carry this out we developed an initial set of SWRL rules. Since KDD is an interactive process, these rules operate at both the user and the ontological level. The logic captured by these rules is presented in this section using an abstract SWRL representation, in which variables are prefaced with question marks. The experimental setting was: domain objective: customer profile; modeling objective: description; initial database: fuel fidelity card; database structure: 4 tables. The most relevant rule extracted from the use of the above data and algorithms was of the form if (age …) → Evaluation(?m, ?ev); each evaluation depends on, e.g., the model type or the algorithms used. The record inserted into the knowledge base was: INSERT record KNOWLEDGE BASE: hasAlgorithm(J48) AND hasModelingObjectiveType(classification) AND hasAlgorithmWorkingData({idCard; age; carClientGap; civilStatus; sex; vehicleType; vehicleAge; nTransactions; tLiters; tAmountFuel; tQtdShop; 1stUsed; 2stUsed; 3stUsed}) AND Evaluation(67.41%; 95.5%) AND hasResultModel(J48; classification; "wds"; PCC; 0.84; 0.29) Once the evaluation is performed, the system automatically updates the knowledge base with a new record. The registered information serves future use, namely knowledge sharing and reuse. Moreover, the ontology is also being evaluated through the precision and recall indexes.
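A sketch of this knowledge-base update in Python; the record layout mirrors the INSERT above, but the dictionary keys and the labels of the two evaluation figures are assumptions.

# Each evaluated run is appended to the knowledge base for later reuse.
kb_runs = []

def register_run(algorithm, objective_type, working_data, evaluation, result):
    kb_runs.append({
        "hasAlgorithm": algorithm,
        "hasModelingObjectiveType": objective_type,
        "hasAlgorithmWorkingData": working_data,
        "Evaluation": evaluation,          # the two percentages reported above
        "hasResultModel": result,
    })

register_run("J48", "classification",
             ["idCard", "age", "civilStatus", "sex", "vehicleType"],
             (0.6741, 0.955),
             {"data": "wds", "metric": "PCC", "values": (0.84, 0.29)})
print(len(kb_runs), "run(s) registered")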
8 Conclusions and Further Research This work strove to improve the KDD process with the support of ontologies. To this end, we used a general domain ontology to assist knowledge extraction from databases through the KDD process. KDD success is very much user dependent. Through our framework, it is possible to suggest a valid set of tasks that best fits the KDD process design. However, the capability to automatically run the data, develop modeling approaches and apply algorithms is still missing. Nevertheless, there are four main operations of KDD that can take advantage of domain knowledge embedded in ontologies: during the data preparation phase; during the mining step; during the deployment phase; and, with the knowledge base, the ontology may help the analyst choose the best modeling approach based on the knowledge base ranking index.
Future research work will be devoted to expanding the use of the KDD ontology by populating the knowledge base with more relevant concepts about the process. Another interesting direction is to represent the whole knowledge base in a form that allows its automatic reuse.
References
1. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD 1993, pp. 207–216 (1993)
2. Anand, S.S., Grobelnik, M., Herrmann, F., Lingenfelder, N., Wettschereck, D.: Knowledge discovery standards. Artificial Intelligence Review 27(1), 21–56 (2007)
3. Bernstein, A., Provost, F., Hill, S.: Toward intelligent assistance for a data mining process. IEEE Transactions on Knowledge and Data Engineering 17(4) (2005)
4. Blazquez, M., Fernandez, M., Gomez-Perez, A.: Building ontologies at the knowledge level. In: Knowledge Acquisition Workshop, Banff, Alberta, Canada (1998)
5. Brezany, P., Janciak, I., Tjoa, A.M.: Data Mining with Ontologies: Implementations, Findings, and Frameworks. Information Science, 182–210 (2008)
6. Chu, H.-C., Hwang, G.-J.: A delphi-based approach to developing expert systems with the cooperation of multiple experts. Expert Systems with Applications 34, 2826–2840 (2008)
7. Delbecq, A.L., Van de Ven, A.H., Gustafson, D.H.: Group Techniques for Program Planning: A Guide to Nominal Group and Delphi Processes. Scott, Foresman (1975)
8. Domingos, P.: Prospects and challenges for multi-relational data mining. SIGKDD Explorations Newsletter 5(1), 80–83 (2003)
9. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery in databases. AI Magazine 17(3), 37–54 (1996)
10. Fernandez, M., Gomez-Perez, A., Juristo, N.: Methontology: From ontological art towards ontological engineering. Technical report, AAAI (1997)
11. Gomez-Perez, A., Fernandez-Lopez, M., Corcho, O.: Ontological Engineering, 2nd edn. Springer, Heidelberg (2004)
12. Gottgtroy, P., Kasabov, N., MacDonell, S.: An ontology driven approach for knowledge discovery in biomedicine (2004)
13. Gruber, T.R.: A translation approach to portable ontology specifications. Knowledge Acquisition 5, 199–220 (1993)
14. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2001)
15. Horrocks, I., Patel-Schneider, P., Grosof, B., Dean, M.: SWRL: A Semantic Web Rule Language combining OWL and RuleML. Technical report, W3C (2004)
16. Nigro, H.O., Cisaro, S.G., Xodo, D.: Data Mining with Ontologies: Implementations, Findings and Frameworks. Information Science Reference (2008)
17. Phillips, J., Buchanan, B.G.: Ontology-guided knowledge discovery in databases. In: International Conference on Knowledge Capture, pp. 123–130. ACM (2001)
18. Pinto, F.M., Gago, P., Santos, M.F.: Marketing database knowledge extraction. In: IEEE 13th International Conference on Intelligent Engineering Systems (2009)
19. Pinto, F.M., Marques, A., Santos, M.F.: Database marketing process supported by ontologies. In: Filipe, J., Cordeiro, J. (eds.) ICEIS 2009. LNBIP, vol. 24. Springer, Heidelberg (2009)
20. Quinlan, J.R.: Induction of decision trees. Machine Learning 1, 81–106 (1986)
21. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Analysis of recommendation algorithms for e-commerce. In: Proc. 2nd ACM Conference on Electronic Commerce (2000)
22. Seaborne, A.: RDQL - a query language for RDF. Technical report, W3C (2004)
23. Smith, R.G., Farquhar, A.: The road ahead for knowledge management: An AI perspective. AI Magazine 21(4), 17–40 (2000)
24. da Silva, A., Lechevallier, Y.: AXIS Tool for Web Usage Evolving Data Analysis (ATWEDA). INRIA, France (2009)
Proposed Approach for Evaluating the Quality of Topic Maps Nebrasse Ellouze1,2, Elisabeth Métais1, and Nadira Lammari1 1 Laboratoire Cedric, CNAM 292 rue Saint Martin, 75141 Paris cedex 3, France
[email protected], {metais,lammari}@cnam.fr 2 Ecole Nationale des Sciences de l’Informatique, Laboratoire RIADI Université de la Manouba, 1010 La Manouba
[email protected] Abstract. Topic Maps are used for structuring content and knowledge coming from different information sources and different languages. They are defined as semantic structures which allow organizing all the subjects they represent, and they are intended to enhance navigation and improve information search in these resources. In this paper, we propose to study the quality of Topic Maps. Topic Map quality covers various aspects: some are common with conceptual schemas, others are common with information retrieval systems, and some are specific to Topic Maps. In this paper, we limit our work to the quality aspect related to the volume of the Topic Map. Topic Maps are usually very large, since they can contain thousands of Topics and associations. This large volume of information and complexity can lead to a badly organized Topic Map, making search through the Topic Map structure very difficult, so that users cannot easily find what they want. In this context, to manage the volume of the Topic Map, we propose a dynamic pruning method applied when the Topic Map is displayed, based on a list of meta-properties associated with each topic. The first meta-property represents the Topic score, which reflects its relevance over time; the second indicates the level to which the Topic belongs in the Topic Map. Keywords: Topic Map (TM), quality, meta-properties, Topic Map visualization.
1 Introduction Topic Maps [1] are used for structuring content and knowledge coming from different information sources and different languages. They are defined as semantic structures which allow organizing all the subjects they represent, and they are intended to enhance navigation and improve information search in these resources. In our previous work, we defined CITOM [2], an incremental approach to build a multilingual Topic Map from textual documents, and we validated the approach on a real corpus from the sustainable construction domain [3].
In this paper, we propose to study the quality of the generated Topic Map. Topic Map quality covers various aspects: some are common with conceptual schemas, others are common with information retrieval systems, and some are specific to Topic Maps. Here we limit our work to the quality aspect related to the volume of the Topic Map. Topic Maps are usually very large, since they can contain thousands of Topics and associations. This large volume of information and complexity can lead to a badly organized Topic Map, making search through the Topic Map structure very difficult, so that users cannot easily find what they want. In this context, to manage the volume of the Topic Map, we propose a dynamic pruning method applied when the Topic Map is displayed, based on a list of meta-properties associated with each topic. The first meta-property represents the Topic score, which reflects its relevance over time; the second indicates the level to which the Topic belongs in the Topic Map. This paper is structured as follows. In Section 2, we present a brief state of the art on Topic Map quality management. Section 3 is devoted to the presentation of our approach for Topic Map quality evaluation. Finally, in Section 4, we conclude and give some perspectives for this work.
2 Proposals for Topic Maps Quality Management Quality is considered an integral part of every information system, especially with the volume of data continuously increasing over time and the diversity of applications. Many research works have been proposed in this domain, concerning different aspects of quality: quality of data (for example the European project DWQ, Data Warehouse Quality, proposed by [4]), quality of conceptual models (such as the QUADRIS project, Quality of Data and Multi-Source Information Systems, proposed by [5], which aims at defining a framework for evaluating the quality of multi-source information systems), quality of the development process, quality of data treatment processes, quality of business processes, etc. The study of Topic Map quality should a priori consider the many works on the quality of ontologies and conceptual models. Within the context of this paper, we discuss the quality of Topic Maps, the related literature, our approach to evaluate the quality of Topic Maps, and future directions. Based on the literature, we note that very few works [6], [7], [8], [9], [10] have been proposed to evaluate the quality of Topic Maps. We classify the existing approaches into two classes: those interested in evaluating the quality of the Topic Map representation, and those that evaluate the quality of search through the Topic Map. 2.1 Proposed Approaches to Manage Quality of Topic Map Representation In this class of approaches, we cite for example the method presented in [6], which uses representation and visualization techniques to enhance users' navigation and make the exploration of the Topic Map easier. These techniques consist of filtering and clustering data in the Topic Map using conceptual
classification algorithms based on Formal Concept Analysis and Galois lattices. In their work, the authors of [6] also provide representation and navigation techniques to facilitate the exploration of the Topic Map: the idea is to represent Topic Maps as virtual cities in which users can move and navigate to create their own cognitive map. The authors of [7] propose to use Topic Maps for visualizing heterogeneous data sources. Their approach aims at improving the display of Topic Maps, given the diversity and large volume of the information they represent. The idea is to use the notions of cluster and sector with the TM Viewer tool (Ontologies for Education Group: http://iiscs.wssu.edu/o4e/). The whole Topic Map is visualized at different levels so that users can manage the large number of Topics. This project was inspired by the work proposed in [8], which implements a tool called TopiMaker for viewing Topic Maps in a 3D environment and at several levels. 2.2 Proposed Approaches to Manage Quality of Search through the Topic Map In this class of approaches, we cite for example the method presented in [9], which addresses search performance using Topic Maps. This method evaluates a web application, based on the Topic Maps model and developed for information retrieval, which was implemented and tested in the field of education. The authors of [9] compare this application with a traditional search engine using recall and precision measures computed for both tools. In addition, their evaluation process takes into account the points of view of students and teachers who tested both systems. The comparison showed that information search based on Topic Maps gives better results than the search engine. The same idea is adopted in [10] for evaluating a search system based on Topic Maps: the purpose of the study is to compare the performance of a Topic Maps-based Korean folk music (Pansori) retrieval system against a representative current Pansori retrieval system. The study is an experimental effort using representative general users. Participants are asked to carry out several predefined tasks and their own queries. The authors propose objective measures (such as the time taken by the system to find the searched information) and subjective measures (such as completeness, ease of use, efficiency, satisfaction, appropriateness, etc.) to evaluate the performance of the two systems.
3 Our Approach to Manage the Volume of the Topic Map Based on the state of the art on the quality of ontologies, conceptual models and Topic Maps, we notice that Topic Map quality has not been studied nearly as much as ontology and conceptual schema quality. Moreover, the notion of Topic Map quality is not the same as ontology or conceptual schema quality, because of the differences between these models. Indeed, Topic Maps are meant to be used directly by users and reflect the content of documents, while an ontology is a formal and explicit specification of a domain that allows information exchange between applications.
In our work, we are interested in the quality of the Topic Map representation. In fact, one of the major problems related to Topic Map quality is that the generated Topic Map is usually very large and contains a huge amount of information (thousands of Topics and associations). This large volume of information and complexity can lead to a badly organized Topic Map and to many difficulties for users when they try to search for information using it. This can be explained by the fact that a Topic Map, as designed, is a usage-oriented semantic structure: it should represent the different views and visions of the subjects of the studied domain according to the various classes of users that might be interested in the Topic Map content. Because of this large amount of information, it is difficult for users, especially those who are not experts in the studied domain, to find what they are looking for in reasonable time. One of the specificities of Topic Maps with regard to ontologies and conceptual schemas is the preponderance of the volume problem, because they are intended to be viewed and used directly by the user. In our approach, to manage the large volume of a Topic Map, we propose a dynamic pruning method applied when the Topic Map is displayed, based on a list of meta-properties associated with topics. The Topic Map pruning process is a central issue in our work, since a Topic Map is essentially used to organize the content of documents and to help users find relevant information in these documents. It is therefore necessary to maintain and enrich the Topic Map structure over time in order to satisfy users' queries and handle possible changes in the document content. To maintain the Topic Map, we introduce some information, which we call "meta-properties", about a Topic's relevance according to its usage when users explore the Topic Map. This information can be exploited to evaluate the quality of a Topic Map. 3.1 Topics Notation In our previous works [2], [3], we proposed to extend the TM-UML model by adding a list of meta-properties to the Topic characteristics. So far, we have defined two meta-properties. The first one reflects a Topic's pertinence over time. It is initialized when the Topic Map is created and reflects the Topic's relevance according to its usage by Topic Map users. It is also exploited in the Topic Map pruning process, especially to delete the Topics considered non-pertinent when the Topic Map is displayed. The second meta-property allows implementing different layers in the Topic Map. Meta-property 1, Topic relevance in the Topic Map: We define a score (or level) for each Topic as a meta-property which reflects its importance in the Topic Map. As shown in Figure 1, the score is initialized when the Topic Map is created. It can be (a) very good, when the Topic is obtained from three information sources (documents, thesaurus and requests); (b) good, when the Topic is extracted from two information sources; or (c) not very good, when the Topic is extracted from one source. These qualities have to be translated into a mark between 0 and 1 in order to allow a pruning process in the visualization of the Topic Map: only Topics having a score greater than the required level are displayed.
Fig. 1. Score initialization
During the life of the Topic Map, the mark is computed to reflect the popularity level of each Topic. For this purpose the mark is a weighted average of different criteria: the number of documents indexed by the Topic (DN), the number of FAQs referring to the Topic (FN) and the number of consultations of the Topic (CN). The formula is

$$score = \frac{\alpha \cdot DN + \beta \cdot FN + \gamma \cdot CN}{\alpha + \beta + \gamma}$$

The weights are parameters; however, we suggest setting γ greater than α and β in order to better reflect the usage of the Topic. Meta-property 2, the level to which the Topic belongs in the Topic Map: The second meta-property allows implementing different layers in the Topic Map. Our idea is to classify and organize the information (Topics, links and resources) in the Topic Map into three levels [3]: (1) the upper level contains "Topic themes", obtained as a result of the thematic segmentation process applied to the source documents, and "Topic questions", extracted from user requests; (2) the intermediate level contains domain concepts, Topic instances, subtopics, Topic synonyms, synonyms of Topic instances, etc.; and (3) the third level contains the resources used to build and enrich the Topic Map, i.e., the textual documents available in different languages, their thematic fragments, and all the possible questioning sources related to these documents. We exploit this meta-property to organize the Topic Map in order to enhance navigation and facilitate search through its links. 3.2 Analysis of Notes We also introduce meta-meta-data attached to the scores in order to store their evolution profile and thus automatically update them, anticipating their popularity level. Indeed, the number of consultations of a Topic varies over time for several reasons, for example seasonal variations: in summer the "air conditioning" Topic is frequently consulted, while in winter people are more concerned with "heating devices". In this case the mark associated with "air conditioning" increases in summer and decreases in winter, and conversely for the "heating devices" Topic. Another example of score variation concerns news items, such as a plane crash: the Topic reaches its maximal level of popularity very quickly, and then its popularity continually decreases until almost nobody is interested in the event anymore. We therefore add meta-meta-data capturing the type of the Topic's score evolution (season dependent, time dependent, decreasing, increasing, etc.) in order to anticipate the score of a Topic and dynamically manage the pruning process when displaying the Topic Map.
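A minimal Python sketch of this scoring scheme, assuming the three criteria are normalized to [0, 1] so that the mark remains a value between 0 and 1; the seasonal lookup merely illustrates the evolution types listed above.

# Weighted-average mark with gamma highest, as suggested in the text.
def topic_score(dn, fn, cn, alpha=1.0, beta=1.0, gamma=2.0):
    return (alpha * dn + beta * fn + gamma * cn) / (alpha + beta + gamma)

# Season-dependent anticipation: pick the statistics of the matching season.
def anticipated_score(topic, season, stats):
    dn, fn, cn = stats[(topic, season)]
    return topic_score(dn, fn, cn)

stats = {("air conditioning", "summer"): (0.4, 0.2, 0.9),
         ("air conditioning", "winter"): (0.4, 0.2, 0.1)}
print(anticipated_score("air conditioning", "summer", stats))  # 0.6
print(anticipated_score("air conditioning", "winter", stats))  # 0.2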
3.3 Using Meta-properties to Improve Topic Map Visualization The main goal of Topic Maps is to enable users to find relevant information and to access the content of the source documents. Thus, there are two kinds of requirements for Topic Map visualization: representation and navigation. A good representation helps users identify interesting spots, whereas an efficient navigation is essential to access information rapidly. In our approach, we use the two meta-properties defined above (the Topic score and the Topic level) in our dynamic pruning process when we visualize the Topic Map, in order to facilitate access to documents through it. We use the Topic level meta-property to improve the visualization by organizing the Topic Map into three levels: the first contains Topic themes and questions; the second contains Topics representing domain concepts, Topic instances and, possibly, Topic answers (which may also belong to the first level); and finally the resource level contains the documents and their fragments. This organization provides users with different levels of detail about the content of the Topic Map and allows them to move from one level to another according to their browsing choices. Initially, we display only the first-level Topics, i.e., Topic questions and Topic themes. Then, while navigating, a user interested in a particular subject or theme can continue the search and browse the sub-tree of the Topic Map containing all the domain-concept Topics related to the chosen theme, and can also access the documents and segments associated with each Topic. In this way, users build their own cognitive map containing the information that interests them (depending on the parts they have visited). Figure 2 shows an example of multilevel Topic Map visualization generated with our application [3]. The application offers highlighting: whenever a Topic Map node is selected, it is highlighted, showing the part of the Topic Map related to it. More space is allocated to the focus node, while the parent and children, still in the immediate visual context, appear slightly smaller; the grandparents and grandchildren are still visible but come out even smaller. In this way, Topic Map visualization facilitates exploration and information search through the Topic Map. In addition to the multi-level visualization, the scores assigned to each Topic are also used as selection criteria for visualizing the Topic Map. A Topic with a very good score is considered a main Topic and is part of the default visualization. The idea is that, instead of definitively deleting a Topic because it is rarely used, we prefer to simply lower its score. Indeed, a Topic may be the target of very few queries in one season and come back among the most frequently asked Topics the next season (e.g., many questions concern air conditioners in summer, but only very few in winter), whereas some Topics, generally concerning items in the news, definitively decrease in importance. We define a rule for displaying Topics: only Topics with a score above a threshold are displayed by default. This threshold is a parameter; in our case we set it at 0.5. The other Topics are displayed in gray, but the user can still view them.
For example, depending on the season, some Topics are gray (such as "air conditioning" in winter) while others are displayed by default (e.g., "air conditioning" in summer).
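A hedged sketch of this display rule; the topic names and scores are illustrative.

# Topics scoring above the threshold are shown by default, the rest greyed.
THRESHOLD = 0.5

def display_mode(topic_scores, threshold=THRESHOLD):
    return {t: ("default" if s > threshold else "gray")
            for t, s in topic_scores.items()}

summer = {"air conditioning": 0.8, "heating devices": 0.2}
print(display_mode(summer))  # air conditioning shown, heating devices greyed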
Fig. 2. An example of Topic Map visualization with our developed tool [3]
4 Conclusion and Future Work In this paper, we presented an approach to evaluate the quality of Topic Maps. One of the specificities of Topic Maps with regard to ontologies and conceptual models is the problem of volume; consequently, the main goal of our approach is to manage the large number of topics and associations in Topic Maps. To solve this problem, we proposed a dynamic pruning process applied when the Topic Map is displayed, based on a list of meta-properties associated with each topic. The first meta-property represents the Topic score, which indicates the Topic's pertinence and usage when users explore the Topic Map; it is used to manage the Topic Map evolution, especially to prune the topics considered non-relevant. The second meta-property indicates the level to which the Topic belongs in the Topic Map, organized in three levels according to the meta-model defined in our previous works [3]. We use these meta-properties to improve the Topic Map visualization in order to enhance users' navigation and understanding of the Topic Map content. In future work, we will discuss Topic Map quality criteria in more detail, in order to identify an exhaustive list of meta-properties that help manage the Topic Map in the evolution process.
References
1. ISO/IEC 13250: Topic Maps: Information technology - document description and markup languages (2000), http://www.y12.doe.gov/sgml/sc34/document/0129.pdf
2. Ellouze, N., Lammari, N., Métais, E., Ben Ahmed, M.: CITOM: Incremental Construction of Topic Maps. In: Horacek, H., Métais, E., Muñoz, R., Wolska, M. (eds.) NLDB 2009. LNCS, vol. 5723, pp. 49–61. Springer, Heidelberg (2010)
3. Ellouze, N.: Approche de recherche intelligente fondée sur le modèle des Topic Maps. PhD thesis, Conservatoire National des Arts et Métiers, Paris, France (December 2010)
4. Jarke, M., Lenzerini, M., Vassiliou, Y., Vassiliadis, P.: Fundamentals of Data Warehouses. Springer, Heidelberg (2000) ISBN 3-540-65365-1
5. Akoka, J., Berti-Équille, L., Boucelma, O., Bouzeghoub, M., Comyn-Wattiau, I., Cosquer, M., Goasdoué, V., Kedad, Z., Nugier, S., Peralta, V., Sisaïd-Cherfi, S.: A Framework for Quality Evaluation in Data Integration Systems. In: Proceedings of the 9th International Conference on Enterprise Information Systems (ICEIS 2007), pp. 170–175 (2007)
6. Legrand, B., Michel, S.: Visualisation exploratoire, généricité, exhaustivité et facteur d'échelle. In: Numéro spécial de la revue RNTI, Visualisation et extraction des connaissances (March 2006)
7. Godehardt, E., Bhatti, N.: Using Topic Maps for Visually Exploring Various Data Sources in a Web-Based Environment. In: Maicher, L., Garshol, L.M. (eds.) TMRA 2007. LNCS (LNAI), vol. 4999, pp. 51–56. Springer, Heidelberg (2008)
8. Weerdt, D.D., Pinchuk, R., Aked, R., Orus, J.J., Fontaine, B.: TopiMaker - An Implementation of a Novel Topic Maps Visualization. In: Maicher, L., Sigel, A., Garshol, L.M. (eds.) TMRA 2006. LNCS (LNAI), vol. 4438, pp. 32–43. Springer, Heidelberg (2007)
9. Dicks, D., Venkatesh, V., Shaw, S., Lowerison, G., Zhang, D.: An Empirical Evaluation of Topic Map Search Capabilities in an Educational Context. In: Cantoni, L., McLoughlin, C. (eds.) Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications 2004, pp. 1031–1038 (2004)
10. Gyun, O.S., Park, O.: Design and Users' Evaluation of a Topic Map-Based Korean Folk Music Retrieval System. In: Maicher, L., Sigel, A., Garshol, L.M. (eds.) TMRA 2006. LNCS (LNAI), vol. 4438, pp. 74–89. Springer, Heidelberg (2007)
BH : Behavioral Handling to Enhance Powerfully and Usefully the Dynamic Semantic Web Services Composition Mansour Mekour and Sidi Mohammed Benslimane Djilali Liabes University - Sidi Bel Abbes, Computer Science Department, Evolutionary Engineering and Distributed Information Systems Laboratory(EEDIS)
[email protected],
[email protected] Abstract. Service composition enables users to realize their complex needs as a single request, and it has been recognized as a flexible way of resource sharing and application integration since the appearance of the Service-Oriented Architecture. Many researchers have proposed approaches for dynamic service composition. In this paper we focus on behaviour-driven dynamic service composition, and more precisely on process integration and interleaving. To enhance dynamic task realization, we propose a way not only to select service processes, but also to integrate and interleave some of them, taking advantage of control-flow compatibility. Furthermore, our solution ensures correct service consumption at both the provider and requester levels, through the fulfillment of service behaviours. Keywords: Semantic web service, composition, behaviour, selection, integration, interleaving, control flows compatibility.
1 Introduction
Dynamic service composition has several benefits. Unlike static composition, where the number of services provided to end users is limited and the services are specified at design time, dynamic composition can serve applications or users on an on-demand basis. With dynamic composition, an unlimited number of new services can be created from a limited set of service components. Besides, there is no need to keep a local catalogue of available web services in order to create composite web services, as is the case with most static composition techniques. Moreover, the application is no longer restricted to the original set of operations that were specified and envisioned at design time: the capabilities of the application can be extended at runtime. Also, the customisation of software based on the individual needs of a user can be made dynamic through the use of dynamic composition, without affecting other users of the system [19]. A dynamic composition infrastructure can also be helpful in upgrading an application.
Instead of being brought offline and having all services suspended before upgrading, users can continue to interact with the old services while the composition of the new services takes place. This provides seamless, round-the-clock upgrading of the service capabilities of existing applications [19]. This paper proposes to tackle the dynamic web service composition problem using flexible process handling: selection, integration and/or interleaving. After a review of the literature in Section 2, Section 3 details the proposed approach: the service behaviour specification is presented, the scenario matching and composition are described, and the BH architecture is introduced. Section 4 discusses the results obtained in the experiments. Finally, Section 5 presents our conclusions and future work.
2 Related Works
Several dynamic service composition approaches have been proposed and implemented in the literature. In some works [9,10,14,17,20–22], services are described by their signature (inputs, outputs and sometimes also their preconditions and effects). [17] presented an approach to combine services without any previous knowledge about how services should be chained; its complexity is high, as all possible chaining schemes need to be investigated. In [14], web service composition is viewed as a composition of semantic links that refer to semantic matchmaking between web service parameters (i.e., outputs and inputs) in order to model their connection and interaction. [10] presents a myopic method to query services for their revised parameters using the value of changed information, to update the model parameters and to recompose the web service composition at real time. In [20], the authors present an approach for identifying service composition patterns from execution logs: they locate a set of associated services using the Apriori algorithm and recover the control flows among the services by analyzing the order of service invocations. [9] suggests an optimization approach to identify a reduced set of candidate services during dynamic composition. In [21], the authors propose an approach based on gradual segmentation, taking into account both the limitedness of service capacity and the utilization of historical data, to ensure an equilibrium between the satisfaction degrees of temporally sequential requirements. However, in the above approaches unexpected capabilities may be employed, which generates uncertainty regarding how the user's information is manipulated, because they do not take the reliability properties of composition behaviours into account. Some approaches improve this by providing task decomposition rules in order to orient the service chaining process [22]. On the other hand, in several works [1–3, 6–8, 11–13, 15, 16] the authors argue that the process description is richer than the signature description, as it provides more information about the service's behaviour, thus leading to a more useful composition. In [11], two strategies are given to select component web services that are likely to successfully complete the execution of a given
sequence of operations. In [2], the authors propose a simple web service selection schema based on the user's requirements on various non-functional properties and on interaction with the system. [8] takes user constraints into account during composition; they are expressed as a finite set of logical formulas in the Knowledge Interchange Format language. In [1,3,6], provided service capabilities are matched against the capabilities required in the target user task. [12] describes services as processes and defines a request language named PQL¹, which allows finding, in a process database, those processes that contain a fragment responding to the request. [1] proposes a composition schema that integrates a set of simple services to reconstruct a task's process. In [16], the user's request is specified in a high-level manner and automatically mapped to an abstract workflow; the service instances that match the ones described in the abstract workflow, in terms of inputs, outputs, preconditions and effects, are then discovered to constitute a concrete workflow description. [15] focuses on the adaptive management of QoS-aware service composition in grid environments: the authors propose a heuristic algorithm to select candidate services for composition. In [13], the authors present a multi-agent semantic web service composition approach, adopting a composition model that uses a dedicated coordinator agent and performs negotiation between the service requester agent and all the discovered service provider agents before selecting the final provider agent. [7] proposes a Petri-net-based hierarchical dynamic service composition to accurately characterize user preferences.
3 Our Proposal
This work is an improvement of our earlier contribution in [18]. It aims to favor the realization of user needs (tasks) at real time. Our solution enhances the chance of realizing the user task, and it ensures the right usage of web services, according to the following factors:
– the flexible handling of web service behaviours (selection, integration and/or interleaving), together with control-flow compatibility, to enable the full exploitation of the available services at composition time;
– the consideration of all the provided service scenarios as primitives²;
– the ability for users to specify the required primitive scenarios³.
We consider these factors to enhance dynamic task realization through the useful and powerful exploitation of the provided services, and to fulfill both the provider's constraints and the requester's needs.
¹ PQL: Process Query Language.
² A primitive provided scenario is a scenario that must be invoked in its entirety, exactly as specified by its provider.
³ A primitive required scenario is a scenario that must be retrieved in its entirety, exactly as specified by its requester.
3.1 Service Behaviour Specification
The specification of service behaviour should rely on a formal model in order to enable the automated reasoning that provides a valid service integration realizing the user task. Both the services and the user task are considered complex, and they are described by complex behaviours. A web service behaviour defines the temporal relationships and properties between the service operations that are necessary for a valid interaction with the service [4]. We can thus regard it as the set of scenarios provided by a service, each of which can be described by a set/list of capabilities interconnected by control flows. To find all the scenarios that a service behaviour can provide, we generate a formal grammar from the service behaviour description and then substitute all the non-terminal elements by the set of their production rules. For each composite service (CS) we constitute a production rule (pr): the left side of the rule is the composed service, and the right side represents the services that compose it. The control flows sequence (•) and external choice (|), as well as the loop constructs, are represented implicitly by the formal grammar definition. In this work, we adopt two kinds of production rules to describe the iterative constructs:
– C → C1 • C1 • … • C1, for the execution of a service with a known number of iterations;
– C → C • C1 | C1, for the execution of a service with an unknown number of iterations.
The other control flows, such as parallel, synchronization and unordered (?), are added to the grammar as terminal elements. Grammar generation: let us denote by
– Des: the set of all services (composite or atomic) and control flows appearing in the service description;
– CF: the set of all control flows that may be used in the service description;
– G(S, N, T, P): the formal grammar that describes a service, where:
• S is the main service (the only service composed of all the other ones), which never appears on the right side of any production rule;
• N is the set of all composite services (non-terminal elements);
• T is the set of all atomic services and control flows (terminal elements), except the sequence, external choice and iterative constructs;
• P is the set of production rules of all the composed services.
More formally, we define the grammar parameters as follows:
– Definition 1 (set of non-terminal elements): N = {x ∈ Des / x is a composite web service}
– Definition 2 (set of used control flows): UCF = {x ∈ Des / x ∈ CF}
– Definition 3 (set of atomic services): AS = {x ∈ Des / x is an atomic service}
– Definition 4 (set of terminal elements): T = {x ∈ Des / x ∈ UCF ∪ AS}
– Definition 5 (set of production rules): P = {∀x ∈ N, ∃(ρ, ω) ∈ UCF × (N ∪ AS)⁺ / p : x → ρω}
– Definition 6 (the axiom): S is the unique x ∈ N such that, for every p ∈ P of the form p : y → ραβ with (ρ, y, (α, β)) ∈ UCF × N × A², we have x ∉ {α, β}, where A ≡ (N ∪ AS)⁺.
The algorithm for generating the formal grammar from a service description is given in Algorithm 1.
Algorithm 1. Grammar Generation
Input: SD /* service description */
Output: G(S, N, T, P)
  N ← ∅; T ← ∅; P ← ∅;
  for all C in SD do                      /* C is a composite service */
    switch σ do                           /* σ is the control flow of C */
      case construct with known iteration count:
        P ← P ∪ {C → C1 • C1 • … • C1};
        if C1 is atomic then T ← T ∪ {C1};
      case construct with unknown iteration count:
        P ← P ∪ {C → C • C1 | C1};
        if C1 is atomic then T ← T ∪ {C1};
      otherwise:
        P ← P ∪ {C → C1 σ C2 σ … σ Cn};
        T ← T ∪ {σ};
        for i ← 1 to n do
          if Ci is atomic then T ← T ∪ {Ci};
    N ← N ∪ {C};
  S ← GetMWS(P);                          /* function returning the main service */
3.2 Non-Terminals Substitution and Production Rules Refinement
Once the grammar is generated, the different scenarios of the service behaviour can be found, as mentioned above, by substituting the non-terminal elements by their production rules. To enable the generation of all the scenarios that a service behaviour can provide, the substitution process must start from the main service S. For some iterative control flows, such as iterate, repeat-while and repeat-until, the number of iterations is unknown until the actual execution is performed, which prevents the scenario from being built. The authors of [5, 23] solve this challenge by adopting prediction techniques that use the service invocation history to anticipate the number of iterations, but such predictions do not always hold, because the number can change from one invocation to another according to environment parameters⁴. Furthermore, the number of iterations is not as important as the right choice and selection of an appropriate service: if two services are ideally equivalent, then they must have the same number of iterations under the same invocation constraints. To overcome this challenge, we keep the production rule C → C1 of the iterative construct with an unknown number of iterations, and we can perform the substitution process over it. At this point the service behaviour is described by one global production rule (GPR), representing all the different scenarios that the service behaviour can provide, fitted out with a set of production rules of iterative constructs with an unknown number of iterations. This global production rule no longer contains any non-terminal element, which enables the generation of the different scenarios. To facilitate the matching operation and to generate the different scenarios of the service behaviour, we use the external choice control flow to decompose the global rule. Thereafter, these scenarios are adapted, reorganized and divided into sub-scenarios called Conversation Units (CU). Each CU constitutes a set/list of services (atomic and/or composite) interconnected by the same control flow δ, and it is described by a prefixed notation: CU: δ C1 C2 ··· Cm. The algorithm for non-terminal substitution and production rule refinement is given in Algorithm 2.
3.3 Conversation Units Matching
To match conversation units, we compute the similarity among the control flows and the services that constitute them. To enhance the realization chance of the Required Conversation Unit (RCU), we take advantage of the execution order: the CUs δC1C2C3, δC1C3C2 and δC3C1C2 are considered equal (provided δ is not a sequence). To carry out the RCU more powerfully, we also take advantage of control-flow compatibility when the required flow is more generic than the provided one. For example, if the required flow is unordered and the provided one is a split construct, it is useful to consider them compatible. To compute the similarity (Sim) between required and provided information, we propose the following four formulas, used respectively for Data (Di, input/output), Predicate (Pi, precondition/effect), Atomic Service (ASi) and Conversation Unit (CUi) similarity computing.
⁴ E.g., the service to invoke, the parameters of the invocation, etc.
Algorithm 2. Non-Terminals Substitution and Optimization
Input: P /* set of production rules */
Output: GPR
  for all p in P do
    repeat
      CanSubstitute ← False;
      SetCS ← GetAllCS(p);            /* all composite services appearing in p */
      ρ ← GetCF(p);                   /* the control flow of p */
      for all CS in SetCS do
        if (ρ_CS = ρ) or (ρ_CS = |) then   /* ρ_CS is the control flow of CS */
          p ← Ers(CS, P);             /* substitute the rules of CS into p */
          P ← Delete(CS, P);          /* delete the rules of CS */
          CanSubstitute ← True;
    until CanSubstitute = False;
  GPR ← Substitute(P);                /* substitute all remaining rules into S */
Here, i denotes a required ("r") or an offered ("o") item, $w_x$ is the weight of item x, l is a predicate lateral part, op is a predicate arithmetic operator part (e.g., =), and cp is a service parameter (a data or predicate item). We use ontologies as the data description formalism.

$$Sim(D_r, D_o) = \frac{Depth(C_o)}{Depth(C_r)} \quad \text{where } D_i \text{ is an instance of a concept } C_i \qquad (1)$$

$$Sim(P_r, P_o) = \frac{w_l \cdot Sim(l_r, l_o) + w_{op} \cdot Sim(op_r, op_o)}{w_l + w_{op}} \qquad (2)$$

$$Sim(AS_r, AS_o) = \frac{\sum_{cp_i} \frac{w_{p_i}}{n_{p_i}} \sum_{j=1}^{n_{p_i}} Sim(P_{r_j}, P_{o_j})}{\sum_{cp_i} w_{p_i}} \qquad (3)$$

$$Sim(CU_r, CU_o) = \frac{\frac{w_{as}}{n_{as}} \sum_{i=1}^{n_{as}} Sim(as_{r_i}, as_{o_i}) + w_\sigma \cdot Sim(\sigma_r, \sigma_o)}{w_{as} + w_\sigma} \qquad (4)$$
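A hedged Python reading of formulas (1)-(4); the weights default to 1 and the similarity inputs are supplied by the caller, so the functions only fix the arithmetic.

def sim_data(depth_co, depth_cr):                     # formula (1)
    return depth_co / depth_cr

def sim_predicate(sim_l, sim_op, w_l=1.0, w_op=1.0):  # formula (2)
    return (w_l * sim_l + w_op * sim_op) / (w_l + w_op)

def sim_atomic(params):                               # formula (3)
    # params: (weight w_p, [Sim(P_r, P_o) per predicate pair]) per parameter.
    num = sum(w / len(sims) * sum(sims) for w, sims in params)
    den = sum(w for w, _ in params)
    return num / den

def sim_cu(as_sims, sim_cf, w_as=1.0, w_cf=1.0):      # formula (4)
    # as_sims: Sim per atomic-service pair; sim_cf: control-flow similarity.
    return (w_as / len(as_sims) * sum(as_sims) + w_cf * sim_cf) / (w_as + w_cf)

print(sim_cu([1.0, 0.5], sim_cf=1.0))   # (0.75 + 1.0) / 2 = 0.875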
Scenarios Composition
As the service behaviour is an external choice of different scenarios provided by a service, and each scenario is set of CU s, so the Required Conversation Unit ”RCU ” can be founded either by selection, integration and/or interleaving some of Provided ones ”P CU ”(see Fig. 1 ). This aspect is rarely considered by researchers. However, we are investigating it in this work in order to fulfill powerfully the user requirements. For example, if the required conversation is: ?C1 C2 C3 C4 and the provided ones are: C1 C3 and •C2 C3 , where the symbols ”?, , •” represent ”unordered,
BH : Behavioral Handling
57
Fig. 1. Conversations handling
split, sequence” constructs respectively, so the RCU will be founded by interleaving them. Furthermore, to fulfill the provider’s constraints, we assume that each provided scenario ”set of P CU ” is primitive and it must be consumed as specified by their provider. In the same way, to fulfill the requestor’s preferences, we have ability to specify the required primitive scenarios that must be founded as indicated by their requestor. These required primitive scenarios enables the achievement and the control of P CU s integration and/or interleaving. The algorithm of P CUs Handling is given in Algorithm 3. 3.5
Architecture
Fig. 2. BH Architecture
58
M. Mekour and S.M. Benslimane
Algorithm 3. PCUs Handling
Input:  Threshold, RCU, SetPCU   /* set of PCUs */
Output: NewCU                    /* new CU */
1   Max ← 0;
2   Rang ← 1;
3   Depth ← GetDepth(RCU);                       /* get the RCU depth */
4   repeat
5     SetCombsPCU ← GetAllCombs(Rang, SetPCU);   /* all PCU combinations of rank Rang */
6     for any CombPCU in SetCombsPCU do
7       Sim ← GetSim(RCU, CombPCU);
8       if (Sim > Max) then
9         Max ← Sim;
10        NewCU ← CombPCU;
11    Rang++;
12  until (Rang = Depth + 1) or (Max = 0);
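A near-direct transcription of Algorithm 3 in Python follows. GetSim and GetDepth are stand-ins passed as callables; the Threshold input appears in the algorithm's signature but not in the visible loop, so it is omitted here.

```python
from itertools import combinations

def handle_pcus(rcu, set_pcu, get_sim, get_depth):
    best_sim, new_cu = 0.0, None
    depth = get_depth(rcu)
    for rank in range(1, depth + 1):            # Rang = 1 .. Depth
        for comb in combinations(set_pcu, rank):
            sim = get_sim(rcu, comb)
            if sim > best_sim:                  # keep the most similar combination
                best_sim, new_cu = sim, comb
        if best_sim == 0.0:                     # until ... or (Max = 0)
            break
    return new_cu
```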
As indicated in Fig. 2, the proposed architecture consists of three engines; a skeletal view of them is sketched after this list.
– The CUBuilder: i) generates the conversations for a service/task description; ii) constitutes a global conversation scenario fitted out with a set of repetitive-construct conversations; iii) builds a set of CUs (see Section 3.1); iv) deploys the generated PCUs in the UDDI and the RCUs at the client machine.
– The Matchmaker: this engine uses the information submitted by the client to compose and combine the candidate PCUs (retrieved automatically from the UDDI), either by selecting, integrating and/or interleaving some of them, to constitute the RCUs (see Section 3.4).
– The Generator: the generated scenario is encoded by the generator in a given language as an orchestration file (e.g., the process model of OWL-S). This file is used by the client to invoke and interact with the services implied in the composition.
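The paper does not publish an API for these engines, so the following class skeletons are only an assumed shape for the pipeline, with illustrative names and signatures.

```python
# Skeletal view of the three engines; all names and signatures are assumptions.

class CUBuilder:
    def build(self, description):
        """Derive the global conversation scenario and split it into CUs."""

class Matchmaker:
    def match(self, rcu, provided_cus):
        """Realize the RCU by selecting, integrating and/or interleaving PCUs."""

class Generator:
    def generate(self, composition):
        """Encode the matched scenario as an orchestration file (e.g., OWL-S)."""
```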
4 Experiments
The experiments (Fig. 3) show that the interleaving composition count is always greater than the counts obtained by the other solutions, because the latter are specific cases of interleaving. Theoretically, the interleaving composition count is greater than the sum of the counts obtained by the other solutions. As indicated in Fig. 4, when the RCU depth is:
– lower than the PCU depth, no solution can be found;
– equal to the PCU depth, the solution count is the same for all the strategies, and it equals the selection solution count;
– greater than the PCU depth, no solution can be found by selection; the interleaving solution count is always greater than the integration solution count.
We can also see that the integration/interleaving solution count increases with the difference between the RCU and PCU depths. Furthermore, our approach minimizes the number of scenarios to be combined (integrated and/or interleaved) (Fig. 5), which favors, first, the direct use of preconceived scenarios instead of constituting new ones and, second, a lower count of integrated services (Fig. 6). For example, if we have two solutions, where the first contains two scenarios and the second contains three, the chosen solution will always be the first, whatever the number of services involved, because it is the best one with respect to the scenario count.
Fig. 3. RCU realization chance
Fig. 4. CU depth influence
Fig. 5. PCUs Count
Fig. 6. Services Count

5 Conclusions and Future Works
Service composition involves the development of customized services, often by discovering, integrating, and executing existing services. It is not only about consuming services, however, but also about providing them: already existing services can be orchestrated into one or more new services that better fit a composite application. In this article we have proposed a dynamic approach to composing semantic web services. It enables process handling to constitute a concrete task description from its abstract one and the full semantic web service descriptions. This solution:
– takes into account the complexity of both the task and the services;
– fulfills correct service consumption by considering the provided scenarios as primitives;
– fulfills the user preferences through the ability to specify the required primitive scenarios;
– facilitates scenario matching by decomposing scenarios into CUs;
– increases the RCU realization chance, and thereby the task realization chance, by:
  • PCU handling (selection, integration and/or interleaving),
  • taking control-flow compatibility into account;
– enables composite web service selection and combination, as well as composite web service part-invocation.
Future work aims at the following engine refinements:
– the Matchmaker engine: combining RCUs so as to use a single PCU, and taking quality of service into account;
– the Generator engine: supporting all OWL-S constructs, and enabling the manual invocation of the composite service.
Service Oriented Grid Computing Architecture for Distributed Learning Classifier Systems

Manuel Santos, Wesley Mathew, and Filipe Pinto

Centro Algoritmi, Universidade do Minho, Guimarães, Portugal
{mfs,Wesley}@dsi.uminho.pt,
[email protected]

Abstract. Grid computing architectures are suitable for addressing the challenges of mining distributed and complex data. Service-oriented grid computing offers synchronous or asynchronous request/response services between the grid environment and end users. Gridclass is a distributed learning classifier system for data mining purposes; it combines different isolated tasks, e.g., managing data, executing algorithms, monitoring performance, and publishing results. This paper presents the design of a service-oriented architecture to support the Gridclass tasks. Services are organized in three levels according to functional criteria: the user level services, the learning grid services and the basic grid services. The results of an experimental test on the performance of the system are presented, and the benefits of such an approach are discussed.

Keywords: Distributed learning classifier system, user level services, learning grid services, basic grid services.
1 Introduction

Day by day, a phenomenal expansion of digital data is taking place in all knowledge sectors. The two data-related challenges that isolated data mining faces are the size and the location of data repositories. Executing data mining algorithms on a single computer is no longer sufficient to cope with this distributed explosion of data. The necessity brought by these requirements led scientists to devise higher-level data mining architectures such as distributed data mining architectures [11, 12]. Distributed Data Mining (DDM) architectures manage the complexity of distributed data repositories all over the world [1]. Grid computing has emerged from distributed and parallel technologies; moreover, it facilitates the coordinated sharing of computing resources across geographically distributed sites. A service-oriented grid computing architecture can provide flexible and suitable learning classifier system services on a grid platform. This paper presents the conceptual model of the grid services that are necessary for a distributed learning classifier system.

A distributed learning classifier system basically generates two levels of learning models: the local learning model and the central learning model [2]. The local learning models are generated at the different distributed sites and the global model
is generated in the central system. The design of the service-oriented grid computing architecture for distributed learning classifier systems presents three levels of services: the user level services, the learning grid services and the basic grid services.

The user level services are the top level of the hierarchical structure of the service-oriented design of the grid-based learning classifier system; this level also contains services for global system management. User level services act as the interface through which users access the services in the learning grid. The learning grid services are the middle level, between the user level services and the basic grid services; they are invoked according to the demands of the user level services. The required learning grid services are the data access service, the local model induction service, the global model induction service, the local model evaluation service, the global model evaluation service, the execution schedule service and the resource allocation service. Learning grid services are executed with the support of the basic grid services. The basic grid services are the core services provided by grid computing applications: the security management services, the data management services and the execution management services.

The service-oriented grid application can bring benefits to the area of Distributed Data Mining (DDM). The purpose of the DDM method is to avoid transferring data from the different local databases to the central database. The service-oriented approach has the flexibility to apply the data mining induction algorithm (local model induction service) to each database. Similarly, the global model induction service accumulates all the local models generated by the local model induction services and generates the global model. The main attraction of this design is to make data mining possible in a distributed environment without dealing with a huge amount of data at the central site and without transferring large volumes of data through the network.

The remaining sections of the paper are organized as follows. Section 2 explains learning classifier system technology. Section 3 presents grid computing and its importance. Section 4 describes the Gridclass system and its structure. Section 5 explains the service-oriented grid, especially the services for the distributed learning classifier system. Section 6 presents some experimental results on the performance of the system, and Section 7 concludes the paper.
2 Learning Classifier Systems

The Learning Classifier System (LCS) is a concept formally introduced by John Holland as a genetics-based machine learning algorithm. The Supervised Classifier System (UCS) is an LCS derived from XCS [3, 13]; UCS adopted many features from XCS that are suitable for a supervised learning scheme. The UCS algorithm was chosen for this implementation of grid-based distributed data mining due to the supervised nature of most problems in this area. Substantial work has been done to parallelize and distribute the canonical LCS model in order to improve performance and to suit inherently distributed problems. Manuel Santos [4] developed the DICE system, a parallel and distributed architecture for LCS. A. Giani, Dorigo and Bersini also did significant research in the area of parallel LCS [5]. Other approaches can be considered in this group. For instance, meta-learning systems construct the global population of classifiers from a collection of inherently
distributed data sources [5]. GALE is a fine-grained parallel genetic algorithm based on a classification system [6]. Finally, learning classifier system ensembles with rule sharing are another line of work related to parallel and distributed LCS [8].
3 Grid Computing

Grid computing developed from distributed computing and parallel computing technologies. Under distributed computing, only a few resources are shared among the other resources, whereas in grid computing all resources are shared; that is the main difference between the two. Cluster computing and internet computing are alternatives to grid computing. Cluster computing can share the resources of dedicated and independent machines within a particular domain [9]; although it has the benefits of high performance, high availability, load balancing and scalability, it is only available within a single domain. Internet computing can share the resources of a local area network or a wide area network [9]; its resources are connected voluntarily, so the security of the data and of the processes is a main issue. Grid computing technology is the more suitable and reliable service for resource sharing and distributed applications: a grid is able to exploit the computing power of distributed resources connected to different LANs or WANs in a reliable and secure manner.

Sharing resources in a grid brings many benefits [9]: 1) improved utilization of resources; 2) the capability to execute large-scale applications that cannot be executed within a single resource; 3) the use of heterogeneous computing resources across different locations and administrative domains; and 4) better support for collaborative applications. A grid can be considered a virtual supercomputer, because geographically distributed resources provide it with large computational power. A grid is the union of a data grid, a computing grid and a service grid; the following lines define these concepts [9]. The data grid gives access to store and retrieve data across multiple domains; it manages data access security and policies and controls the physical data stores. Computing grids are developed to provide maximum computing power to an application. The service grid provides services that are not limited to a single computer; it supports collaborative work groups that include users and applications, and users interact with applications through the services available in the service grid.
4 Gridclass System

The Gridclass system is a distributed and parallel grid-based data mining system using a supervised classifier system (UCS) [2]. Two different styles of inducing data mining models may be applied in distributed applications: Centralized Data Mining (CDM) and Distributed Data Mining (DDM) [2]. CDM is the conventional method for distributed data mining: it first collects all data from every node in the distributed environment and then applies a mining algorithm to the accumulated data. The DDM method performs data mining at every node and sends the results (learning models) to the central system, which develops the final result (global model). The Gridclass system
adopts the DDM pattern in the grid environment. In Gridclass, seven different tactics are available for constructing the global model in DDM [8]: Generalized Classifier Method (GCM); Specific Classifier Method (SCM); Weighted Classifier Method (WCM); Majority Voting Method (MVM); Model Sampling Method (MSM); Centralized Training Method (CTM); and Data Sampling Method (DSM). Gridclass is mainly composed of three modules: the toolkit, the local nodes and the central system. The toolkit is an interface that makes it easy to submit work and analyse the results. Each local node in the grid holds its own data; therefore, the individual UCS instances execute synchronously and generate local learning models. Each local node is connected to the central system, so the central system collects all the local models to build the global model using a subset of the available DDM strategies. GridGain, a Java-based grid computing middleware, is used for the implementation of the Gridclass system [10]. This middleware combines a computational grid and a data grid; furthermore, it is a simple technology for implementing grid computing. A minimal sketch of the DDM pattern is given below.
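The sketch illustrates the DDM pattern with one of the listed strategies, the Majority Voting Method (MVM): each site induces a local model, and the central system lets the local models vote. The UCS induction itself is not shown, so induce_ucs is only a placeholder.

```python
from collections import Counter

def induce_ucs(site_data):
    """Placeholder for local UCS induction; returns a callable classifier."""
    raise NotImplementedError

def mvm_predict(local_models, instance):
    # Majority Voting Method: each local model votes; the majority class wins.
    votes = Counter(model(instance) for model in local_models)
    return votes.most_common(1)[0][0]
```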
5 Service Oriented Grid

The service-oriented grid operation simplifies the relationship between the user and the grid. Figure 1 shows the basic architecture of the service-oriented grid for the learning classifier system.

Fig. 1. Basic structure of the service-grid-based distributed learning classifier system
The service-oriented grid offers different layered services for the distributed learning classifier system; those services can be invoked using web service technology. The user submits the requirements of the distributed learning classifier system at the application level; the application level then requests the service from the grid master, and the grid master executes those operations within the grid and returns the results to the application level. The application level contains the different local nodes (Ni) and a central node. The grid master (service manager) is the middle layer between the
application level and the grid level, where several resources (Ri), e.g., computers, are available. The learning process is executed at each distributed site.

Figure 2 presents the layered structure of services for the distributed learning classifier system. User level services are available at the application level, learning grid services at the grid master level, and basic grid services in the grid. Suppose, for example, that a user needs to execute the distributed learning classifier system in a service-oriented grid with specific data on particular sites. First, the user specifies the number of nodes and the locations of the data. In the learning grid, different services may be used for inducing the local models (LMi), so the user needs to indicate the specific service for inducing the learning model and the configuration parameters of that service. Similarly, many strategies are available for constructing the global model (GM); therefore, the global model construction strategy and its configuration parameters should be specified, as should the output format of the result. This is the entire information that the user specifies in the user level services (an illustrative request of this kind is sketched after Fig. 2).
Fig. 2. Layered structure of services for the distributed learning classifier system
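The following dictionary is an illustrative request a user might submit through the user level services, covering the items listed above; all field names and values are assumptions, not the system's actual schema.

```python
job_spec = {
    "nodes": ["node1.example.org", "node2.example.org"],   # distributed sites
    "data": {
        "node1.example.org": "/data/part1.csv",
        "node2.example.org": "/data/part2.csv",
    },
    "local_induction": {"service": "UCS", "params": {"population_size": 6400}},
    "global_strategy": "MVM",   # one of GCM, SCM, WCM, MVM, MSM, CTM, DSM
    "output_format": "ROC",
}
```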
The user level services invoke the services available in the grid master for the execution of the user requirements, and the grid master passes the information to the grid for the execution of the user request.

5.1 User Level Services

The global system management service is the fundamental service at the user level. This service acts as an interface between the user and the grid system, so the
global system management service collects all the information from the user about the execution of the distributed learning classifier system and invokes the learning grid services accordingly. The global system management service also displays information to monitor the execution of the distributed learning classifier system.

5.2 Learning Grid Services

Learning grid services are the main processing unit of the service-oriented distributed learning classifier system. Two types of services are available: 1) resource management services; and 2) execution management services.

The resource management services comprise five services: data access, local model induction, global model induction, local model evaluation, and global model evaluation. Data access services provide functions for fetching data from, or writing data to, the data repository. Different types of data files are supported (e.g., CSV, XML); therefore, several versions of the data access services are required in the learning grid. Local model induction services are in charge of generating the local models; various instances of them can be available in the resource management services. Global model induction services are used for constructing the global model from the different local models. Another two services exist for evaluating the performance of the local models and of the global model. Local and global models are stored in text files; the evaluation functions fetch those files, generate a graphical presentation (a ROC graph), and make it available to the user.

The execution management services [5] contain the execution schedule services and the resource allocation services. The execution schedule service programs the services for the complete cycle of the distributed learning classifier system based on the user requirements and the complexity of the problem. The first step is to trigger the local model induction services according to the number of distributed sites and then to invoke a suitable data access service for reading data from the distributed sites. While the local learning services execute, the local model evaluation services give the user feedback on the progress of the learning process. After the execution of the local model induction services, the execution scheduling service invokes the global model induction services; the data access service then fetches the local models from each distributed site and provides them to the global model induction services. After the execution of the global model induction service, the global model evaluation service presents the global model performance in a human-understandable format. The resource allocation services [9] assign the resources based on the tasks scheduled by the execution scheduling services.
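The complete cycle programmed by the execution schedule service can be summarized as below, assuming the learning grid services are exposed as callables; all names are illustrative.

```python
def run_learning_cycle(sites, data_access, induce_local, evaluate_local,
                       induce_global, evaluate_global):
    local_models = []
    for site in sites:                        # one local induction per site
        data = data_access(site)
        model = induce_local(data)
        evaluate_local(model)                 # progress feedback to the user
        local_models.append(model)
    global_model = induce_global(local_models)
    return evaluate_global(global_model)      # human-readable performance report
```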
5.3 Basic Grid Services

The basic grid services include the security services, the data management services, and the execution management services. All these services are provided by the grid computing environment and are therefore known as core services. The services in the learning grid perform their functions with the support of these core services; the end user of the grid need not be aware of the basic services available in the grid environment. The security services provide encryption and decryption services for the data, as well as authentication and authorization services [5]; the data and service transaction protocols are defined in the security management services. The data access services in the learning grid work with the support of the data management service in the basic grid services. Execution management services direct the services to the resources and play an important role in load balancing and in handling resource failures.
6 Experimental Work

The Gridclass system does not parallelize any part of the UCS itself; various instances of the UCS are executed at the different distributed sites with different sets of data. Using the conventional method of Centralized Data Mining (CDM), each distributed site s ∈ {1, .., NS} has to send its data to the central site. If each site generates R_s records, the total effort to generate a global model tends to:

T_{cdm} = M_{cdm} + \sum_{s=1}^{NS} T(R_s)
where T_{cdm} stands for the time needed to induce a global data mining model, M_{cdm} is the global modeling time, and T(R_s) stands for the communication time needed to transfer the R_s records from site s to the central site. Data security is another concern when sending data. The key advantage of the DDM method is that it avoids sending large volumes of data from the distributed sites to the central site. The effort to induce a global model can be computed as:

T_{ddm} = M_{ddm} + \sum_{s=1}^{NS} M(M_s)
where M_{ddm} corresponds to the global modeling time and M(M_s) is the modeling time for the local model M_s. When the volume of data rises, T_{ddm} tends to be much smaller than T_{cdm} (T_{ddm}