Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6413
Juan Trujillo Gillian Dobbie Hannu Kangassalo Sven Hartmann Markus Kirchberg Matti Rossi Iris Reinhartz-Berger Esteban Zimányi Flavius Frasincar (Eds.)
Advances in Conceptual Modeling – Applications and Challenges ER 2010 Workshops ACM-L, CMLSA, CMS, DE@ER, FP-UML, SeCoGIS, WISM Vancouver, BC, Canada, November 1-4, 2010 Proceedings
Volume Editors
Juan Trujillo, University of Alicante, Spain, [email protected]
Gillian Dobbie, University of Auckland, New Zealand, [email protected]
Hannu Kangassalo, University of Tampere, Finland, [email protected]
Sven Hartmann, Clausthal University of Technology, Germany, [email protected]
Markus Kirchberg, A*STAR, Singapore, [email protected]
Matti Rossi, Aalto University, Finland, [email protected]
Iris Reinhartz-Berger, University of Haifa, Israel, [email protected]
Esteban Zimányi, Free University of Brussels, Belgium, [email protected]
Flavius Frasincar, Erasmus University Rotterdam, The Netherlands, [email protected]

Library of Congress Control Number: 2010936076
CR Subject Classification (1998): D.2, D.3, H.4, I.2, H.3, H.5
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-16384-X Springer Berlin Heidelberg New York
ISBN-13 978-3-642-16384-5 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface to ER 2010 Workshops
Welcome to the workshops associated with the 29th International Conference on Conceptual Modeling (ER 2010). As always, the aim of the workshops was to give researchers and participants a forum to discuss cutting-edge research in conceptual modeling, and to pose some of the challenges that arise when applying conceptual modeling in less traditional areas. Workshops provided an intensive collaborative forum for exchanging late-breaking ideas and theories in an evolutionary stage. Topics of interest span the entire spectrum of conceptual modeling, including research and practice in areas such as theories of concepts and ontologies underlying conceptual modeling, methods and tools for developing and communicating conceptual models, and techniques for transforming conceptual models into effective implementations. In order to provoke more discussion and interaction, some workshops organized panels and/or keynote speakers, inviting renowned researchers from different areas of conceptual modeling. In all, 31 papers were accepted from a total of 82 submitted, making an overall acceptance rate of 37%. The focus of this year's seven workshops, which were selected competitively from a call for workshop proposals, ranged from the application of conceptual modeling in less traditional domains, including learning, life science applications, services, geographical systems, and Web information systems, to using conceptual modeling for different purposes, including domain engineering and UML modeling.

SeCoGIS: Semantic and Conceptual Issues in GIS
CMLSA: Conceptual Modeling of Life Sciences Applications
CMS: Conceptual Modeling of Services
ACM-L: Active Conceptual Modeling of Learning
WISM: Web Information Systems Modeling
DE@ER: Domain Engineering
FP-UML: Foundations and Practices of UML
Setting up workshops such as these requires a lot of effort. We would like to thank the Workshop Chairs and their Program Committees for their diligence in selecting the papers in this volume. We would also like to thank the main ER 2010 conference committees, particularly the Conference Co-chairs, Yair Wand and Carson Woo, the Conference Program Co-chairs, Jeff Parsons, Motoshi Saeki and Peretz Shoval, the Webmaster, William Tan, and the Proceedings Chair, Sase Singh, for their support in putting the program and proceedings together.
November 2010
Juan Trujillo Gillian Dobbie
ER 2010 Workshop Organization
Workshop Co-chairs Juan Trujillo Gillian Dobbie
Universidad de Alicante, Spain University of Auckland, New Zealand
SeCoGIS 2010 Program Chairs Jean Brodeur Esteban Zimányi
Natural Resources Canada, Canada Université Libre de Bruxelles, Belgium
SeCoGIS 2010 Program Committee Alia I. Abdelmoty Gennady Andrienko Natalia Andrienko Claudio Baptista Spiridon Bakiras Yvan Bédard Michela Bertolotto Bénédicte Bucher James D. Carswell Nicholas Chrisman Christophe Claramunt Eliseo Clementini Maria Luisa Damiani Clodoveu Davis Max Egenhofer Fernando Ferri Frederico Fonseca Antony Galton Ki-Joune Li Thérèse Libourel Jugurta Lisboa Filho Miguel R. Luaces José Macedo Pedro Rafael Muro Medrano Mir Abolfazl Mostafavi Dimitris Papadias Dieter Pfoser
Cardiff University, UK Fraunhofer Institute IAIS, Germany Fraunhofer Institute IAIS, Germany Universidade Federal de Campina Grande, Brazil City University of New York, USA Université Laval, Canada University College Dublin, Ireland Institut Géographique National, France Dublin Institute of Technology, Ireland Université Laval, Canada Naval Academy Research Institute, France University of L'Aquila, Italy University of Milano, Italy Federal University of Minas Gerais, Brazil NCGIA, USA IRPPS-CNR, Italy Penn State University, USA University of Exeter, UK Pusan National University, South Korea Université de Montpellier II, France Universidade Federal de Viçosa, Brazil Universidade da Coruña, Spain Federal University of Ceará, Brazil Universidad de Zaragoza, Spain Université Laval, Canada Hong Kong University of Science and Technology, Hong Kong Institute for the Management of Information Systems, Greece
Andrea Rodriguez Diego Seco Sylvie Servigne-Martin Emmanuel Stefanakis Kathleen Stewart Hornsby Christelle Vangenot Luis Manuel Vilches Blazquez Lubia Vinhas Jose Ramon Rıos Viqueira Nancy Wiegand
Universidad de Concepcion, Chile Universidade da Coruna, Spain INSA de Lyon, France Harokopio University of Athens, Greece University of Iowa, USA EPFL, Switzerland Universidad Politecnica de Madrid, Spain Instituto National de Pesquisas Espaciais, Brazil University of Santiago de Compostela, Spain University of Wisconsin-Madison, USA
SeCoGIS 2010 External Reviewers Francisco J. Lopez-Pellicer
CMLSA 2010 Program Chairs Yi-Ping Phoebe Chen Sven Hartmann
La Trobe University, Australia Clausthal University of Technology, Germany
CMLSA 2010 Program Committee Ramez Elmasri Amarnath Gupta Dirk Labudde Dirk Langemann Huiqing Liu Maria Mirto Oscar Pastor Fabio Porto Sudha Ram Keun Ho Ryu Thodoros Topaloglou Xiaofang Zhou
University of Texas, USA University of California San Diego, USA Mittweida University of Applied Sciences, Germany Braunschweig University of Technology, Germany Janssen Pharmaceutical Companies of Johnson & Johnson, USA University of Salento, Italy Valencia University of Technology, Spain EPF Lausanne, Switzerland University of Arizona, USA Chungbuk National University, South Korea University of Toronto, Canada The University of Queensland, Australia
CMLSA 2010 Publicity Chair Jing Wang
Massey University, New Zealand
CMS 2010 Program Chairs Markus Kirchberg Bernhard Thalheim
Institute for Infocomm Research, A*STAR, Singapore Christian-Albrechts University of Kiel, Germany
CMS 2010 Program Committee Michael Altenhofen Don Batory Athman Bouguettaya Schahram Dustdar Andreas Friesen Aditya K. Ghose Uwe Glässer Georg Grossmann Hannu Jaakkola Andreas Prinz Sudha Ram Klaus-Dieter Schewe Michael Schrefl Thu Trinh Qing Wang Yan Zhu
SAP Research CEC Karlsruhe, Germany University of Texas at Austin, USA CSIRO, Australia Vienna University of Technology, Austria SAP Research Karlsruhe, Germany University of Wollongong, Australia Simon Fraser University, Canada University of South Australia, Australia Tampere University of Technology, Finland University of Agder, Norway University of Arizona, USA Software Competence Center Hagenberg, Austria University of Linz, Austria Technical University of Clausthal, Germany University of Otago, New Zealand Southwest Jiaotong University, China
CMS 2010 External Referees Michael Huemer Florian Rosenberg Wanita Sherchan Xu Yang
ACM-L 2010 Program Chairs Hannu Kangassalo Salvatore T. March Leah Wong
University of Tampere, Finland Vanderbilt University, USA SPAWARSYSCEN Pacific, USA
ACM-L 2010 Program Committee Stefano Borgo Alfredo Cuzzocrea Giancarlo Guizzardi
ISTC-CNR, Italy University of Calabria, Italy Universidade Federal do Espírito Santo, Brazil
Raymond A Liuzzi Jari Palomäki Oscar Pastor Sudha Ram Laura Spinsanti Il-Yeol Song Bernhard Thalheim
Raymond Technologies, USA Tampere University of Technology/Pori, Finland Valencia University of Technology, Spain University of Arizona, USA LBD lab – EPFL, Switzerland Drexel University, USA Christian Albrechts University Kiel, Germany
WISM 2010 Program Chairs Flavius Frasincar Geert-Jan Houben Philippe Thiran
Erasmus University Rotterdam, The Netherlands Delft University of Technology, The Netherlands Namur University, Belgium
WISM 2010 Program Committee Syed Sibte Raza Abidi Sven Casteleyn Philipp Cimiano Roberto De Virgilio Tommaso Di Noia Flavius Frasincar Irene Garrigos Michael Grossniklaus Hyoil Han Geert-Jan Houben Zakaria Maamar Maarten Marx Michael Mrissa Oscar Pastor Dimitris Plexousakis Jose Palazzo Moreira de Oliveira Davide Rossi Hajo Reijers Philippe Thiran Christopher Thomas Erik Wilde
Dalhousie University, Canada Vrije Universiteit Brussel, Belgium University of Bielefeld, Germany Università di Roma Tre, Italy Technical University of Bari, Italy Erasmus University of Rotterdam, The Netherlands Universidad de Alicante, Spain ETH Zurich, Switzerland LeMoyne-Owen College, USA Delft University of Technology, The Netherlands Zayed University, UAE University of Amsterdam, The Netherlands Namur University, Belgium Valencia University of Technology, Spain University of Crete, Greece UFRGS, Brazil University of Bologna, Italy Eindhoven University of Technology, The Netherlands Namur University, Belgium Wright State University, USA UC Berkeley, USA
WISM 2010 External Referees C. Berberidis K. Buza
DE@ER 2010 Program Chairs Iris Reinhartz-Berger Arnon Sturm Jorn Bettin Tony Clark Sholom Cohen
University of Haifa, Israel Ben-Gurion University of the Negev, Israel Sofismo, Switzerland Middlesex University, UK Carnegie Mellon University, USA
DE@ER 2010 Program Committee Colin Atkinson Mira Balaban Balbir Barn Kim Dae-Kyoo Joerg Evermann Marcelo Fantinato Jeff Gray Atzmon Hen-Tov John Hosking Jaejoon Lee David Lorenz John McGregor Klaus Pohl Iris Reinhartz-Berger Michael Rosemann Julia Rubin Lior Schachter Klaus Schmid Keng Siau Pnina Soffer Il-Yeol Song Arnon Sturm Juha-Pekka Tolvanen Gabi Zodik
University of Mannheim, Germany Ben-Gurion University of the Negev, Israel Middlesex University, UK Oakland University, USA Memorial University of Newfoundland, Canada University of São Paulo, Brazil University of Alabama, USA Pontis, Israel University of Auckland, New Zealand Lancaster University, UK Open University, Israel Clemson University, USA University of Duisburg-Essen, Germany University of Haifa, Israel The University of Queensland, Australia IBM Haifa Research Labs, Israel Pontis, Israel University of Hildesheim, Germany University of Nebraska-Lincoln, USA University of Haifa, Israel Drexel University, USA Ben-Gurion University of the Negev, Israel MetaCase, Finland IBM Haifa Research Labs, Israel
DE@ER 2010 External Referees Andreas Metzger Ornsiri Thonggoom
FP-UML 2010 Program Chairs Gunther Pernul Matti Rossi
University of Regensburg, Germany Aalto University, Finland
FP-UML 2010 Program Committee Doo-Hwan Bae Michael Blaha Cristina Cachero Gill Dobbie Irene Garrigos Peter Green Manfred Jeusfeld Ludwik Kuzniarz Jens Lechtenborger Susanne Leist Pericles Loucopoulos Hui Ma Jose Norberto Mazon Antoni Olive Andreas L. Opdahl Jeffrey Parsons Keng Siau Il-Yeol Song Bernhard Thalheim Ambrosio Toval Juan Trujillo Panos Vassiliadis
KAIST, South Korea OMT Associates Inc., USA University of Alicante, Spain University of Auckland, New Zealand University of Alicante, Spain University of Queensland, Australia Tilburg University, The Netherlands Blekinge Institute of Technology, Sweden University of Münster, Germany University of Regensburg, Germany Loughborough University, UK Massey University, New Zealand University of Alicante, Spain Technical University of Catalonia, Spain University of Bergen, Norway Memorial University of Newfoundland, Canada University of Nebraska-Lincoln, USA Drexel University, USA Christian Albrechts University Kiel, Germany University of Murcia, Spain University of Alicante, Spain University of Ioannina, Greece
Table of Contents
SeCoGIS 2010 – Fourth International Workshop on Semantic and Conceptual Issues in Geographic Information Systems Preface to SeCoGIS 2010 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jean Brodeur and Esteban Zimányi
1
Semantical Aspects W-Ray: A Strategy to Publish Deep Web Geographic Data . . . . . . . . . . . . Helena Piccinini, Melissa Lemos, Marco A. Casanova, and Antonio L. Furtado
2
G-Map Semantic Mapping Approach to Improve Semantic Interoperability of Distributed Geospatial Web Services . . . . . . . . . . . . . . . Mohamed Bakillah and Mir Abolfazl Mostafavi
12
MGsP: Extending the GsP to Support Semantic Interoperability of Geospatial Datacubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tarek Sboui and Yvan Bédard
23
Implementation Aspects Range Queries over a Compact Representation of Minimum Bounding Rectangles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nieves R. Brisaboa, Miguel R. Luaces, Gonzalo Navarro, and Diego Seco A Sensor Observation Service Based on OGC Specifications for a Meteorological SDI in Galicia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . José R.R. Viqueira, José Varela, Joaquín Triñanes, and José M. Cotos
33
43
CMLSA 2010 – Third International Workshop on Conceptual Modeling for Life Sciences Applications Preface to CMLSA 2010 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yi-Ping Phoebe Chen, Sven Hartmann, and Jing Wang
53
Conceptual Modelling for Bio-, Eco- and Agroinformatics Provenance Management in BioSciences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sudha Ram and Jun Liu
54
Ontology-Based Agri-Environmental Planning for Whole Farm Plans . . . Hui Ma
65
CMS 2010 – First International Workshop on Conceptual Modeling of Service Preface to CMS 2010 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Markus Kirchberg and Bernhard Thalheim
75
Modeling Support for Service Integration A Formal Model for Service Mediators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Klaus-Dieter Schewe and Qing Wang Reusing Legacy Systems in a Service-Oriented Architecture: A Model-Based Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yeimi Peña, Dario Correal, and Tatiana Hernandez Intelligent Author Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qing Wang and René Noack
76
86 96
Modeling Techniques for Services Abstraction, Restriction, and Co-creation: Three Perspectives on Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maria Bergholtz, Birger Andersson, and Paul Johannesson The Resource-Service-System Model for Service Science . . . . . . . . . . . . . . . Geert Poels
107 117
ACM-L 2010 The 3rd International Workshop on Active Conceptual Modeling of Learning, ACM-L Preface to ACM-L 2010 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hannu Kangassalo, Sal March, and Leah Wong
127
Advances in Active Conceptual Modeling of Learning ACM-L 2010 Towards a Framework for Emergent Modeling . . . . . . . . . . . . . . . . . . . . . . . Ajantha Dahanayake and Bernhard Thalheim
128
When Entities Are Types: Effectively Modeling Type-Instantiation Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Faiz Currim and Sudha Ram
138
ACM-L 2009 KBB: A Knowledge-Bundle Builder for Research Studies . . . . . . . . . . . . . . David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Aaron Stewart, and Cui Tao
148
WISM 2010 – The 7th International Workshop on Web Information Systems Modeling Preface to WISM 2010 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Flavius Frasincar, Geert-Jan Houben, and Philippe Thiran
159
Web Information Systems Development and Analysis Models Integration of Dialogue Patterns into the Conceptual Model of Storyboard Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Markus Berg, Bernhard Thalheim, and Antje Düsterhöft
160
Model-Driven Development of Multidimensional Models from Web Log Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Paul Hernández, Irene Garrigós, and Jose-Norberto Mazón
170
Web Technologies and Applications Integrity Assurance for RESTful XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sebastian Graf, Lukas Lewandowski, and Marcel Waldvogel
180
Collaboration Recommendation on Academic Social Networks . . . . . . . . . Giseli Rabello Lopes, Mirella M. Moro, Leandro Krug Wives, and José Palazzo Moreira de Oliveira
190
Mining Economic Sentiment Using Argumentation Structures . . . . . . . . . . Alexander Hogenboom, Frederik Hogenboom, Uzay Kaymak, Paul Wouters, and Franciska de Jong
200
DE@ER 2010 – Domain Engineering Preface to DE@ER 2010 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Iris Reinhartz-Berger, Arnon Sturm, Jorn Bettin, Tony Clark, and Sholom Cohen
211
Methods and Tools in Domain Engineering Evaluating Domain-Specific Modelling Solutions . . . . . . . . . . . . . . . . . . . . . Parastoo Mohagheghi and Øystein Haugen
212
Towards a Reusable Unified Basis for Representing Business Domain Knowledge and Development Artifacts in Systems Engineering . . . . . . . . . Thomas Kofler and Daniel Ratiu
222
DaProS: A Data Property Specification Tool to Capture Scientific Sensor Data Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Irbis Gallegos, Ann Q. Gates, and Craig Tweedie
232
FP-UML 2010 – Sixth International Workshop on Foundations and Practices of UML Preface to FP-UML 2010 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gunther Pernul and Matti Rossi
243
Semantics and Ontologies in UML Incorporating UML Class and Activity Constructs into UEML . . . . . . . . . Andreas L. Opdahl
244
Data Modeling Is Important for SOA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Blaha
255
Representing Collectives and Their Members in UML Conceptual Models: An Ontological Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Giancarlo Guizzardi
265
Automation and Transformation in UML UML Activities at Runtime: Experiences of Using Interpreters and Running Generated Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dominik Gessenharter
275
Model-Driven Data Migration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mohammed Aboulsamh, Edward Crichton, Jim Davies, and James Welch
285
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
295
4th International Workshop on Semantic and Conceptual Issues in GIS (SeCoGIS 2010)

Preface

Recent advances in information technologies have increased the production, collection, and diffusion of geographical data, thus favoring the design and development of geographic information systems (GIS). Nowadays, GISs are emerging as a common information infrastructure that penetrates more and more aspects of our society. This has given rise to new methodological and data engineering challenges in order to accommodate new users' requirements for new applications. Conceptual and semantic modeling are ideal candidates to contribute to the development of the next generation of GIS solutions: they make it possible to elicit and capture user requirements as well as the semantics of a wide range of application domains.

The SeCoGIS workshop brings together researchers, developers, users, and practitioners carrying out research and development in geographic information systems. The aim is to stimulate discussion on the integration of conceptual modeling and semantics into current geographic information systems, and on how this will benefit end users. The workshop provides a forum for original research contributions and practical experiences of conceptual modeling and Semantic Web technologies for GIS, fosters interdisciplinary discussion in all aspects of these two fields, and highlights future trends in this area. The workshop is organized so as to stimulate interaction amongst the participants.

This edition of the workshop attracted papers from 11 different countries distributed all over the world: Brazil, Canada, Chile, France, Italy, Lebanon, Mexico, Spain, Switzerland, United Kingdom, and USA. We received 17 papers, from which the Program Committee selected 5, making an acceptance rate of 29%. The accepted papers were organized in two sessions.
The first session is devoted to semantic aspects: the first paper focuses on publishing Deep Web data, and the other two on semantic interoperability. The second session contains two papers on implementation aspects. We would like to express our gratitude to the Program Committee members and the external referees for their hard work in reviewing papers, the authors for submitting their papers, and the ER 2010 organizing committee for all their support.

July 2010
Jean Brodeur Esteban Zimányi
W-Ray: A Strategy to Publish Deep Web Geographic Data Helena Piccinini1,2, Melissa Lemos1, Marco A. Casanova1, and Antonio L. Furtado1 1 Department of Informatics – PUC-Rio – Rio de Janeiro, RJ – Brazil {hpiccinini,melissa,casanova,furtado}@inf.puc-rio.br 2 Diretoria de Informática – IBGE – Rio de Janeiro, RJ – Brazil
[email protected]

Abstract. This paper introduces an approach to address the problem of accessing conventional and geographic data from the Deep Web. The approach relies on describing the relevant data through well-structured sentences, and on publishing the sentences as Web pages, following the W3C and the Google recommendations. For conventional data, the sentences are generated with the help of database views. For vector data, the topological relationships between the objects represented are first generated, and then sentences are synthesized to describe the objects and their topological relationships. Lastly, for raster data, the geographic objects overlapping the bounding box of the data are first identified with the help of a gazetteer, and then sentences describing such objects are synthesized. The Web pages thus generated are easily indexed by traditional search engines, but they also facilitate the task of more sophisticated engines that support semantic search based on natural language features.

Keywords: Deep Web, Geographic Data, Natural Language Processing.
1 Introduction Unlike the Surface Web of static pages, the Deep Web [1] comprises data stored in databases, dynamic pages, scripted pages and multimedia data, among other types of objects. Estimates suggest that the size of the Deep Web greatly exceeds that of the Surface Web – with nearly 92,000 terabytes of data on the Deep Web versus only 167 terabytes on the Surface Web, as of 2003. In particular, Deep Web databases are typically under-represented in search engines due to the technical challenges of locating, accessing, and indexing the databases. Indeed, since Deep Web data is not available as static Web pages, traditional search engines cannot discover data stored in the databases through the traversal of hyperlinks, but rather they have to interact with (potentially) complex query interfaces. Two basic approaches to access Deep Web data have been proposed. The first approach, called surfacing, or Deep Web Crawl [16], tries to automatically fill HTML forms to query the databases. Queries are executed offline and the results are translated to static Web pages, which are then indexed [15]. The second approach, called federated search, or virtual integration [4, 18], suggests using domain-specific mediators to facilitate access to the databases. Hybrid strategies, which extend the previous approaches, have also been proposed [21]. J. Trujillo et al. (Eds.): ER 2010 Workshops, LNCS 6413, pp. 2–11, 2010. © Springer-Verlag Berlin Heidelberg 2010
Despite recent progress, accessing Deep Web data is still a challenge, for two basic reasons [20]. First, there is the question of scalability. Since the Deep Web is orders of magnitude larger than the Surface Web [1], it may not be feasible to completely index the Deep Web. Second, databases typically offer interfaces designed for human users, which complicates the development of software agents to interact with them.

This paper proposes a different approach, which we call W-Ray by analogy with medical X-Ray technology, to publish conventional and geographic data, in vector or raster format, stored in the Deep Web. The basic idea consists of creating a set of natural language sentences, with a simple structure, to describe Deep Web data, and publishing the sentences as static Web pages, which are then indexed as usual. The use of natural language sentences is interesting for three reasons. First, they lead to Web pages that are acceptable to Web crawlers, which may otherwise interpret words randomly distributed in a page as an attempt to manipulate page rank. Second, they facilitate the task of more sophisticated engines that support semantic search based on natural language features [5, 24]. Lastly, the descriptions thus generated are minimally acceptable to human users. The Web pages are generated following the W3C guidelines [3] and the recommendations published by Google to optimize Web site indexing [9].

This paper is organized as follows. Section 2 describes how to publish conventional data. Section 3 discusses how to describe geographic data in vector format. Section 4 extends the discussion to geographic data in raster format. Finally, Section 5 contains the conclusions. The details of the W-Ray approach can be found in [22].
2 The W-Ray Approach for Conventional Databases

2.1 Motivation and Overview of the Approach

The W-Ray approach to publishing conventional data as Web pages proceeds in two stages. In the first stage, the designer manually defines a set of database views that capture which data should be published, and specifies templates that indicate how sentences should be generated. The second stage is automatic: it materializes the views, translates the materialized data to natural language sentences with the help of the templates, and publishes the sentences as static Web pages. Note that metadata, typically associated with geographic data, can be processed likewise.

As an alternative to synthesizing natural language sentences, one might simply format the materialized view data as HTML tables. However, this is not a reasonable strategy, for at least two reasons. First, some search mechanisms treat tables as visual objects. Second, tables may be difficult to read even for the typical user, and altogether impossible for visually impaired users. Indeed, the third principle of the W3C recommendation [3] states that “Information and the operation of user interface must be understandable”, and item 4 of the Google Web page optimization guidelines [9] recommends that “(Web page) content should be: easy-to-read; organized around the topic; use relevant language; be fresh and unique; be primarily created for users, not search engines”. This recommendation reflects the fact that Web crawlers may interpret words randomly or repeatedly distributed in a Web page as an attempt to manipulate page rank, and thereby refuse to index the page.
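The two-stage pipeline can be sketched in a few lines of code. The snippet below is a minimal illustration, not the W-Ray tool itself: the `materialize_view` stub, the river data, and the sentence template are all invented for the example.

```python
# Sketch of the automatic second stage of the W-Ray pipeline:
# materialize a view, fill a sentence template per row, and emit a static page.
# The view rows and template below are hypothetical illustrations.

def materialize_view():
    # In practice this would run a SQL query against a designer-defined view.
    return [
        {"name": "Amazon", "length_km": 6400, "country": "Brazil"},
        {"name": "Loire", "length_km": 1006, "country": "France"},
    ]

TEMPLATE = "The river {name} is {length_km} km long and flows through {country}."

def synthesize_sentences(rows, template):
    # One well-structured natural language sentence per materialized row.
    return [template.format(**row) for row in rows]

def publish_as_html(sentences, title="Rivers"):
    # Sentences are published as plain paragraphs, not tables, so that
    # crawlers and screen readers can process them.
    body = "\n".join(f"<p>{s}</p>" for s in sentences)
    return f"<html><head><title>{title}</title></head><body>\n{body}\n</body></html>"

page = publish_as_html(synthesize_sentences(materialize_view(), TEMPLATE))
print(page)
```

In a real deployment, the resulting page would be written to a file and served at a static URL, where traditional search engines can index it.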
Finally, we observe that some of the W3C recommendations specific to visually impaired users in fact coincide with Google's guidelines. Comparing the two, it is clear that the difficulties faced by a visually impaired user are akin to those a search engine faces during data collection. As an example, both Google and the W3C recommend using the "alt" attribute to describe the content of an image. Naturally, the content of an image is opaque to both a visually impaired user and a search engine, but an alternate text describing the image can be indexed by a search engine and read (by a screen reader) to the visually impaired user. In general, many W-Ray strategies defined to address the limitations of search engines also apply to the design of a database interface for visually impaired users.

2.2 Guidelines for View Design

The designer should first select which data should be published with the help of database views. We offer the following simple guidelines:
• Attributes whose values have no semantics outside the database should not be directly published.
• Artificially generated primary keys, foreign keys that refer to such primary keys, and attributes with domains that encode classifications or similar artifacts, if selected for publication, should have their internal values replaced by their respective external definitions. For example, a classification code should be replaced by the corresponding classification term.
• Attributes that contain private data should not be published.
• Views should not contain too many attributes; only those attributes that are relevant to help locate the objects and their relationships should be selected.

2.3 Translating the Materialized Data to Natural Language Sentences

The heart of the W-Ray approach lies in the translation of materialized view data to natural language sentences. Fuchs et al.
[8] propose a single language for machine and human users, basically by translating English sentences to first-order logic. Others propose to translate RDF triples to natural language sentences [7, 13], simply by concatenating the triples. Tools to translate conventional data to RDF triples have also been developed [2, 6], which typically map database entities to classes, attributes to datatype properties, and relationships to object properties. The proposals introduced in [7, 13] do not consider sequences of RDF triples, though, which we require to compose simple sentences into more complex syntactical constructions. Therefore, we combine the strategies to synthesize sentences described in [13] with the mapping of conventional data to RDF triples introduced in [2]. The translation of materialized view data to natural language sentences involves two tasks: choice of an appropriate external vocabulary; and definition of templates to guide the synthesis of the sentences. First observe that the database schema names, including view names, are typically inappropriate to be externalized to the database users. This implies that the designer must first define an external vocabulary, that is, a set of terms that will be used to communicate materialized view data to the users. The designer should obey the following generic guideline:
W-Ray: A Strategy to Publish Deep Web Geographic Data
• The external vocabulary should preferably be a subset of a controlled vocabulary covering the application domain in question, or of a generic vocabulary, such as that of an upper-level ontology or WordNet.

If followed, this guideline permits defining hyperlinks from the terms of the external vocabulary to the terms of the controlled vocabulary. A similar strategy to synthesize sentences is discussed in [11]. An extension to WordNet is also proposed in [23] to treat concepts corresponding to compound nouns.

After selecting the external vocabulary, the designer must define the templates that will guide the synthesis of the sentences. We offer three alternatives: free template definition, default template definition, and modifiable default template definition. The first alternative leaves template definition entirely in the hands of the designer and, thus, may lead to sentences with arbitrary structure. In the default template alternative, the designer first creates an entity-relationship (ER) model that provides a high-level description of the views, and then uses a tool that generates default templates based on the ER model and synthesizes sentences with a regular syntactical structure. The last alternative is a variation of the second that allows the designer to alter the default templates.

For the free template definition alternative, we offer the following guidelines:

• A template must use the external vocabulary and other common syntactical elements (articles, conjunctions, etc.) [19], as well as punctuation marks.
• A template should generate a sentence that characterizes an entity through its properties and relationships.
• The subject of the sentence should have a variable associated with an identifying attribute of the view.
• The predicate of the sentence should have variables associated with other view attributes that further describe the entity, or that relate the entity to other entities.
The use of free templates is illustrated in what follows, using a relational view of the SIDRA database, which the Brazilian Institute of Geography and Statistics (IBGE) publishes on the Web with the help of HTML forms. The full details can be found in [22].

We start by defining views over the SIDRA database. To save space, Table 1 shows just the “political_division” view: the first column indicates the view name, the second column lists the attribute names of the view, the third column describes the attributes, and the fourth column associates a variable with each attribute. We then define a template to publish the “political_division” view data:

U is a “L” that has a total of V M for the year Y and aggregate variable A.

Table 1. Schematic definition of a view over the SIDRA database

View Name           Attribute Name    Attribute Description                                       Variable
political_division  name              name of the political division                              U
                    level             level of the political division, such as state, county,…    L
                    aggreg_var        name of an aggregation data, such as resident population    A
                    aggreg_var_value  value of the aggregation data                               V
                    unit_measure      unit measure of the aggregation data                        M
                    year              year the aggregation data was measured                      Y
...
Next, the view is materialized. Each line of the resulting table is transformed into a sentence, using the template. The following sentence illustrates the result:

Roraima is a unit of the federation that has a total of 395.725 people for the year 2007 and aggregate variable “resident population”.

Note that the underlined words are the subject of the sentence; the predicate “is a unit of the federation” qualifies the subject; and the words in boldface are view data that play the role of predicatives of the subject, together with the fragments in italics.

We now repeat the example using the default templates alternative. Recall that, in this alternative, the designer starts by creating an ER model of the views. In our running example, the ER model would be:

entity(political_division,name).
attribute(political_division,level).
attribute(political_division,aggreg_var).
attribute(political_division,aggreg_var_value).
attribute(political_division,unit_measure).
attribute(political_division,year).
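The step from ER declarations to default templates can be sketched as follows. The tuple encoding of the declarations and the function names are assumptions for illustration, not W-Ray’s actual API:

```python
# Hypothetical sketch of default-template generation from ER declarations.
# entity(political_division, name) and its attributes, encoded as tuples.
entity = ("political_division", "name")
attributes = ["level", "aggreg_var", "aggreg_var_value",
              "unit_measure", "year"]

def default_templates(entity, attributes):
    ent_name, _identifier = entity
    readable = ent_name.replace("_", " ")
    # One existence template plus one template per attribute.
    templates = [f"There is a {readable} with name {{P}}"]
    for attr in attributes:
        templates.append(f"The {attr.replace('_', ' ')} of {{P}} is {{{attr}}}")
    return templates

def synthesize(templates, row):
    return [t.format(**row) for t in templates]

row = {"P": "Roraima", "level": "unit of the federation",
       "aggreg_var": "resident population", "aggreg_var_value": "395.725",
       "unit_measure": "people", "year": "2007"}
sentences = synthesize(default_templates(entity, attributes), row)
```

Each materialized row thus yields one regularly structured sentence per template, which is exactly the “regular syntactical structure” the default alternative trades for the flexibility of free templates.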
Using the variables defined in Table 1, the tool generates default templates such as:

'There is a political division with name P'
'The level of P is L'
Using the default templates, the tool then synthesizes sentences such as (data in boldface):

'There is a political division with name Roraima'.
'The level of Roraima is unit of the federation'.
Finally, the modifiable default template alternative allows the designer to alter the default templates. Examples of template redefinitions are (where the variables in boldface italics in the new template have to occur in the default template):

Default template: 'There is a political division with name P'    New template: 'P'
Default template: 'The level of P is L'                          New template: 'is a L'

The designer is also allowed to compose the modified templates, as in the example:

facts((political_division(P),level(P,L))).
Using the modified templates, the tool synthesizes sentences such as (data in boldface):

'Roraima is a unit of the federation'
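A possible reading of the composition step is sketched below: the redefined fragments are emitted in the order given by the facts declaration and joined into a single sentence. The data structures are illustrative assumptions:

```python
# Sketch of the modifiable-default-template alternative (illustrative).
# Redefined templates, one fragment per fact in the composition.
new_templates = {
    "political_division": "{P}",      # replaces 'There is a political division with name P'
    "level":              "is a {L}", # replaces 'The level of P is L'
}
# Order taken from: facts((political_division(P), level(P, L))).
composition = ["political_division", "level"]

def compose(row):
    # Fill each redefined fragment and join them into one sentence.
    return " ".join(new_templates[slot].format(**row) for slot in composition)

sentence = compose({"P": "Roraima", "L": "unit of the federation"})
# 'Roraima is a unit of the federation'
```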
2.4 Guidelines for Publishing the Sentences as Static Web Pages

As mentioned before, W-Ray follows the W3C recommendation [3], as well as the Google Web page optimization guidelines [9]. Briefly, the most relevant criteria that W-Ray adopts to publish Web pages are:

• Create hyperlinks between the published data and metadata (W3C Recomm. 3).
• Create hyperlinks between the published data to improve data exploration via navigation (W3C Recomm. 1.3.2 and 2.4, and Google Recomm. 3 and 5).
• Create content with well-structured sentences, as addressed in Section 2.2 (W3C Recomm. 3 and Google Recomm. 4).
• Use text to describe images when the attribute “alt” does not suffice (W3C Recomm. 1.1.1 and Google Recomm. 7).

In the example of Section 2.3, the subject of the sentence – Roraima – would be hyperlinked to a Web page with further information about the State of Roraima. Briefly, the URLs would be generated upfront by concatenating a base URI with the primary key of the data (see [22] for the details).
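The upfront URL generation can be sketched in a few lines; the base URI below is a made-up example, not one used by W-Ray:

```python
from urllib.parse import quote

# Hypothetical base URI for the published view (an assumption for
# illustration; W-Ray's actual URIs are described in [22]).
BASE_URI = "http://example.org/sidra/political_division/"

def entity_url(primary_key):
    # URLs are generated upfront: base URI + primary key of the row.
    return BASE_URI + quote(str(primary_key))

# The subject 'Roraima' would then be hyperlinked, e.g.:
link = f'<a href="{entity_url(14)}">Roraima</a>'
```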
3 W-Ray for Geographical Data in Vector Format

We first observe that a number of tools [17] offer facilities to convert geographic data in vector format to dynamic Web pages. However, such Web pages are typically not indexed by search engines. We also observe that geographic data in vector format is not opaque, as raster images are, since the data is often associated with conventional data and, in fact, with the (geographic) objects stored in the database. A solution to make vector data visible to search engines would therefore be to publish the conventional data associated with it, as discussed in Section 2. This strategy would, however, totally ignore the geographic information that the vector data captures. In the W-Ray strategy, we explore how to translate the relevant geographic information, again as natural language sentences.

As a first approximation, the strategy is the same as for conventional data: define a set of database views that capture which data should be published; materialize the views; translate the materialized data to natural language sentences; and publish the sentences as static Web pages. More specifically, suppose that the vector data is organized in layers. Then, when defining a view, the designer essentially has to decide:

• Which layers will be combined in the view. For example, the view might combine the political division, populated places and waterways layers.
• For each layer included in the view, which objects will be retained in the view. For example, one might discard all populated places below a certain population.
• For each layer included in the view, which attributes will be retained in the view.
• When the view combines several layers:
  o What the priority among the layers is. For example, the populated places layer may have priority over the political division and waterways layers.
  o Which topological relationships between the objects of different layers should be materialized. For example, for each populated place (of the highest priority layer), one might decide to materialize which navigable waterways (of the lowest priority layer) are within a buffer of 100 km centered on the populated place.
  o In which topological order the objects will be described. For example, populated places might be listed from north to south and from west to east.

As for conventional data, the designer should select the external names preferably from a controlled vocabulary, such as the ISO 19115 Topic Categories [12].

For example, consider a view consisting of three layers – the political division, the populated places and the waterways of Brazil – filtered as follows:

• political division: keep only the states, with their name, abbreviated name, area and population, located in the north region
• populated places: retain only the county and state capitals, with their name, political status, area and population, located in the states of the north region
• waterways: keep only the name, navigability and flow

Furthermore, assume that the topological relationship between populated places and political division is ‘is located in’, and that between waterways and political division is ‘crosses’. Assume that populated places have priority and that they are listed from north to south and from west to east. Examples of sentences would be (using the same conventions as in Section 2.3):

Roraima is a unit of the federation that has a total of 395.725 people for the year 2007 and aggregate variable “resident population”. Roraima is located in the North Region, with an area of 22,377,870 square kilometers.

Boa Vista is a city that has a total of 249.853 people for the year 2007 and aggregate variable “resident population”. Boa Vista is located in the unit of federation Roraima and is the capital city of the unit of federation Roraima, with an area of 5,687 square kilometers.

Amazonas is a waterway that crosses the unit of federation Amazonas and the unit of federation Pará, with flow permanent and navigability navigable.

The subject of each sentence (underlined words) would also have a hyperlink to a dynamic Web page with the full information about the state or the city, generated by executing a query over the underlying database.

Using default templates, the running example would be restated as follows:

• Declaration of the entity-relationship model:

entity(political_division,name).
entity(populated_places,name).
entity(waterways,name).
attribute(political_division,population).
attribute(political_division,abbreviated_name).
attribute(political_division,area).
attribute(populated_places,level).
attribute(populated_places,local_area).
attribute(populated_places,local_population).
attribute(waterways,flow).
attribute(waterways,navigability).
relationship(located_in,[populated_places,political_division]).
relationship(crosses,[waterways,political_division]).
• Examples of synthesized sentences, using default templates (with data in boldface):

'There is a populated places with name City of Boavista'.
'There is a political division with name State of Amazonas'.
'There is a political division with name State of Pará'.
'There is a waterways with name Amazon River'.
'The flow of Amazon River is permanent'.
'The navigability of Amazon River is navigable'.
'City of Boavista is related to State of Roraima by located in'.
'Amazon River is related to State of Amazonas by crosses'.
'Amazon River is related to State of Pará by crosses'.
Turning to the modifiable default templates alternative, examples are:

• Template redefinition:

Default template: 'There is a political division with name P'    New template: 'The P'
Default template: 'R is related to P by crosses'                 New template: 'is crossed by R'
Default template: 'The flow of R is F'                           New template: 'which is F'
Default template: 'The navigability of R is V'                   New template: 'and V'

• Template composition:

facts((political_division(P),crosses(R,P),flow(R,F),navigability(R,V))).
• Sentences generated using the new templates (with data in boldface):

'The State of Amazonas is crossed by Amazon River which is permanent and navigable'
'The State of Pará is crossed by Amazon River which is permanent and navigable'
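Two of the view-definition decisions listed above for vector layers, materializing a topological relationship via a distance buffer and ordering objects north-to-south then west-to-east, can be sketched as follows. The coordinates, the planar distance test, and the 1.5-degree radius are crude illustrative assumptions (a real implementation would use proper geodesic buffers):

```python
# Illustrative sketch (not W-Ray code) of two view-definition decisions
# for vector layers. Coordinates are (lon, lat) in degrees; the buffer
# test uses a naive planar distance purely for illustration.
places = [("Boa Vista", (-60.67, 2.82)), ("Manaus", (-60.02, -3.10))]
waterways = [("Amazon River", (-60.0, -3.2)), ("Branco River", (-61.0, 1.5))]

def within_buffer(p, q, radius_deg):
    # Naive Euclidean distance in degrees, standing in for a real buffer.
    (x1, y1), (x2, y2) = p, q
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5 <= radius_deg

# Materialize the relationship: which waterways fall inside a buffer
# centered on each populated place?
nearby = {name: [w for w, wc in waterways if within_buffer(c, wc, 1.5)]
          for name, c in places}

# Describe objects north to south (descending latitude), then west to
# east (ascending longitude).
ordered = sorted(places, key=lambda p: (-p[1][1], p[1][0]))
```

The `nearby` map is what a template such as 'R is related to P by crosses' would then be filled from, one sentence per materialized pair.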
4 W-Ray for Raster Data

Following the idea introduced in Leme et al. [14], the W-Ray strategy describes raster data by publishing sentences that capture the metadata describing how the raster data was acquired, and the geographic objects contained within its bounding box. The geographic objects might be obtained, for example, from a gazetteer, such as the ADL gazetteer [10], which includes a useful Feature Type Thesaurus (FTT) for classifying geographic features. As for vector data, the designer should define views, this time based on the classification of the geographic objects.

As a concrete example, consider an image fragment of the City of Rio de Janeiro, taken from the Web site “Brazil seen from Space”, and assume that:

• the metadata of the image indeed indicates the coordinates of its bounding box
• the geographic objects and their classifications are taken from the ADL Gazetteer
• the designer decides to associate images with geographic objects classified as ‘hydrographic feature’, a topic category of the FTT, whose centroid is contained in the bounding box of the image

The raster image would then be processed as follows:

1. The georeferencing parameters are extracted from the image. In this case, the image fragment is consistent with a scale of 1:25.000 and has a bounding box defined by ((43°15’W, 22°52’30”S), (43°07’30”W, 23°S)).
2. By querying the ADL Gazetteer using the georeferencing parameters extracted in Step 1 and the selected ADL FTT term, ‘hydrographic feature’, one locates 9 objects, of which the first three are:
a. Feature(“Rodrigo de Freitas, Lagoa - Brazil”, lakes, contains)
b. Feature(“Comprido, Rio – Brazil”, streams, contains)
c. Feature(“Maracana, Rio – Brazil”, streams, contains)

The query results would be translated to the following sentence, describing the image (using the same conventions as in Section 2.3):

The image of Rio de Janeiro, Brazil, contains the lake “Rodrigo de Freitas” and the streams “Comprido” and “Maracanã”.

where the underlined words form the subject of the sentence, the words in boldface italics were extracted from the ADL FTT, and those in boldface denote geographic objects in the ADL Gazetteer whose centroids are contained in the bounding box of the image.
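A sketch of turning such gazetteer results into one descriptive sentence is shown below. The feature tuples mimic the ADL-style results above; the grouping and singularization logic is an illustrative assumption, not the authors’ implementation:

```python
# Sketch: aggregate gazetteer features by type into one image-describing
# sentence (illustrative logic, not W-Ray's implementation).
features = [
    ("Rodrigo de Freitas", "lakes", "contains"),
    ("Comprido", "streams", "contains"),
    ("Maracana", "streams", "contains"),
]

def describe(image_name, features):
    # Group feature names by their FTT type, preserving result order.
    by_type = {}
    for name, ftype, _relation in features:
        by_type.setdefault(ftype, []).append(f'"{name}"')
    parts = []
    for ftype, names in by_type.items():
        # Crude singularization when only one feature of a type occurs.
        noun = ftype if len(names) > 1 else ftype.rstrip("s")
        if len(names) == 1:
            joined = names[0]
        else:
            joined = " and ".join([", ".join(names[:-1]), names[-1]])
        parts.append(f"the {noun} {joined}")
    return f"The image of {image_name} contains " + " and ".join(parts) + "."

sentence = describe("Rio de Janeiro, Brazil", features)
```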
5 Conclusions

This paper outlined an approach to overcome the problem of accessing conventional and geographic data from the Deep Web. The approach relies on describing the data through natural language sentences, published as Web pages. The Web pages thus generated are easily indexed by traditional search engines, but they also facilitate the task of engines that support semantic search based on natural language features. The details of the approach can be found in [22].

Further work is planned to assess which of the three alternatives for generating templates, if any, leads to better recall. The experiments will use massive amounts of data from geographic databases organized by IBGE, as well as a large multimedia database. Lastly, we remark that the approach can be easily modified to generate RDF triples instead of natural language sentences, and to cope with multimedia data. In a broader perspective, it can also be used to describe conventional, geographic and multimedia data to visually impaired users. The challenge here lies in structuring the sentences in such a way as to avoid cognitive overload.

Acknowledgements. This work was partly supported by IBGE, CNPq under grants 301497/2006-0, 473110/2008-3, 557128/2009-9, FAPERJ E-26/170028/2008, and CAPES/PROCAD NF 21/2009.
References

[1] Bergman, M.K.: The Deep Web: Surfacing Hidden Value. J. Electr. Pub. 7(1) (2001)
[2] Bizer, C., Cyganiak, R.: D2R Server – Publishing Relational Databases on the Web as SPARQL Endpoints. In: Proc. 15th Int’l. WWW Conf., Edinburgh, Scotland (2006)
[3] Caldwell, B., Cooper, M., Reid, L.G., Vanderheiden, G.: Web Content Accessibility Guidelines (WCAG) 2.0. W3C Recommendation (2008)
[4] Callan, J.: Distributed information retrieval. In: Advances in Information Retrieval, pp. 127–150. Springer, US (2000)
[5] Costa, L.: Esfinge – Resposta a perguntas usando a Rede. In: Proc. Conf. Ibero-Americana IADIS WWW/Internet, Lisboa, Portugal (2005)
[6] Erling, O., Mikhailov, I.: RDF support in the Virtuoso DBMS. In: Proc. 1st Conference on Social Semantic Web, Leipzig, Germany. LNI, vol. 113, pp. 59–68 (2007)
[7] Fliedl, G., Kop, C., Vöhringer, J.: Guideline based evaluation and verbalization of OWL class and property labels. Data & Knowledge Eng. 69(4), 331–342 (2010)
[8] Fuchs, N.E., Kaljurand, K., Kuhn, T.: Attempto Controlled English for Knowledge Representation. In: Baroglio, C., Bonatti, P.A., Małuszyński, J., Marchiori, M., Polleres, A., Schaffert, S. (eds.) Reasoning Web. LNCS, vol. 5224, pp. 104–124. Springer, Heidelberg (2008)
[9] Google: Google’s Search Engine Optimization Starter Guide, Version 1.1 (2008)
[10] Alexandria Digital Library: Guide to the ADL Gazetteer Content Standard, v. 3.2 (2004)
[11] Hollink, L., Schreiber, G., Wielemaker, J., Wielinga, B.: Semantic Annotation of Image Collections. In: Proc. Knowledge Markup and Semantic Annotation Workshop, Sanibel, Florida, USA (2003)
[12] ISO 19115:2003, Geographic Information – Metadata
[13] Kalyanpur, A., Halaschek-Wiener, C., Kolovski, V., Hendler, J.: Effective NL Paraphrasing of Ontologies on the Semantic Web. In: Workshop on End-User Semantic Web Interaction, 4th Int. Semantic Web Conference, Galway, Ireland (2005)
[14] Leme, L.A.P.P., Brauner, D.F., Casanova, M.A., Breitman, K.: A Software Architecture for Automated Geographic Metadata Annotation Generation. In: Proc. XXII Simpósio Brasileiro de Banco de Dados, SBBD, João Pessoa, Brazil (2007)
[15] Madhavan, J., Afanasiev, L., Antova, L., Halevy, A.: Harnessing the Deep Web: Present and Future. In: Proc. 4th Biennial Conf. on Innovative Data Systems Research (CIDR), Asilomar, California, USA (2009)
[16] Madhavan, J., Ko, D., Kot, L., Ganapathy, V., Rasmussen, A., Halevy, A.: Google’s Deep-Web Crawl. In: Proc. VLDB, vol. 1(2), pp. 1241–1252 (2008)
[17] MapServer, http://mapserver.org/about.html#about
[18] Meng, W., Yu, C.T., Liu, K.L.: Building efficient and effective metasearch engines. ACM Computing Surveys 34(1), 48–89 (2002)
[19] Praninskas, J.: Rapid Review of English Grammar. Prentice-Hall, NJ (1975)
[20] Raghavan, S., Garcia-Molina, H.: Crawling the Hidden Web. In: Proc. VLDB, pp. 129–138 (2001)
[21] Rajaraman, A.: Kosmix: High-Performance Topic Exploration using the Deep Web. In: Proc. VLDB, Lyon, France (2009)
[22] Piccinini, H., Lemos, M., Casanova, M.A., Furtado, A.L.: W-Ray: A Strategy to Publish Deep Web Geographic Data. Tech. Rep. 10/10, Dept. Informatics, PUC-Rio (2010)
[23] Sorrentino, S., Bergamaschi, S., Gawinecki, M., Po, L.: Schema Normalization for Improving Schema Matching. In: Laender, A.H.F. (ed.) ER 2009. LNCS, vol. 5829, pp. 280–293. Springer, Heidelberg (2009)
[24] Zheng, Z.: AnswerBus question answering system. In: Proc. 2nd International Conference on Human Language, San Diego, California, pp. 399–404 (2002)
G-Map Semantic Mapping Approach to Improve Semantic Interoperability of Distributed Geospatial Web Services *
Mohamed Bakillah and Mir Abolfazl Mostafavi 1
Centre de recherche en géomatique (CRG), Université Laval, Québec, Canada, G1K 7P4
[email protected]

Abstract. The geospatial domain is influenced by Web developments; consequently, an increasing number of geospatial web services become available through the Internet. A rich description of geospatial web services is required to resolve semantic heterogeneity and achieve semantic interoperability of geospatial web services. However, existing geospatial web service descriptions, and the semantic mapping approaches employed to reconcile them, are not always rich enough, especially with respect to the semantics of spatiotemporal features. This article proposes a new semantic mapping model, G-MAP, which is based on a semantically augmented description of geospatial web services. G-MAP introduces the idea of semantic mappings between services that depend on context, and an augmented mapping technique based on dependencies between features of the concepts describing geo-services. An implementation scenario demonstrates the validity of our approach.

Keywords: Geospatial Web Service, Semantic Interoperability, Semantic Mapping, Knowledge Representation.
1 Introduction

Geospatial Web Services (GWSs) are modular components of geospatial computing applications; they can be published, discovered and invoked to access and process distributed geospatial data coming from different sources. Previously, geospatial services were available only through GIS desktop applications; today, more services are accessible on the Web and through distributed applications and networks [21]. The emergence of geospatial web services (GWSs) and service-oriented architecture (SOA) brought a new paradigm for businesses and organizations, where it is now possible to combine different geospatial web services to create more complex services that are adapted to the user’s needs. Interoperability is a key issue for the discovery and composition of GWSs, and for the development of the Geospatial Semantic Web [8]. According to ISO TC204, document N271, interoperability is “the ability of systems to provide services to and accept services from other systems and to use the services
Corresponding author.
J. Trujillo et al. (Eds.): ER 2010 Workshops, LNCS 6413, pp. 12–22, 2010. © Springer-Verlag Berlin Heidelberg 2010
so exchanged to enable them to operate effectively together.” The Open Geospatial Consortium (OGC) and ISO/TC 211 have created several standards to support interoperability of geospatial web services, such as the Web Services Description Language (WSDL), which supports the description of web services, and standard operations that allow retrieving the description of the capabilities provided by a service. SOAP is a standard protocol for service binding. These standards support interoperability at the syntactic level. However, semantic heterogeneity affecting GWSs is still an obstacle to semantic interoperability. Semantic heterogeneity is the difference in the intended meaning of concepts describing data and services [6]. Semantic interoperability allows organisations to share and re-use the knowledge they have, internally and with other stakeholders [20]. Semantic heterogeneity occurs because services are developed by different organizations, for different purposes and using different terminologies [16]. To address the problem of service discovery and interoperability, the OGC has proposed catalog services, where services are published and users can manually browse the catalog to find the service they look for; this is, however, a very tedious task. Recent approaches to service interoperability and discovery, such as [17], represent the functional capabilities of GWSs with ontologies, which are “explicit specifications of a conceptualisation”, according to Gruber’s definition [12]. Ontologies are widely used for semantic interoperability of geographic information systems [10]. An ontology is composed of concepts (or classes), relations, and axioms describing entities that are assumed to exist in a domain of interest [1]. Then, semantic mappings or semantic similarities between concepts of ontologies are used to reconcile different services or find services that match a given query. Examples of such approaches are [9][13][21][14][15][7].
To support semantic interoperability of GWSs, the description of their capabilities should be as deep as possible. In addition, the semantics of the spatial and temporal aspects of this description should be explicit. The semantic matching approach should be designed to reason with a deep description of GWSs and produce different semantic relations between them. In this paper, we present a new approach for the semantic interoperability of GWSs, which uses a new semantically augmented representation of GWSs that integrates context, the semantics of the spatial and temporal aspects of the service’s description, and dependencies between elements of the service’s description. We then propose the G-MAP semantic mapping system, which was specifically designed to compare the proposed service descriptions with inference engines, in an automatic manner. G-MAP includes a new augmented structural matching criterion that uses dependencies to find missing, implicit semantic mappings between GWS descriptions. The implementation scenario demonstrates that the approach supports semantic interoperability of GWSs and helps the user to discover and select the most relevant GWSs with respect to their requirements.
2 Related Work on Geospatial Web Services Semantic Interoperability

The Semantic Web was conceived as a huge data repository where people can search and access needed information [4]. With the emergence of web service technologies, it has also become a repository of web functionalities. Examples of geospatial web services (GWSs) include catalog and geospatial repository services, location-based services, data access and transformation services [2], as well as web map services [5]. Several approaches for the discovery, interoperability and composition of
GWSs have been proposed. Typically, in order to make a GWS available on the Web, service providers publish relevant metadata about the capabilities of their service on a web server, where requestors can discover registered services and bind to them to obtain their service [13]. With the development of Geospatial Semantic Web technologies, some approaches use formal languages that support reasoning, such as Description Logics [13][14][19].

In the work of Lutz and Klien [14] on the retrieval of geographic information, subsumption-based reasoning is used. When the user submits a search concept, the system returns a taxonomy of concepts that are subsumed by (more specific than) the search concept. However, it does not return the concepts that are more general than, or overlapping with, the search concept. For example, if the search concept is “lake”, the retrieval system may not return the concept “waterbody”, which is also relevant. Similarly, Wiegand and Garcia proposed a task-based, Semantic Web approach to retrieve geospatial data [22]. They formalize the relationships between tasks (e.g., land use management) and types of data sources. A user can submit a query to the knowledge base where the sources’ descriptions are stored in order to find the sources that correspond to a selected task. A Jena reasoning engine retrieves the sources that are associated with the requested tasks. The reasoning engine returns only the sources that completely satisfy the query; the problem is therefore the same as with subsumption reasoning.

Janowicz [13] suggests that a semantic similarity measure is preferable (or complementary) to subsumption reasoning, since it can retrieve concepts that are close in meaning to the search concept, without rejecting those that may not meet the exact condition of subsumption. He proposes a semi-automatic similarity-based retrieval approach for GWSs that uses the Web Service Modeling Language (WSML-Core).
The semantic similarity indicates to what degree the retrieved GWSs satisfy the user requirements. [23] present an ontology-driven discovery model for geographical information services, where a multilevel semantic similarity approach addresses the problem of how to select a similarity threshold above which a service is similar enough to the service request. While the recall of a semantic similarity measure is better than that of subsumption reasoning, such a measure is not expressive enough to help the user select the most relevant service. What is needed is a semantic mapping system that uses GWS descriptions with deep semantics and produces different kinds of semantic relations between them. The solution proposed in this paper is based on the G-MAP semantic mapping system, which overcomes the mentioned limitations of existing approaches. This system uses a new representation of GWSs based on a multi-view augmented concept model.
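The retrieval difference discussed in this section can be illustrated with a toy example. The taxonomy, similarity scores, and threshold below are fabricated for illustration and do not come from the cited systems:

```python
# Toy illustration of subsumption retrieval vs. similarity retrieval
# (fabricated data; not the behavior of any cited system).
# child -> parent taxonomy: reservoir is-a lake, lake is-a waterbody.
taxonomy = {"lake": "waterbody", "reservoir": "lake", "river": "waterbody"}
# Assumed pairwise similarity scores for the query concept "lake".
similarity = {("lake", "reservoir"): 0.9, ("lake", "waterbody"): 0.7,
              ("lake", "river"): 0.4}

def subsumed_by(query):
    # Subsumption returns only concepts more specific than the query
    # (direct children, for simplicity).
    return [c for c, parent in taxonomy.items() if parent == query]

def similar_to(query, threshold):
    # Similarity retrieval can also recover more general or overlapping
    # concepts, such as "waterbody" for the query "lake".
    return [b for (a, b), s in similarity.items() if a == query and s >= threshold]

subsumption_result = subsumed_by("lake")      # misses "waterbody"
similarity_result = similar_to("lake", 0.6)   # recovers "waterbody" too
```

This is exactly the recall gap noted above: the subsumption result omits the relevant but more general concept, while the thresholded similarity result includes it.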
3 Representation of Geospatial Web Services Descriptions

Semantic interoperability of geospatial web services (GWSs) depends on the richness of the semantic description of GWSs. A GWS is described with a function, input and output, and pre-conditions and post-conditions [14]. The function is the role of the GWS: for example, computing the Euclidian distance between two locations. The input is the data taken by the service (e.g., two GML points) and the output is the result of the process performed by the service (e.g., a distance). The pre-conditions and post-conditions are conditions on the input and the output, respectively; for example, the minimal spatial accuracy of the input GML points. The proposed representation of
GWS descriptions is based on the Multi-View Augmented Concept (MVAC) model that we presented in [3]. This model was developed to improve existing concept definitions, which can lack valuable features. The idea is to add two layers of semantics to the definition of a concept: a set of views valid in different contexts, and dependencies between features of the concept. The MVAC also includes spatial and temporal descriptors, which are new features that define the semantics of the spatial and temporal properties of the concept. The MVAC is defined with the following features:

cMVA = < n(c), {p(c)}, {r(c)}, {spatial_d(c)}, {temporal_d(c)}, {v(c)}, {dep(c)} >

n(c) is the name of the concept. {p(c)} is its set of properties. {r(c)} is the set of relations that c has with other concepts. {spatial_d(c)} is a set of spatial descriptors about the spatiality of the concept. The spatiality of a concept can be described as a part of a thing, for instance “center of, axis of, contour of, top of…”. Spatial descriptors also include characteristics related to geometry: shape, area, length, etc. {temporal_d(c)} is a set of temporal descriptors about the temporality of the concept. The semantics of temporality is an occurrent: a process, event or change that occurs in time. Temporal descriptors also include temporal characteristics, such as duration and frequency. {v(c)} is a set of views, and {dep(c)} is a set of dependencies.

A view is a selection of features that are valid in a given context. Views are indicated with the following expression: context (context value) → feature (concept, [set of feature’s values]), which reads as: if the context is “context value”, then the value of “feature” is one of the [set of feature’s values]. For example, two possible views of the concept watercourse may be: context (flooding) → function (watercourse, evacuation area), and context (tourism) → function (watercourse, [navigable, skating]).
The second view indicates that when the context is “tourism”, the possible values of the property “function” of the concept “watercourse” are “navigable” or “skating”. Dependencies express that the values of one feature are related to the values of another feature. We formalize dependencies with rules of the form head → body, for example: Is-a (land, lowland) → FloodRisk (land, high), where Is-a (land, lowland) reads as “land is-a lowland”. We propose that some or all parameters of a GWS description (function, input, output, pre-conditions, and post-conditions) can be semantically described with an MVAC concept. For example, consider a GWS that finds flood risk zones inside a given geographical region, given in OWL abstract syntax:
Class(input complete restriction(is-A someValuesFrom(GML: surface)))
Class(pre-condition complete restriction(part-of someValuesFrom(NorthAmerica)))
Class(function complete restriction(is-A someValuesFrom(LocalisationOfFloodRiskZone)))
Class(output complete restriction(is-A someValuesFrom(GML: surface)) restriction(hasContext someValuesFrom(floodDisasterResponse, floodPrevention)))
Class(output_FloodPrevention_Context complete restriction(is-A someValuesFrom(GML: surface)) restriction(CloseTo someValuesFrom(waterbody)))
Class(output_floodDisasterResponse_Context complete restriction(is-A someValuesFrom(GML: surface)) restriction(AdjacentTo someValuesFrom(waterbody)))
Class(post-condition complete restriction(hasSpatialAccuracy(5meters)))
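To make the MVAC structure concrete, the following is a minimal sketch in Python (hypothetical; the paper does not provide an implementation, and all class and field names here are illustrative) of a concept whose feature values depend on the active context:

```python
from dataclasses import dataclass, field

@dataclass
class View:
    """A view selects the feature values valid in a given context."""
    context: str
    feature: str
    values: list

@dataclass
class MVAC:
    """Multi-View Augmented Concept: name, properties, views, dependencies.
    Spatial/temporal descriptors are omitted for brevity."""
    name: str
    properties: dict = field(default_factory=dict)
    views: list = field(default_factory=list)
    dependencies: list = field(default_factory=list)  # (head, body) rule pairs

    def values_in_context(self, context, feature):
        """Resolve a feature's admissible values under a context,
        falling back to the context-free property values."""
        for v in self.views:
            if v.context == context and v.feature == feature:
                return v.values
        return self.properties.get(feature, [])

# The watercourse example from the text (default function value assumed).
watercourse = MVAC(
    name="watercourse",
    properties={"function": ["drainage"]},
    views=[View("flooding", "function", ["evacuation area"]),
           View("tourism", "function", ["navigable", "skating"])],
)

assert watercourse.values_in_context("flooding", "function") == ["evacuation area"]
assert watercourse.values_in_context("tourism", "function") == ["navigable", "skating"]
```

The fallback to context-free property values is a design choice of this sketch, not something the MVAC model prescribes.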
M. Bakillah and M.A. Mostafavi
Class(floodedLand complete restriction(is-A someValuesFrom(GML: surface)) restriction(depth hasSomeValuesFrom(high)) restriction(status hasSomeValuesFrom(navigable)))
The GWS description indicates that two contexts are possible: flood disaster response and flood prevention. Under those views, a flood risk zone is defined as a surface adjacent to a waterbody or a surface close to a waterbody, respectively, because the conception of the degree of risk in disaster response is different from its conception in disaster prevention. The last class indicates a dependency between the depth of water of a flooded land and its navigable status. The MVAC-based GWS descriptions are the input knowledge representation of the proposed semantic mapping system.
4 G-MAP Augmented Semantic Mapping System In this section, we present the G-MAP augmented semantic mapping system and its core components. Figure 1 illustrates the architecture of the G-MAP system.
[Figure content: GWS Descriptions feed the MVAC Service Description Generation Tool, which produces MVAC GWS Descriptions; a User Query Interface issues a lexical-to-semantic transformation query; the Basic Element Lexical Matcher, using External Resources, outputs basic element mappings to the Complex Mapping Inference Engine, whose Semantic Inference Engine has spatial, temporal, and thematic semantic mapping components that populate a Fact Base and use a Mapping Rules Base; the Augmented Mapping Inference Engine extracts dependencies and produces the final output: Multi-View Augmented Mappings]
Fig. 1. Architecture of the G-MAP Semantic Mapping System
G-MAP executes a gradual process that takes as input the MVAC Geospatial Web Service descriptions (MVAC GWS descriptions) and a query, matches elements of the MVAC GWS descriptions with elements of the query in three main steps, and outputs the semantic relations between the query and the MVAC GWS descriptions. G-MAP is an automatic process, since it uses reasoning rules that infer the semantic relations without user intervention. Prior to G-MAP's action, the MVAC Service Description Generation Tool is responsible for building the MVAC GWS descriptions. This process is described in [3]. A query interface allows the service requestor to formulate a query, which is a template of a requested GWS description. The three steps of G-MAP, identified with grey boxes in Fig. 1, are described in the following paragraphs.
4.1 Basic Matching This first component of G-Map computes the semantic mappings between the simplest elements of the MVACs that describe the services’ parameters (input, output, function, pre-conditions, and post-conditions). These simplest elements, referred to as basic MVAC elements, are the terms used to designate any MVAC features, including names of properties, relations, spatial and temporal descriptors, or their values. The process includes two main steps. First, the Basic Element Lexical Matcher computes a lexical relation (synonymy, hyponymy, hypernymy, partonomy) for a pair of elements. This lexical relation is determined with the help of an appropriate external resource, for example, a global ontology holding standardized vocabulary about geometrical shapes, spatial relations of topology, etc., a global ontology of time (temporal relations and attributes), or a domain-independent global ontology. In the second step, this lexical relation is transformed into a semantic relation between basic MVAC elements: {equivalence, includes, included in, disjoint}. An example of such a transformation is provided in [18]. The Complex Mapping Inference Engine reuses the semantic relations between basic MVAC elements. 4.2 Complex Mapping Inference Engine The role of the Complex Mapping Inference Engine is to infer semantic relations between complex MVAC elements (properties, relations, descriptors, views, MVACs, and finally, GWS descriptions), based on the semantic relations between the basic MVAC elements that compose them. This inference problem is formulated as the problem of verifying a set of logical rules, which express the conditions for a semantic relation between two complex MVAC elements to be true. A semantic mapping rule consists of a mapping rule antecedent and a mapping rule consequent.
The consequent is a semantic relation between two complex MVAC elements, and the antecedent is a conjunction and/or disjunction of conditions on semantic relations between basic MVAC elements that must be respected for the consequent to be verified. For example, the condition for equivalence between two spatial properties x and y: p(x) ∧ p(y) ∧ name (x, np1) ∧ name (y, np2) ∧ range (x, rp1) ∧ range (y, rp2) ∧ spatial_descriptors (x, sd1) ∧ spatial_descriptors (y, sd2) ∧ equivalent (np1, np2) ∧ equivalent (rp1, rp2) ∧ equivalent (sd1, sd2) ⇒ equivalent (x, y)
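The rule above can be sketched in code. This is a hypothetical Python rendering (not the paper's implementation): the fact base holds semantic relations between basic MVAC elements, the antecedent is a conjunction of membership tests, and a verified rule adds its consequent back to the fact base. The concrete property values are invented for illustration:

```python
# Fact base: semantic relations between basic MVAC elements, as produced
# by the Basic Matching step (tuples are illustrative).
facts = {("equivalent", "depth", "water level"),
         ("equivalent", "[0..50m]", "[0..50m]"),
         ("equivalent", "vertical extent", "vertical extent")}

def equivalent_properties(x, y, facts):
    """Antecedent of the equivalence rule for two spatial properties:
    names, ranges, and spatial descriptors must all be equivalent."""
    return (("equivalent", x["name"], y["name"]) in facts and
            ("equivalent", x["range"], y["range"]) in facts and
            ("equivalent", x["spatial_d"], y["spatial_d"]) in facts)

p1 = {"name": "depth", "range": "[0..50m]", "spatial_d": "vertical extent"}
p2 = {"name": "water level", "range": "[0..50m]", "spatial_d": "vertical extent"}

# If the antecedent holds, the consequent is added to the fact base.
if equivalent_properties(p1, p2, facts):
    facts.add(("equivalent", "p1", "p2"))

assert ("equivalent", "p1", "p2") in facts
```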
We have created the rules that compose the Mapping Rule Base using logic and set theory. The general principle is that two MVACs overlap if, according to their definitions, they can share a common set of instances. A concept’s feature (e.g., a property) is seen as a very simple concept with only one property. Therefore, for example, two properties overlap if their names and their ranges (sets of values) are not semantically disjoint. First, the semantic relations between basic MVAC elements are translated into statements that can be compared against the antecedents of mapping rules. Statements are stored in the Fact Base. The Mapping Inference Engine, which has spatial, thematic, and temporal components responsible for matching the corresponding features, matches facts of the Fact Base against rules in the Mapping Rule Base. If a rule is verified, the relation stated in the consequent is added to the Fact
Base as a new statement. The Mapping Inference Engine continues verifying rules until no rules remain in the Mapping Rule Base. Note that the mapping of spatial and temporal properties depends on the mapping of spatial and temporal descriptors; therefore, spatial and temporal descriptors are mapped prior to properties. The contribution of the Complex Mapping Inference Engine is its ability to compare concepts whose structure is more complex than those handled by existing semantic mapping approaches that produce semantic relations. For example, [18] and [11] consider only hierarchical relations between concepts (e.g., is-a), whereas we have developed mapping rules that take as input any kind of relation, comparing names and ranges while preserving structure. Also, G-MAP takes as input properties that are enriched with descriptors, a capacity that does not exist in previous systems. 4.3 Augmented Mapping Inference Engine The contribution of the Augmented Mapping Inference Engine is to exploit dependencies to discover missing mappings between MVAC elements. For example, consider two properties, depth of watercourse and water level. It is probable that no external resource, such as a lexicon, can help to discover that they represent the same property. However, if we discover that they participate in similar dependencies, we can infer that they may represent the same property. The Augmented Mapping Inference Engine extracts the dependencies from the MVAC GWS descriptions. In parallel, the system extracts from the Fact Base the non-equivalent pairs of MVAC elements. We assume that the semantic relation between those elements can be false, because implicit information (contained in dependencies) that was not considered can modify the result. Dependencies of different MVACs are matched under the assumption that the mismatching elements are non-disjoint.
If, with this assumption, the dependencies match, then the previously mismatching elements are presented to the user as a new match. For example, consider the following dependencies: d1: depth(floodedLand, high) → status(floodedLand, navigable), and d2: water level(floodplain, high) → status(floodplain, navigable), with the semantic relation equivalent(floodedLand, floodplain). If we make the assumption equivalent(depth, water level), we find that d1 and d2 are equivalent, and conclude that equivalent(depth, water level) was an implicit mapping. The final augmented mappings are displayed to the user, who can select the geospatial web service that best matches the query based on the computed semantic relations.
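The dependency-matching idea can be sketched as follows (hypothetical Python, not the paper's implementation; the tuple encoding of dependencies is an assumption of this sketch): tentatively treat the mismatching pair as equivalent and report it as an implicit mapping if the dependencies then match.

```python
def deps_match(d1, d2, assumed_equivalent):
    """A dependency is a (head, body) pair of (feature, concept, value)
    triples. Two dependencies match if corresponding terms are equal or
    assumed equivalent."""
    def term_eq(a, b):
        return a == b or frozenset((a, b)) in assumed_equivalent
    return all(term_eq(a, b)
               for t1, t2 in zip(d1, d2)      # head with head, body with body
               for a, b in zip(t1, t2))       # term by term

equiv = {frozenset(("floodedLand", "floodplain"))}   # known mapping
candidate = frozenset(("depth", "water level"))      # mismatching pair

# d1 and d2 from the paper's example.
d1 = (("depth", "floodedLand", "high"), ("status", "floodedLand", "navigable"))
d2 = (("water level", "floodplain", "high"), ("status", "floodplain", "navigable"))

# Tentatively assume the candidate pair is equivalent; if the dependencies
# then match, the assumption is an implicit mapping.
if deps_match(d1, d2, equiv | {candidate}):
    equiv.add(candidate)

assert candidate in equiv
```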
5 Implementation of Our Approach To demonstrate the feasibility of our approach, we implemented it in Java, using OWL descriptions of GWS. We show a scenario where an expert user responsible for flood management searches for flood risk zones in Canada. The expert specifies that the zones returned by the requested service should have an elevation of 4 meters or less to be considered flood risk zones. The expert’s request is formulated as a GWS description, based on the vocabulary of the expert’s ontology, shown in OWL abstract syntax:
Class(input complete restriction(is-A someValuesFrom(GML: surface)))
Class(pre-condition complete restriction(part-of someValuesFrom(Canada)))
Class(function complete restriction(is-A someValuesFrom(FindFloodRiskZone)) restriction(Before someValuesFrom(Storm)))
Class(output complete restriction(is-A someValuesFrom(GML: surface)) restriction(Elevation someValuesFrom(
<xsd:complexType>
  <xsd:sequence>
    <xsd:element name="VehRentalCore">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="PickUpLocation">
            <xsd:complexType>
              <xsd:attribute name="LocationCode" type="xsd:string"/>
          <xsd:element name="ReturnLocation">
            <xsd:complexType>
              <xsd:attribute name="LocationCode" type="xsd:string"/>
        <xsd:attribute name="PickUpDateTime" type="xsd:string"/>
        <xsd:attribute name="ReturnDateTime" type="xsd:string"/>
    <xsd:element name="VendorPrefs">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="VendorPref" minOccurs="0" maxOccurs="unbounded">
            <xsd:complexType>
              <xsd:attribute name="CompanyShortName" type="xsd:string"/>
              <xsd:attribute name="Code" type="xsd:string"/>
              <xsd:attribute name="PreferLevel" type="xsd:string"/>
              <xsd:attribute name="Status" type="xsd:string"/>
Fig. 2. Sample XSD file. An XSD file defines the structure of the XML files for a service.
M. Blaha
We further propose the use of a UML class model for representing enterprise data. The XSD notation is too verbose for such a representation. Also an XSD hierarchy is skewed towards a single service. In contrast, a UML class model can transcend individual services. The issue then becomes how to derive an XSD file for a service from an enterprise model. The most difficult aspect of mapping UML class models to XSD files is the handling of associations that reach across hierarchies. The key is the treatment of identity [4]. Associations can reach across hierarchies by referencing external identifiers. We distinguish external identifiers (unique combinations of real-world attributes) from internal identifiers (meaningless fields that are unique and used for internal links). Developers can recover ideas from existing XSDs and incorporate them into an enterprise data model via reverse engineering. In this case, the input is the XSDs (or XML data implying XSDs) and the output is an enterprise data model. Each XSD file gives a piecemeal glimpse of the underlying enterprise model. Since XSD files often lack a uniform abstraction basis, it can be difficult to merge XSD files. It is best to identify subject areas, integrate within the subject areas, and then integrate for the enterprise. Note that the enterprise data model defines concepts as the intellectual basis for services. Thus you cannot merely construct a literal data model of the requirements. It does not suffice to have a rote representation of the source use cases. Instead you must abstract requirements to reconcile inconsistencies and get at the deeper meaning. Such an abstract model is more profound, more stable, more extensible, and more valuable to a business. By necessity, the coupling between an enterprise model and XSDs will be loose, as SOA services will be of different vintages and will have different snapshots of the evolving enterprise model. Fig. 3 shows a simple enterprise data model. 
The appropriate XSD hierarchy depends on the service. Each service has a root and fleshes out lower levels by traversing the enterprise model. Thus the findOrders service has Order at level 1. Customer and ProductType are one traversal away from Order and at level 2. Supplier is at level 3 under ProductType. The use of an enterprise data model is a necessary but not sufficient technology. An enterprise data model makes it possible to integrate services, but does not cause XSDs to be integrated. Developers still must have personal discipline and use a robust development process. In contrast, with an XSD editor alone it is all but impossible to obtain the overall perspective that makes integration possible, regardless of the development process.
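The level assignment described above amounts to a breadth-first traversal of the enterprise model's associations from the service's root class. A short Python sketch (hypothetical names; the paper does not give an algorithm):

```python
from collections import deque

# Associations of the simple enterprise model in Fig. 3, as an adjacency map.
associations = {
    "Order": ["Customer", "ProductType"],
    "Customer": ["Order"],
    "ProductType": ["Order", "Supplier"],
    "Supplier": ["ProductType"],
}

def levels_from(root):
    """Assign each class its level in the XSD hierarchy of a service
    rooted at `root`, by breadth-first traversal of associations."""
    seen, queue = {root: 1}, deque([root])
    while queue:
        cls = queue.popleft()
        for nxt in associations.get(cls, []):
            if nxt not in seen:
                seen[nxt] = seen[cls] + 1
                queue.append(nxt)
    return seen

# findOrders: Order at level 1, Customer and ProductType at 2, Supplier at 3.
assert levels_from("Order") == {"Order": 1, "Customer": 2,
                                "ProductType": 2, "Supplier": 3}
```

The same traversal reproduces the other service hierarchies of Fig. 3 by changing the root.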
3 Specification vs. Implementation Another way to think about SOA is to regard an XSD file as part of the specification for a service. The XSD file is just a partial specification as functionality must also be documented. The actual service code is then the implementation. This distinction between specification and implementation is analogous to that with the Eiffel programming language [15]. Eiffel rigorously separates specification from implementation. An Eiffel contract is the specification that programming code implements. Eiffel presumes that the contract is more difficult to change than the code. Eiffel classes communicate only via the specification that invokes the code.
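The specification-vs.-implementation split can be illustrated outside Eiffel. Below is a hypothetical Python sketch (all names invented) where a contract of pre- and postconditions plays the role of the stable specification, while the implementation behind it may be swapped freely:

```python
def contract(pre, post):
    """Wrap an implementation with a specification: a precondition on the
    arguments and a postcondition on the result."""
    def wrap(impl):
        def checked(*args):
            assert pre(*args), "precondition violated"
            result = impl(*args)
            assert post(result), "postcondition violated"
            return result
        return checked
    return wrap

@contract(pre=lambda ids: len(ids) > 0, post=lambda r: r >= 0)
def find_orders(customer_ids):
    """Implementation detail: may be replaced by a faster algorithm
    without disturbing clients, as long as the contract holds."""
    return len(customer_ids)

assert find_orders(["c1", "c2"]) == 2
```

Clients depend only on the contract, which mirrors the point that XSD specifications must stay stable while service code evolves.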
Data Modeling Is Important for SOA
[Figure content: UML enterprise model with classes Customer (name, address, phoneNumber), Order (orderNumber, dateTime), ProductType (name, code), and Supplier (name, address, phoneNumber), linked by one-to-many and many-to-many associations; below, the XSD hierarchies of the FindCustomers, FindOrders, FindProducts, and FindSuppliers services, each rooted at its namesake class]
Fig. 3. Simple enterprise data model. An enterprise data model can coordinate services across an organization.
Since a service can reference other services, a change to an XSD specification is disruptive and to be avoided. In particular, “the stability of service interfaces is the key to SOA success” [6]. In contrast, service code is internal and not directly accessed. The code for a service can be changed as long as it is correct, executes efficiently, and is built professionally. Thus developers could substitute a faster algorithm or broaden the capabilities of a service without disrupting clients that invoke it. When we build software we often start with a data model of the critical concepts [5] [8]. In a similar manner, SOA development should give prominence to an enterprise data model and use it to help drive the vision for the SOA roadmap.
4 Example: Services for a Large Company A large company’s experience with services illustrates the drawbacks of disjointed XSD files. The company currently has about 100 XSD files. The services have been designed by different teams over a period of several years and so, not surprisingly, the XSD files are inconsistent and redundant. The chaos is getting worse as the number of services grows. Forecasts call for a 10 to 100 fold increase in services over the upcoming years. There are multiple flaws.
• Redundancy. The XSD files have much redundancy. For example, there is an address XSD file. In addition, the bank and person XSD files each have their own address data that differs from the address XSD file.
• Element vs. attribute. There is no obvious reason why some fields are defined as XSD elements and others as XSD attributes. For example, address has a stateProvince element and identification has a stateProvince attribute.
• Data types. Data types are inconsistent. For example, some dates are defined as strings; others are defined as dates.
• Element/attribute multiplicity. The XSD files are haphazard with their use of required fields.
• Element inclusion vs. element reference. The XSD files have no clear policy for embedding a local element vs. referencing a global element.
There are also problems with the files collectively. It is difficult to find concepts, and this will only worsen as the number of XSD files increases. Similarly, with so many XSD files, it is unclear where to place new concepts. One reason for this chaos is the lack of XSD enterprise modeling tools. Most XSD tools present a hierarchy for an individual service. Few tools can take a data model and generate XSDs. The tools are handicapped by a lack of agreement in the literature on how to map data models to XSDs. Also, a tool must devise a user interface so that a developer can indicate how to traverse an enterprise model to generate a hierarchy. A further problem is the distributed ownership of data. Many organizations have weak central control because most of the IT budget is allocated to departments and individual projects. The benefits of centralized control are diffuse, and the lack of control only gradually becomes apparent as information systems age. IT management often lacks the expertise and incentives to deal with such gradual, long-term problems. The use of an enterprise data model would not only reconcile the XSD files, but would also improve understanding and provide the underpinning for a more rigorous development practice.
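Some of these inconsistencies are mechanically detectable. The sketch below (hypothetical Python using the standard library; the schemas are invented minimal examples) flags a field declared as an XSD element in one file but as an XSD attribute in another, as with stateProvince above:

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

def field_kinds(xsd_text):
    """Collect (name, kind) pairs for every element and attribute
    declaration in a schema document."""
    root = ET.fromstring(xsd_text)
    kinds = set()
    for el in root.iter():
        if el.tag == XS + "element":
            kinds.add((el.get("name"), "element"))
        elif el.tag == XS + "attribute":
            kinds.add((el.get("name"), "attribute"))
    return kinds

address = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="stateProvince" type="xsd:string"/>
</xsd:schema>"""
identification = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:attribute name="stateProvince" type="xsd:string"/>
</xsd:schema>"""

kinds = field_kinds(address) | field_kinds(identification)
inconsistent = {n for n, k in kinds
                if (n, "element") in kinds and (n, "attribute") in kinds}
assert inconsistent == {"stateProvince"}
```

Such an audit catches symptoms; reconciling the files still requires the enterprise data model argued for in the text.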
5 Example: Open Travel Alliance Standard We prepared a data model for the car portion of the message users guide [16]. There are twenty-six use cases with XML data that cover scenarios for car services. The use cases structure the data into different hierarchies. The model is incomplete because the sample data lacks details such as whether data is required or optional. Nevertheless the data model did aid our understanding and is more concise than the XML data. The style of the car XML data was more uniform than the XSD files in Section 4 — this is not surprising given that the standard was created by the same team all at once. Fig. 4 and Fig. 5 show data models for customer and vehicle from the car data.
6 Example: Digital Weather Standard References [10] and [11] present an XSD specification for digital weather. The data is essentially just a collection of many numbers for various kinds of measurements — such as temperature, precipitation, wind speed, and cloud cover — as well as the date, time, and location. Intrinsically, there are few cross references across the hierarchies, much less than with the typical business information system. The XSD design protocol is mostly uniform, as would be expected with a standard. However, there is some variation in the choice of XSD element and XSD attribute with no obvious reason for the variation.
[Figure content: UML customer data model with classes Customer, PersonName (givenName, surname, namePrefix[0..1], nameSuffix[0..1], citizenCountry), Document (number, issueStateProvince, issueCountry, type, birthDate, expireDate), Telephone (phoneTechType, areaCityCode, phoneNumber), Email (type, emailAddress), Address (type, streetNumber, cityName, postalCode, stateProvince), Country (code, name), AddedDriver (startDate, endDate, relation, corpDiscountName, corpDiscountNmbr, qualificationMethod), CustLoyaltyProgram (name), and CustLoyaltyAccount (membershipNumber, travelSector)]
Fig. 4. UML customer data model — from Open Travel Alliance
7 Example: ACORD Life, Annuity and Health Standard We skimmed this standard [1] [2]. The amount of explanation is overwhelming. The documentation is thorough and extensive — 3542 pages define 462 objects. The documentation would be better yet if accompanied by a data model highlighting the major concepts and relationships.
8 Example: GraphML Standard GraphML is a file interchange format for applications that use graphs [12] [13]. A core language describes graph structure and an extension mechanism handles application-specific data. Fig. 6 shows the data model for graph structure. The model reflects the structure defined in the XSD files as well as our understanding of graphs. A graph is a set of nodes and edges. A node is something that is of interest. An edge is a coupling between nodes. Nodes and edges can connect directly or they can connect via intermediate ports. A port is a defined position on a node for making a connection.
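A hypothetical Python sketch of the GraphML structural concepts (class and field names are illustrative, not GraphML's XSD types): nodes with ports, edges with source and target, and a graph-level edge default that individual edges may override.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Port:
    """A defined position on a node for making a connection."""
    name: str

@dataclass
class Node:
    id: str
    ports: List[Port] = field(default_factory=list)

@dataclass
class Edge:
    source: Node
    target: Node
    directed: Optional[bool] = None  # None = defer to the graph's edgeDefault

@dataclass
class Graph:
    edge_default: str = "undirected"
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

    def is_directed(self, edge):
        """An edge overrides the graph default when `directed` is set."""
        if edge.directed is not None:
            return edge.directed
        return self.edge_default == "directed"

g = Graph(edge_default="undirected")
a, b = Node("a"), Node("b")
g.nodes += [a, b]
g.edges.append(Edge(a, b, directed=True))  # overrides the graph default

assert g.is_directed(g.edges[0]) is True
```

Hyperedges and endpoints would extend this sketch with an edge class holding a list of endpoints rather than a source/target pair.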
[Figure content: UML vehicle data model with classes Vehicle (code, codeContext, airConditioned, fuelType, transmissionType, passengerQuantity, baggageQuantity, isConfirmable, distanceUnit, distancePerFuelUnit, includeExclude, returnVehicleIndicator, description), VehicleIdentity (vehicleAssetNumber, licensePlateNumber, stateProvCode, countryCode, vehicleID_number, vehicleColor), VehicleMakeModel (code, name, modelYear), VehicleType (vehicleCategory, doorCount), VehicleClass (size), VehicleRentalDetails (parkingLocation, condition), ConditionReport, FuelLevelDetails (fuelLevelValue), and OdometerReading (quantity, unitOfMeasure)]
Fig. 5. UML vehicle data model — from Open Travel Alliance
GraphML supports both directed and undirected graphs; the edges in a graph are directed or undirected by default as indicated by edgeDefault. An edge can override the graph default (via directed); thus a graph can have both directed and undirected edges. GraphML also supports hyperedges — a generalized edge that can connect more than two nodes. An endpoint is an end of a hyperedge. Ordinary edges have two endpoints — source and target. Hyperedges, by definition, have multiple endpoints. Hyperedges cannot directly connect to nodes and only connect via endpoints and ports. GraphML has XSD definitions for occurrences (Graph, Edge, Node, Hyperedge, Port, and Endpoint) as well as for types (GraphType, EdgeType, NodeType, HyperedgeType, PortType, and EndpointType). The type definitions give rise to the data structure in Fig. 6. The type definitions use XSD elements sparingly and favor XSD attributes. The occurrence definitions refer to the corresponding type in what is often called the “Garden of Eden” style [14]. The GraphML example demonstrates another benefit of data modeling. Multiple XSD design practices are in use, such as element vs. attribute, reference to a global element vs. embedding a local element, and definition of structure via
[Figure content: UML data model for graph structure with classes GraphType (id, edgeDefault), NodeType (id), EdgeType (id, directed), HyperEdgeType (id), PortType (name), and EndpointType (id, type), related by source, target, sourcePort, and targetPort associations]
Fig. 6. UML data model for graph structure — from GraphML
occurrences or types. A model helps with problem understanding by setting aside these arbitrary, confounding design differences and instead focusing on the intrinsic essence of a problem. It is much easier to understand the content and scope of GraphML with the UML model in Fig. 6 than with multiple XSD files.
9 Conclusion The current XSD practice is disappointing. SOA technology is being held back by the lack of rigor with XSD interchange files. Developers work diligently on the logic of individual XSDs but pay little attention to how the XSD files fit together and collectively evolve. The focus is on designing in the small (individual services) rather than designing in the large (collections of services). The current practice is in many ways the antithesis of software engineering. The premise of software engineering is to think deeply about an entire problem, and only then start writing code. Instead, SOA developers are looking only at individual services and overlooking integration issues. SOA practice can be improved by basing XSD files on a data model of the enterprise. There are several benefits to such an approach.
• Global understanding. It is difficult to understand a collection of services by studying each XSD file, one at a time. An enterprise data model gives a comprehensive overview. Each XSD file expresses a subset of the enterprise model.
• Consistency. An enterprise model can align services and their data.
• Communication. A data model provides a concise explanation that is a helpful prelude to a more detailed study of XSD code.
• Extensibility. A broad understanding of an enterprise helps developers determine where to add data and functionality for new services.
• Expressiveness. The UML class model is a more natural representation for data than a hierarchy.
Data modeling is only one of the technologies needed for SOA, but it is one that has been sorely lacking. Data modeling can yield profound insights that reduce the complexity and risks of SOA development. Data modeling can help ensure that services align with the needs of a business and that they scale as deployment ramps up. This paper explains the benefits of an enterprise data model. We have taken some early steps to apply an enterprise data model to services, but do not yet have experimental data to demonstrate an improvement. Such a demonstration would be an important topic of further research.
Acknowledgements We thank Paul Brown, Rod Sprattling, and Patti Lee for their helpful suggestions.
References
1. ACORD Home page, http://www.acord.org/Pages/default.aspx
2. ACORD XSD schema, http://schemas.liquid-technologies.com/LibraryDocs/Accord/Life%20Standards/2.20.01/
3. Atkinson, C., Bostan, P.: The Role of Congregation in Service-Oriented Development. In: PESOS 2009, Vancouver, Canada, May 18-19, pp. 87–90 (2009)
4. Blaha, M.: Patterns of Data Modeling. CRC Press, New York (2010)
5. Blaha, M., Rumbaugh, J.: Object-Oriented Modeling and Design with UML, 2nd edn. Prentice Hall, Upper Saddle River (2005)
6. Brown, P.C.: Implementing SOA. Addison-Wesley, New York (2008)
7. Brown, P.: Personal communication
8. Carey, M.J.: SOA What? IEEE Computer 41(3), 92–94 (2008)
9. Carey, M., Reveliotis, P., Thatte, S., Westmann, T.: Data Service Modeling in the AquaLogic Data Services Platform. IEEE Congress on Services (2008)
10. Digital Weather Home Page, http://www.nws.noaa.gov/ndfd/
11. Digital Weather XSD schema, http://schemas.liquid-technologies.com/LibraryDocs/DWML/0/
12. GraphML Home Page, http://graphml.graphdrawing.org/index.html
13. GraphML XSD schema, http://schemas.liquid-technologies.com/LibraryDocs/GraphML/1.0/
14. Lammel, R., Kitsis, S., Remy, D.: Analysis of XML Schema Usage. In: XML 2005 Conference (2005)
15. Meyer, B.: Applying Design by Contract. IEEE Computer 25(10), 40–51 (1992)
16. OpenTravel Alliance Message Users Guide (June 2009), http://www.opentravel.org/Specifications/Default.aspx
17. Tolk, A., Diallo, S.Y.: Model-Based Data Engineering for Web Services. IEEE Internet Computing, 65–70 (July/August 2005)
Representing Collectives and Their Members in UML Conceptual Models: An Ontological Analysis Giancarlo Guizzardi Ontology and Conceptual Modeling Research Group (NEMO), Federal University of Espírito Santo (UFES), Vitória (ES), Brazil
[email protected] Abstract. In a series of publications, we have employed ontological theories and principles to evaluate and improve the quality of conceptual modeling grammars and models. In this article, we continue this work by conducting an ontological analysis to investigate the proper representation of types whose instances are collectives, as well as the representation of a specific part-whole relation involving them, namely, the member-collective relation. As a result, we provide an ontological interpretation for these notions, as well as modeling guidelines for their sound representation in conceptual modeling. Keywords: representation of collectives and their members, ontological foundations for conceptual modeling, part-whole relations.
1 Introduction In recent years, there has been a growing interest in the application of Foundational Ontologies, i.e., formal ontological theories in the philosophical sense, for providing real-world semantics for conceptual modeling languages, and theoretically sound foundations and methodological guidelines for evaluating and improving the individual models produced using these languages. In a series of publications, we have successfully applied ontological theories and principles to analyze a number of fundamental conceptual modeling constructs, including Roles, Types and Taxonomic Structures, Relations, Attributes, Weak Entities, and Datatypes, among others (e.g., [1-3]). In this article we continue this work by investigating a specific aspect of the representation of part-whole relations. In particular, we focus on the ontological analysis of collectives and of a specific part-whole relation involving them, namely, the member-collective relation. Parthood is a relation of fundamental importance in a number of disciplines including cognitive science [4-6], linguistics [7-8], philosophical ontology [9-11] and conceptual modeling [1-3]. In ontology, a number of different theoretical systems have been proposed over time aiming to capture the formal semantics of parthood (the so-called mereological relations) [9,10]. In conceptual modeling, a number of so-called secondary properties have been used to further qualify these relations. These include distinctions which reflect different relations of ontological dependence, such as the distinction between essential and mandatory parthood [1,2]. Finally, in J. Trujillo et al. (Eds.): ER 2010 Workshops, LNCS 6413, pp. 265–274, 2010. © Springer-Verlag Berlin Heidelberg 2010
G. Guizzardi
linguistic and cognitive science, there is a remarkable trend towards the definition of a typology of part-whole relations (the so-called meronymic relations) depending on the different types of entities they relate [7]. In general, these classifications include the following three types of relations: (i) subquantity-quantity (e.g., alcohol-wine, milk-milk shake): modeling parts of an amount of matter; (ii) component-functional complex (e.g., mitral valve-heart, engine-car): modeling aggregates of components, each of which contributes to the functionality of the whole; (iii) member-collective (e.g., tree-forest, lion-pack, card-deck of cards, brick-pile of bricks). This paper should then be seen as a companion to the publications in [2] and [3]. In the latter, we managed to precisely map the part-whole relation for quantities (the subquantity-quantity relation) to a particular mereological system. Moreover, in that paper, we managed to demonstrate the secondary properties implied by this relation. In a complementary manner, in [2], we exposed the limitations of classical mereology to model the part-whole relations between functional complexes (the component-functional complex relation). Additionally, we also managed to further qualify this relation in terms of the aforementioned secondary properties. The objective of this paper is to follow the same program for the case of the member-collective relation. The remainder of this article is organized as follows. Section 2 reviews the theories put forth by classical mereology and discusses their limitations as theories of conceptual parthood. These limitations include the need for a theory of (integral) wholes to be considered in addition to a theory of parts. In section 3, we discuss collectives as integral wholes and present some modeling consequences of the view defended there.
Moreover, we elaborate on some ontological properties of collectives that differentiate them not only from their sibling categories (quantities and functional complexes), but also from sets (in the set-theoretical sense). The latter aspect is relevant because collectives and the member-collective relation are frequently taken to be identical to sets and the set-membership relation, respectively. In Section 4, we present an ontological analysis of the member-collective relation, clarifying how this relation stands with respect to basic mereological properties (e.g., transitivity, weak supplementation, extensionality) as well as the modal secondary property of essential parthood. As an additional result connected to this analysis, we outline a number of metamodeling constraints that can be used in the implementation of a UML modeling profile for representing collectives and their members in conceptual modeling. Section 5 presents some final considerations.
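The contrast between collectives and sets can be made concrete with a small executable illustration (ours, not the author's; the `transitive_closure` helper and all example names are hypothetical). Set membership is not transitive, whereas classical mereology axiomatizes parthood as transitive — one of the basic mereological properties at stake when collectives are conflated with sets:

```python
# Illustrative sketch (not from the paper): set membership is not transitive.
deck_of_cards = frozenset({"ace", "king"})
box_of_decks = frozenset({deck_of_cards})

assert deck_of_cards in box_of_decks   # the deck is a member of the box
assert "ace" in deck_of_cards          # the ace is a member of the deck
assert "ace" not in box_of_decks       # ...but not a member of the box

# By contrast, classical mereology axiomatizes parthood as transitive:
# from P(a, b) and P(b, c), infer P(a, c).
parthood = {("ace", "deck"), ("deck", "collection")}

def transitive_closure(rel):
    """Smallest transitive relation containing rel (naive fixpoint)."""
    closure = set(rel)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

assert ("ace", "collection") in transitive_closure(parthood)
```

Whether the member-collective relation itself behaves transitively is exactly the kind of question Section 4 examines.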
2 A Review of Formal Part-Whole Theories

2.1 Mereological Theories

In practically all philosophical theories of parts, the relation of (proper) parthood (symbolized as

[Figure: UML profile diagram showing, among its elements, a stereotype PrimEvolOperation with evolType: EvolType and spec: Constraint, and the enumerations EvolType (addClass, addProperty, addPropertyWithValue, ...) and CompType (parallel, sequential)]

Fig. 1. A Profile for Evolution
introduce the features we need to record model edits and relationships between attributes of the two different models. The solid-headed arrow represents the extension of a UML concept or metaclass through a stereotype. In this profile, a Model is simply a collection of model elements, and is itself a named element in an EvolutionModel. Each evolution model is associated with two models, src and tgt, representing the system model before and after the evolutionary step. It is associated also with a single evolution operation, most likely a composite operation of class CompEvolOperation. Each component of this operation relates two models, as source and target; these are two points in a chain of models, leading from the src to the tgt of the overall evolution. A range of primitive operations may be defined, each with a relationship specification, presented as an OCL constraint. It is a simple matter to provide a complete set of primitives: operations that add and delete each kind of model element, together with operations that modify their features. These operations will have a range of parameters, depending upon the kind of element involved. They include, for example:

addClass(name: String)
modifyProperty(class: Class; name: Name; newName: Name;
               newUpper: UnlimitedNatural; newLower: Integer)
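As an illustration only (the Python encoding below is ours, not part of the profile; field names mirror the parameter lists above), the two primitives just named can be sketched as parameterized records:

```python
from dataclasses import dataclass

# Hypothetical encoding of two primitive evolution operations.
@dataclass
class AddClass:
    name: str

@dataclass
class ModifyProperty:
    cls: str        # owning class
    name: str       # current property name
    new_name: str
    new_upper: int  # -1 standing in for UML's unlimited bound (*)
    new_lower: int

op1 = AddClass(name="Registration")
op2 = ModifyProperty(cls="Course", name="date", new_name="startDate",
                     new_upper=1, new_lower=1)
assert op2.new_name == "startDate"
```

A concrete tool would pair each such record with its OCL relationship specification, as the profile prescribes.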
M. Aboulsamh et al.
We can add significant value through the identification of specific evolution patterns and their formalisation as additional, primitive operations. The literature on schema evolution is a rich source of candidates: see for example [2,7,9]. Those that are frequently applied include: the introduction of an association class; the in-lining of a class; and the repositioning of a feature within an inheritance hierarchy. Other useful patterns correspond to compound operations in a language editor, and may be automatically derived from the editor model. For example, we might have a renameProperty term, with just three parameters, for changing the name of a property but leaving all other properties with their current values. We might also have an inlineClass term, corresponding to a combination of evolutionary steps in which a class is deleted and its properties are added to an associated class. Operations that add or modify elements may include expressions that specify new values for properties. For example, a property of a class may be modified by the following primitive operation: modifyPropertyWithValue(class: Class; name: Name; newName: Name; newUpper: UnlimitedNatural; newLower: Integer; newValue: OCLExpression) Here, the OCL expression explains the value of that property in terms of the values of properties in the original model; this does not require the @pre construct of OCL, as the expression will be evaluated in the original context. Evolution operations may be composite or primitive, with the methods of composition described by the enumeration CompType. Two methods are mentioned here: sequential and parallel; the latter being useful if, for example, we wished to exchange the roles of two properties, or if we wished to delete an association together with its properties. 
Using the concrete textual syntax of ; and || for these operators, we can define operations such as inlineClass in terms of their component actions on model elements:

inlineClass(Source,Target,property) =
  ( forall p : Source.properties .
      addPropertyWithValue(Target, p, Source.p.type,
                           Source.p.upper, Source.p.lower, Source.p) ) ;
  deleteClass(Source)

where forall is an additional method in our language, implemented as an iterator over the declared set. If the expressions supplied are computable, in the context of our platform-specific implementations, then the resulting language of transformations contains all of the information that we need to migrate the data against the new model. However, models formulated and maintained for the purpose of model-driven development will inevitably be subject to a range of implicit and explicit constraints. Although many proposed evolutions will involve restructuring and extension of models, many more will involve the introduction of additional constraints, or the modification of constraints already specified.
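The sequential composition operator can be given a simple executable reading. The sketch below is our own Python approximation (dictionaries stand in for models, functions for evolution steps; all names are illustrative), showing an inlineClass-style step built from addPropertyWithValue and deleteClass components:

```python
# Our own sketch: an evolution step is a function from model to model,
# and ';' is sequential function composition.
def seq(*steps):
    def run(model):
        for step in steps:
            model = step(model)
        return model
    return run

def delete_class(name):
    def step(model):
        m = dict(model)
        del m[name]
        return m
    return step

def add_property_with_value(target, prop, value_of):
    def step(model):
        m = {k: dict(v) for k, v in model.items()}
        m[target][prop] = value_of(model)  # value computed in the old context
        return m
    return step

# Approximating inlineClass(Contact, Student, contact) from the paper:
model = {"Student": {"name": "String"},
         "Contact": {"address": "String", "phone": "String"}}

inline_contact = seq(
    add_property_with_value("Student", "contactAddress",
                            lambda m: m["Contact"]["address"]),
    add_property_with_value("Student", "contactPhone",
                            lambda m: m["Contact"]["phone"]),
    delete_class("Contact"),
)

evolved = inline_contact(model)
assert "Contact" not in evolved
assert evolved["Student"]["contactAddress"] == "String"
```

Note that each value expression is evaluated against the model passed into the step, which mirrors the paper's remark that OCL's @pre construct is unnecessary because expressions are evaluated in the original context.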
Model-Driven Data Migration
The simplest, and most common, constraint evolution involves a change to the multiplicity of some property or association: for example, we might decide that a one-to-many association needs instead to be many-to-many, or vice versa. In UML, this will correspond to a change to the upper value associated with one of the properties. More complex constraints may be specified as class or model invariants, describing arbitrary constraints upon the relationships between values of properties across the model. If the conjunction of constraints after the evolutionary step is logically weaker than before, and the model is otherwise unchanged, then there is no doubt as to the feasibility of the corresponding data migration. Whatever data the system currently holds, if it is consistent with the original model, then it should also be consistent with the new model. However, where the conjunction of constraints afterwards is stronger than before, or the evolutionary step involves other changes to structures and values, then data may not fit: that is, the data migration corresponding to the proposed evolution might produce a collection of values that does not properly conform to the new model. It is thus not enough to produce a specification in our language of changes, suitable for automatic translation into a platform-specific implementation. We would wish also to determine, in advance, whether or not this program will succeed in migrating data collected against the old model into a system conforming to the new model: this may be difficult to determine at either the specification or the implementation level, and simply performing the migration and then testing to see whether it has succeeded may not be an acceptable strategy.
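The feasibility argument above can be made concrete. In this hedged Python sketch (our encoding; the record layout and bounds are invented), a strengthened multiplicity constraint is checked against existing data before any migration is attempted:

```python
# Our own illustration: check existing data against the strengthened
# constraint set of the new model before migrating.
def satisfies(data, constraints):
    return all(c(row) for c in constraints for row in data)

students = [{"name": "A", "registrations": 3},
            {"name": "B", "registrations": 7}]

old_constraints = [lambda s: s["registrations"] <= 10]  # original upper bound
new_constraints = [lambda s: s["registrations"] <= 5]   # strengthened bound

assert satisfies(students, old_constraints)      # data fits the old model
assert not satisfies(students, new_constraints)  # migration would not fit
```

A weakened constraint set would make the second check pass trivially, matching the observation that weakening alone never endangers migration.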
3 Example
As an example of how we may apply this approach, we will consider a model of a simple student management system. The class diagram of Fig. 2 describes the information held by the system, including the subject of each course, the name of each student, and the address and phone number of the contact record associated with a student. Each student object is associated with a single contact record and a set of courses, for which they are currently registered. The association between students and courses is bi-directional, and the diagram includes the constraint that the information content in each direction must be consistent: for any course, every student s in the set students must have a reference to that course (self) in the s.registeredFor association; for any student, every course c in the set registeredFor must include a reference to that student (self) in the c.students association. Other constraints of the model, not shown in the diagram, might describe properties that, although not essential to the consistency of the representation, may be important in terms of the external meaning of the data. For example, the following constraint would require that no student should be registered for a course that is due to run before the official start date of their studies: context Student registeredFor -> forall (c | c.date > startDate)
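This last constraint can be mirrored outside OCL. The following Python sketch is our own encoding (the dictionary layout and the helper name are assumptions, not taken from the paper) of the same condition over a single student record:

```python
from datetime import date

# Hypothetical encoding of the OCL constraint above: every course a student
# is registered for must run after the student's official start date.
def no_early_registrations(student):
    return all(c["date"] > student["startDate"]
               for c in student["registeredFor"])

alice = {"startDate": date(2010, 10, 1),
         "registeredFor": [{"date": date(2010, 11, 1)},
                           {"date": date(2010, 9, 1)}]}

# One of alice's courses predates her start date, so the constraint fails.
assert not no_early_registrations(alice)
```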
[Figure: class diagram with classes Student (name, dateOfBirth, startDate), Course (subject, date, register(s:Student)) and Contact (address, phone); each Student is linked to exactly one Contact via contact, and Student and Course are linked by the many-to-many bidirectional association students/registeredFor, annotated with the invariants:]

context Student inv registeredFor -> forall(c | c.students -> includes(self))
context Course inv students -> forall(s | s.registeredFor -> includes(self))

Fig. 2. A simple student management system
while the next would require that no student should be registered for more than one course in the same subject:

context Student
registeredFor -> forall (c, d | c <> d implies c.subject <> d.subject)

If constraints such as these are broken, the data may describe a situation which is undesirable, or even impossible, given our intended use and interpretation of the data. For example, if the constraint startDate > dateOfBirth did not hold, then startDate clearly does not correspond to the date of some formal registration or induction ceremony; either that, or the system contains some incorrect data. It is thus important that these constraints are taken into account in any data migration.

A simple evolution of this model might involve the addition of a property closingDate to the Course class in our example model, with the intention that this should represent the date by which all registrations should be completed. We may specify an initial, default value for this property for all existing courses using the following evolution operation:

addPropertyWithValue(Course,closingDate,Date,1,1,date - 1 week)

This specifies that the closingDate should be one week in advance of the course. We may combine this operation with others to perform a more complex evolution. For example,

inlineClass(Contact,Student,contact) ;
( addPropertyWithValue(Course,closingDate,Date,1,1,date - 1 week)
  || renameProperty(Course,date,startDate) ) ;
addAssociationClass(Registration,Student,registeredFor,Course,students) ;
[Figure: evolved class diagram with classes Student (name, dateOfBirth, startDate, contactAddress, contactPhone) and Course (subject, closingDate, startDate, register(s:Student), confirm(s:Student), cancel(s:Student)), linked by the many-to-many students/registeredFor association carrying the association class Registration (registrationDate, status)]

Fig. 3. A simple student management system, evolved
addPropertyWithValue(Registration,registrationDate,Date,1,1,students.startDate) ;
addPropertyWithValue(Registration,status,RegistrationStatus,1,1,confirmed) ;
addOperation(Course,confirm,s:Student) ;
addOperation(Course,cancel,s:Student)

describes an evolution of Fig. 2 into the model shown in Fig. 3, where the operations inlineClass and addAssociationClass have the obvious interpretations, and the Date and RegistrationStatus parameters to addProperty represent the intended types of the properties being added.

We might decide that the link between Course.date and Student.startDate is inappropriate, and that we should instead insist that no course registrations are made before a student has been admitted to the programme. At the same time, we do wish to insist that all course registrations are made before the closingDate of the course in question. We may achieve this effect by adding the following constraints to our model

context Course
students -> forall (r | r.registrationDate <= closingDate)

context Student
registeredFor -> forall (r | r.registrationDate >= startDate)

in the same evolutionary step that introduces registrationDate. This represents a perfectly reasonable evolution of the model, but it may be that there are Student–Course pairs in the existing data that cannot be successfully migrated. Any student record including a course registration within a week of admission will be mapped to a combination of Student, Registration, and Course that will not satisfy the new constraints: the specified registrationDate will fall after the closingDate of the course.

Using SQL as the language of our platform-specific implementation, with a standard object-to-relational mapping, these evolution operations can be translated automatically to produce the following procedures:
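The failure condition just described can be checked mechanically. The following Python sketch is our own encoding (the function and its parameters are illustrative) of the rule that registrationDate defaults to the student's startDate while closingDate defaults to one week before the course date, testing whether a single Student–Course pair would migrate successfully:

```python
from datetime import date, timedelta

# Our own sketch of the feasibility check: after migration,
# registrationDate (= student's startDate) must not exceed
# closingDate (= course date minus one week).
def migratable(student_start, course_date):
    registration_date = student_start
    closing_date = course_date - timedelta(weeks=1)
    return registration_date <= closing_date

assert migratable(date(2010, 9, 1), date(2010, 11, 8))       # ample time
assert not migratable(date(2010, 11, 5), date(2010, 11, 8))  # within a week
```

Running such a predicate over every existing registration is the specification-level analogue of the SQL count query shown at the end of this section.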
ALTER TABLE student ADD address VARCHAR(150) DEFAULT '' NOT NULL
ALTER TABLE student ADD phone VARCHAR(25) DEFAULT '' NOT NULL
UPDATE student AS TT SET address=(
  SELECT ST.address FROM contact AS ST, ...
  WHERE TT.pk=(SELECT AT.student_contactfk1
               FROM student_contact_contact_student AS AT
               WHERE TT.pk=AT.student_contactfk1) ...
DROP TABLE contact
ALTER TABLE course ADD closingdate DATE NULL
UPDATE course SET closingdate=DATE_ADD(startdate,INTERVAL '-1' WEEK)
ALTER TABLE course CHANGE date startdate DATE NULL
ALTER TABLE student_registeredfor_course_students RENAME TO registration
...
UPDATE registration SET status='confirmed'

The first block, ending in DROP TABLE contact, corresponds to the evolution step inlineClass(Contact,Student,contact). It creates the two new attributes in the student table, copies their values from the contact table, and then deletes the contact table. The remaining SQL corresponds to the addition and removal of properties, and the creation of an association class.

The necessary and sufficient condition for this migration to succeed is given by the constraint identified above—that no student record includes a course registration within a week of admission. This may be implemented automatically as an SQL query:

SELECT COUNT(*) FROM student AS ST, course AS TT,
       student_registeredfor_course_students AS RT
WHERE ST.pk=RT.student_contactfk1
  AND TT.pk=RT.student_contactfk2
  AND ST.startdate
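For readers who want to experiment, the closingDate step above can be reproduced against an in-memory SQLite database (our own sketch; SQLite's date() modifier syntax replaces the DATE_ADD dialect shown above, and the table layout is simplified):

```python
import sqlite3

# Sketch of the closingDate migration step using SQLite:
# add the column, then backfill it one week before the course start date.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE course (pk INTEGER PRIMARY KEY, startdate TEXT)")
con.execute("INSERT INTO course (pk, startdate) VALUES (1, '2010-11-08')")
con.execute("ALTER TABLE course ADD COLUMN closingdate TEXT")
con.execute("UPDATE course SET closingdate = date(startdate, '-7 days')")

row = con.execute("SELECT closingdate FROM course").fetchone()
assert row[0] == '2010-11-01'
con.close()
```

The same pattern of schema change followed by a data-backfill UPDATE is what the generated procedures above perform on the full model.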