Open Access Publications
The electronic edition of this book is not sold and is made available in free access. Every contribution is published according to the terms of “Polimetrica License B”. “Polimetrica License B” gives anyone the possibility to distribute the contents of the work, provided that the authors of the work and the publisher are always recognised and mentioned. It does not allow use of the contents of the work for commercial purposes or for profit. Polimetrica Publisher has the exclusive right to publish and sell the contents of the work in paper and electronic format and by any other means of publication. Additional rights on the contents of the work are the author’s property.
Alessandro Soro, Giuliano Armano, Gavino Paddeu
editors

Distributed Agent-Based Retrieval Tools
Proceedings of the 1st International Workshop

Polimetrica
International Scientific Publisher
2006 Polimetrica ® S.a.s.
Corso Milano, 26
20052 Monza – Milano
Phone ++39.039.2301829
Web site: www.polimetrica.com

Cover project by Arch. Diego Recalcati

ISBN 88-7699-043-7 Printed Edition
ISBN 88-7699-052-6 Electronic Edition
Table of Contents

Preface  7

Semantic Search Engines Based on Data Integration Systems
D. Beneventano, S. Bergamaschi  11

A Generalized Yule Stochastic Process for the Modelling of the Web Graph Growth
G. Concas, M. Locci, M. Marchesi, S. Pinna, I. Turnu  39

Enhancing JSR-179 for Positioning System Integration and Management
P. Bellavista, A. Corradi, C. Giannelli  49

A MultiAgent System for Personalized Press Reviews
A. Addis, G. Cherchi, A. Manconi, E. Vargiu  67

On the Design of a Transcoding Component for a Dynamic Adaptation of Multimedia Resources
E. Lodolo, F. Marcolini, S. Mirri  87

RoamBlog: Outdoor and Indoor Geo-blogging Enhanced with Contextual Service Provisioning for Mobile Internet Users
V. Pallotta, A. Brocco, D. Guinard, P. Bruegger, P. de Almeida  103

Extracting Dependency Relations for Opinion Mining
G. Attardi, M. Simi  123

RAIS - An Information Sharing System for Peer-to-peer Communities
M. Mari, A. Poggi, M. Tomaiuolo  149

On the Value of Query Logs for Modern Information Retrieval
D. Laforenza, C. Lucchese, S. Orlando, R. Perego, D. Puppin, F. Silvestri  165
Preface
Proceedings of the DART 2006 Workshop on “Distributed Agent-based Retrieval Tools: The Future of Search Engines’ Technologies”
Facing an impressive technology challenge

Search engines are still on the leading edge of Internet technologies. The current arena in search engine applications is dominated by large corporations that provide search services based on traditional crawling/indexing techniques, through powerful clusters of servers. The main concern of such corporations is multimedia and personalization, whereas research in this field is looking ahead, in particular to the distribution of information and computational resources, which can be obtained by adopting suitable technologies. In fact, many researchers are concentrating their efforts on the task of supplanting the centralized model of the Web by giving priority to client-based, instead of server-based, applications. In our opinion, this alternative approach may take advantage of P2P information sharing and retrieval. Indeed, several P2P applications exist nowadays, which are responsible for almost half of the Internet bit transfer (the current estimate is thousands of terabytes). Furthermore, recent progress in the fields of Artificial Intelligence and Software Agents should be helpful in devising novel technological platforms able to ensure appropriate support to research in the area of information retrieval. We do believe that a new generation of infrastructures, systems, and algorithms can be developed by resorting to P2P and agent-based
solutions, so that Internet users will be provided with a common set of tools able to take advantage of data and task distribution in retrieving and sharing relevant information.

While information is increasing at an impressive rate, it is still (typically) made available to users through Web interfaces, which are mainly suited to human beings rather than to software applications. Users are accustomed to connecting to many sites characterized by heterogeneous interfaces, as well as to switching among them in order to submit and retrieve preliminary and intermediate data. Integrated access to this huge amount of information requires complex searching and retrieval software. In particular, data/process integration is concerned with how to link data, how to select and extract information, and how to pipe retrieval and analysis steps. Moving from an interactive to an automated approach to managing information requires new automation technologies, tools, and applications.

The DART workshop is aimed at highlighting the cutting edge of search engine technology and applications; at fostering information exchange between researchers working in various fields, such as distributed and pervasive systems, the Semantic Web and Web services, and mobile and personalized services and applications; at comparing their efforts against users' and industry's needs and expectations; and at increasing mutual awareness and cooperation between ongoing projects on advanced search systems around the world. We do hope that the DART 2006 workshop will allow researchers to actively participate in the discussion, to achieve a better understanding of the possibilities offered by current technologies, and to reuse the most relevant concepts and information in the daily activity of each participant. On the occasion of the workshop, CRS4, Tiscali and the University of Cagliari will present the DART project: a distributed architecture for semantic search and personalized content distribution, funded by the Italian Ministry of Research.
A joint effort for a high quality volume

This volume is the result of a joint effort of the Scientific Committee members, speakers, participants and sponsors. We would like to
thank all authors for their active commitment, and all members of the Scientific Committee for their help in keeping the quality of the papers high. Let us also thank all invited speakers for their distinguished keynotes:
• Ricardo Baeza-Yates (Yahoo! Research, Director of Yahoo! Research at Barcelona, Spain, and of Yahoo! Research Latin America at Santiago, Chile): Extraction of Semantic Information from Query Logs
• Douwe Osinga (Google Inc., Software Engineer at Google's European Research Center, Zurich): Google's Vision for Europe
• Antonio Savona (Ask.com, Director of Advanced Search Products): Ask.com and its R&D Center in Italy for Europe
• Pieter van der Linden (Thomson, Technical Director of the Quaero program): Quaero ... Ergo Sum, an Overview of a European program on Multimedia Search and Navigation
The speakers are also expected to contribute significantly to the success of DART 2006, through their engagement and their ability to spread relevant information related to the topics of the workshop. Our sincere thanks go to our sponsors, including the POLARIS science park of Sardinia, for their financial support.
June 2006
Giuliano Armano, DIEE, Univ. of Cagliari Alessandro Soro, CRS4, Polaris Gavino Paddeu, CRS4, Polaris
Semantic Search Engines Based on Data Integration Systems
Domenico Beneventano, Sonia Bergamaschi
Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Modena e Reggio Emilia
Via Vignolese 905, 41100 Modena, Italy
{lastname.firstname}@unimore.it

Abstract
As the use of the World Wide Web has become increasingly widespread, the business of commercial search engines has become a vital and lucrative part of the Web. Search engines are commonplace tools for virtually every user of the Internet, and companies such as Google and Yahoo! have become household names. Semantic Search Engines try to augment and improve traditional Web Search Engines by using not just words, but concepts and logical relationships. In this paper a relevant class of Semantic Search Engines, based on a peer-to-peer, mediator-based data integration architecture, is described. The architectural and functional features are presented with respect to two projects involving the authors, SEWASIE and WISDOM. The methodology to create a two-level ontology and query processing in the SEWASIE project are described.*

* This research has been partially funded by the EU-IST SEWASIE project and the Italian MIUR PRIN WISDOM project. An extended version of this paper will appear in: D. Beneventano, S. Bergamaschi, “Semantic Search Engines based on Data Integration Systems”, in Semantic Web: Theory, Tools and Applications (Ed. Jorge Cardoso), Idea Group Publishing.
1 Introduction
Commercial search engines are mainly based upon human-directed search. Human-directed search engine technology utilizes a database of keywords, concepts, and references. The keyword searches are used to rank pages, but this simplistic method often leads to voluminous irrelevant and spurious results. Google, with its 400 million hits per day and over 4 billion indexed Web pages, is undeniably the most popular commercial search engine used today, but even with Google there are problems. For example, how can you find just the right bit of data that you need out of the ocean of irrelevant results provided? A well-known problem is that traditional Web search engines use keywords that are subject to two linguistic phenomena that strongly degrade a query's precision and recall:
• Polysemy (one word might have several meanings) and
• Synonymy (several words or phrases might designate the same concept).
Precision and recall are classical information retrieval evaluation metrics. Precision is the fraction of a search output that is relevant for a particular query, i.e., the ratio of the number of relevant Web pages retrieved to the total number of Web pages (relevant and irrelevant) retrieved. Recall is the ability of the system to obtain all or most of the relevant pages, i.e., the ratio of the number of relevant Web pages retrieved to the total number of relevant pages in the Web.
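In symbols (a standard formulation of these definitions, added here for clarity, with Rel the set of relevant pages and Ret the set of retrieved pages):

\[
\mathit{precision} = \frac{|Rel \cap Ret|}{|Ret|},
\qquad
\mathit{recall} = \frac{|Rel \cap Ret|}{|Rel|}.
\]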
As Artificial Intelligence (AI) technologies become more powerful, it is reasonable to ask for better search capabilities which can truly respond to detailed requests. This is the intent of semantic-based search engines and agents. A semantic search engine seeks to find documents that have similar concepts, not just similar words. In order for the Web to become a semantic network, it must provide more meaningful metadata about its content, through the use of Resource Description Framework (RDF) (www.w3.org/RDF/) and Web Ontology Language (OWL) (www.w3.org/2004/OWL/) tags, which will help to form the Web into a semantic network. In a semantic network, the meaning of content is better represented, and logical connections are formed between related information. Semantic search methods augment and improve traditional search results by using not just words, but concepts and logical relationships [16][4]. Several systems have been built based on the idea of annotating Web pages with RDF and OWL tags to represent semantics (see Related Work). However, the limitation of these systems is that they can only process Web pages that have already been manually annotated with semantic tags, and it seems unfeasible to annotate the enormous number of existing Web pages. Furthermore, most semantic-based search engines suffer performance problems because of the scale of the very large semantic network: in order for semantic search to be effective in finding responsive results, the network must contain a great deal of relevant information, but, at the same time, a large network creates difficulties in processing the many possible paths to a relevant solution.

The requirements for an intelligent search engine are given by a special class of users: small and medium-sized enterprises (SMEs), which are threatened by globalization. One of the keys to sustainability and success is being able to access information. This could be a cheaper supplier, an innovative working method, a new market, potential clients, partners, sponsors, and so on. Current Internet search tools are inadequate because, even if they are not difficult to use, the search results are often of little use, with their pages and pages of hits. Suppose an SME needs to find out about a topic: a product, a supplier, a fashion trend, a standard, etc. For example, a search is made for fabric dyeing processes for the purpose of finding out about the disposal of the dyeing waste material. A query to www.google.com for fabric dyeing listed 44,600 hits at the time of writing, which related not only to manufacturers of fabric dyeing equipment, but also to the history of dyeing, dyeing technology, and so on. Eventually, a useful contact may be found, and the search can continue for relevant laws and standards concerning waste disposal. But is it the law, or an interpretation of the law? What if the laws are those of a different country, where practices and terminologies are different? Thus, intelligent tools to support the business of SMEs in the Internet age are necessary.

We believe that data integration systems, domain ontologies and peer-to-peer architectures are good ingredients for developing Semantic Search Engines with good performance. In the following, we will provide empirical evidence for our hypothesis. More precisely, we will describe two projects, SEWASIE and WISDOM, which rely on these architectural features and developed key semantic search functionalities.
They both exploit the MOMIS (Mediator EnvirOnment for Multiple Information Sources) data integration system [13][9].
1.1 Data Integration Systems
Data integration is the problem of combining data residing at different autonomous sources, and providing the user with a unified view of these data. The problem of designing Data Integration Systems is important in current real-world applications, and is characterized by a number of issues that are interesting from a theoretical point of view [39]. Data Integration Systems are usually characterized by a classical wrapper/mediator architecture [54] based on a Global Virtual Schema (Global Virtual View, GVV) and a set of data sources. The data sources contain the real data, while the GVV provides a reconciled, integrated, and virtual view of the underlying sources. Modeling the mappings among the sources and the GVV is a crucial aspect. Two basic approaches for specifying the mapping in a Data Integration System have been proposed in the literature: Local-As-View (LAV) and Global-As-View (GAV) [35][51].

The LAV approach is based on the idea that the content of each source should be characterized in terms of a view over the GVV. This idea is effective only when the data integration system is based on a GVV that is stable and well-established in the organization, which constitutes the main limitation of the LAV approach; another negative issue is the complexity of query processing, which needs reasoning techniques. On the other hand, as a positive aspect, the LAV approach favours the extensibility of the system: adding a new source simply means enriching the mapping with a new assertion, without other changes.

The GAV approach is based on the idea that the content of each class of the GVV should be characterized in terms of a view over the sources. GAV favours the system in carrying out query processing, because it tells the system how to use the sources to retrieve data (unfolding). However, extending the system with a new source is now more difficult: the new source may have an impact on the definition of various classes of the GVV, whose associated views need to be redefined.
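As a concrete, if drastically simplified, illustration of the GAV approach and of unfolding, consider the following sketch; the global class and the source tables are invented for this example and are not part of MOMIS:

```python
# A minimal sketch of GAV mappings and query unfolding (illustrative only;
# all class, table and attribute names are hypothetical).

# GAV: each global class is characterized as a view (a set of queries)
# over the local sources.
GAV_MAPPING = {
    "Product": [
        "SELECT code AS id, name, price FROM supplier_a.items",
        "SELECT sku AS id, title AS name, cost AS price FROM supplier_b.catalog",
    ],
}

def unfold(global_class: str) -> list[str]:
    """Unfolding: rewrite a query on a global class into the source
    queries that define it; their union answers the global query."""
    return GAV_MAPPING[global_class]

# Extending the system with a new source means revising the views of the
# affected global classes, which is the GAV drawback noted above.
GAV_MAPPING["Product"].append(
    "SELECT pid AS id, label AS name, amount AS price FROM supplier_c.products"
)
print(unfold("Product"))
```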
MOMIS is a Data Integration System which performs information extraction and integration from both structured and semi-structured data sources. An object-oriented language with an underlying Description Logic, called ODLI3 [13], is introduced for information extraction. Information integration is then performed in a semi-automatic way, by exploiting the knowledge in a Common Thesaurus (defined by the framework) and the ODLI3 descriptions of source schemas, with a combination of clustering techniques and Description Logics. This integration process gives rise to a virtual integrated view of the underlying sources, for which mapping rules and integrity constraints are specified to handle heterogeneity. Given a set of data sources related to a domain, it is thus possible to synthesize a basic domain ontology (the GVV). MOMIS is based on a conventional wrapper/mediator architecture, and provides methods and open tools for data management in Internet-based information systems. MOMIS follows a GAV approach, where the GVV and the mappings among the local sources and the GVV are defined in a semi-automatic way. We also faced the problem of extending the GVV after the insertion of a new source in a semi-automatic way: in [9], we proposed a method to extend a GVV without restarting the integration process from scratch.
1.2 Schema-Based Peer-to-Peer Networks
A new class of P2P networks, so-called schema-based P2P networks, has emerged recently (see [1][36][14][43]), combining approaches from P2P as well as from the data integration and Semantic Web research areas. Such networks build upon peers that use metadata (ontologies) to describe their contents, and semantic mappings among concepts of different peers' ontologies. In particular, in Peer Data Management Systems (PDMS) [36] each node can be a data source, a mediator system, or both; a mediator node performs the semantic integration of a set of information sources to derive a global schema of the acquired information. As stated in a recent survey [6], the topic of semantic grouping and organization of content and information within P2P networks has attracted considerable research attention lately (see, for example, [50][21]). In super-peer networks [55], metadata for a small group of peers is centralized onto a single super-peer; a super-peer is a node that acts as a centralized server to a subset of clients. Clients submit queries to their super-peer and receive results from it; moreover, super-peers are also connected to each other, routing messages over this overlay network, and submitting and answering queries on behalf of their clients and themselves.
The semantic overlay clustering approach, based on partially-centralized (super-peer) networks [42], aims at creating logical layers above the physical network topology, by matching semantic information provided by peers to clusters of nodes based on super-peers.
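The routing role of a super-peer can be pictured with a toy sketch (our own illustration; peer names, topics and the hop budget are invented, and a real network would also track visited peers to avoid cycles):

```python
# Toy super-peer routing: clients register what they can answer with their
# super-peer; queries travel over the super-peer overlay.
from dataclasses import dataclass, field

@dataclass
class SuperPeer:
    name: str
    clients: dict[str, set[str]] = field(default_factory=dict)   # client -> topics
    neighbours: list["SuperPeer"] = field(default_factory=list)  # overlay links

    def route(self, topic: str, hops: int = 2) -> list[str]:
        """Return clients able to answer `topic`, searching this super-peer
        and, within a hop budget, its overlay neighbours."""
        hits = [c for c, topics in self.clients.items() if topic in topics]
        if hops > 0:
            for n in self.neighbours:
                hits += n.route(topic, hops - 1)
        return hits

sp1, sp2 = SuperPeer("sp1"), SuperPeer("sp2")
sp1.neighbours.append(sp2)
sp1.clients["peer_a"] = {"textiles"}
sp2.clients["peer_b"] = {"textiles", "chemicals"}
print(sp1.route("textiles"))  # ['peer_a', 'peer_b']
```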
1.3 The SEWASIE and WISDOM projects
SEWASIE and WISDOM are Semantic Search Engines which follow a super-peer, mediator-based approach and try to face and solve the problems of ontology creation, semi-automatic annotation, and efficient query processing.

The first, SEWASIE (www.sewasie.org), relies on a two-level ontology architecture: the lower level, called the peer level, contains a mediator-based data integration system; the upper level, called the super-peer level, integrates peers with semantically related content (i.e. related to the same domain). A novel approach for defining the ontology of the super-peer and for querying the peer network is introduced. The search engine has been fully implemented in the SEWASIE project prototype exploiting agent technology, i.e. the individual components of the system are implemented as agents [34].

The second, WISDOM (www.dbgroup.unimo.it/wisdom/), is based on an overlay network of semantic peers, each of which contains a mediator-based data integration system. The cardinal idea of the project is to develop a framework that supports a flexible yet efficient integration of semantic content. A key feature is a distributed architecture based on (i) the P2P paradigm and (ii) the adoption of domain ontologies. By means of these ingredients, the integration of information sources is separated into two levels. At the lower level, a strong integration involves the information content of a bunch of sources to form a semantic peer; an ontology describes the (integrated) information offered by a semantic peer. At the upper level, a loose integration among the information offered by a set of semantic peers is provided; namely, a network of peers is built by means of semantic mappings among the ontologies of a set of semantic peers. When a query is posed against one given semantic peer, it is suitably propagated towards other peers along the network of mappings.
The rest of the paper is organized as follows. In section “The SEWASIE Project” the architecture of a semantic search engine, SEWASIE, is described. In section “Building the SEWASIE System Ontology”, a two-level data integration system and the methodology to build a two-level ontology are described. In section “Querying the SEWASIE System”, we briefly introduce the query tool interface and the query reformulation techniques for the two-level data integration system. In section “The WISDOM Project” a semantic search engine which represents an architectural evolution of SEWASIE is described. A section on related work briefly synthesizes other research efforts. Conclusions and an Appendix on the ontology language of MOMIS, ODLI3, conclude the paper.
2 The SEWASIE Project
SEWASIE (SEmantic Web and AgentS in Integrated Economies) implemented an advanced search engine that provides intelligent access to heterogeneous data sources on the Web via semantic enrichment, in order to provide the basis of structured Web-based communication. The prototype provides users with a search client that has an easy-to-use query interface for defining semantic queries. The query is executed by a sophisticated query engine that takes into account the semantic mappings between ontologies and data sources, and extracts the required information from heterogeneous sources. Finally, the result is visualized in a useful and user-friendly format, which allows semantically related clusters to be identified in the documents found.

From an architectural point of view, the prototype is based on agent technology, i.e. the individual components of the system are implemented as agents, which are distributed over the network and communicate with each other using a standardized agent protocol (FIPA). Figure 1 gives an overview of the architecture of SEWASIE. A user accesses the system through a central user interface, which provides tools for query composition, for visualizing and monitoring query results, and for communicating with other business partners about search results, e.g. in electronic negotiations.

SEWASIE Information Nodes (SINodes) are mediator-based systems, providing a virtual view of the information sources managed within an SINode. The system may contain multiple SINodes, each integrating several data sources of an organization. Within an SINode, wrappers are used to extract the data and metadata (local schemas) from the sources. The Ontology Builder, based on the MOMIS framework, is a semi-automatic tool which creates a bootstrap domain ontology by extracting and integrating the local schemata of the sources into a GVV. The GVV is annotated w.r.t. a lexical ontology (WordNet [44]).
The annotated GVV and the mappings to the data source schemas are stored in a metadata repository (SINode ontology in figure 1), and queries expressed in terms of the GVV can be processed by the Query Manager of an SINode.
Figure 1: SEWASIE Architecture

Brokering Agents (BAs) integrate several GVVs from different SINodes into a BA Ontology, which is of central importance to the SEWASIE system. On the one hand, the user formulates queries using this ontology. On the other hand, it is used to guide the Query Agents in selecting the useful SINodes to solve a query. The SEWASIE network may have multiple BAs, each one representing a collection of SINodes for a specific domain. Mappings between different BAs may be established.

A Query Agent receives a query (expressed in terms of a specific BA Ontology) from the user interface; rewrites the query, in cooperation with the BA, in terms of the GVVs of the useful SINodes; and sends the rewritten sub-queries to the involved SINodes. The result integrates the answers to the sub-queries of the SINodes and is stored in a result repository, so that it can be used by the various end-user components.
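The flow just described can be summarized in pseudocode-like Python (object and method names are our placeholders, not the SEWASIE agent interfaces):

```python
# Illustrative control flow of query answering via a Query Agent.

def answer_query(ba_query, brokering_agent, sinodes):
    # 1. In cooperation with the BA, rewrite the query posed on the BA
    #    Ontology into sub-queries over the GVVs of the useful SINodes.
    sub_queries = brokering_agent.rewrite(ba_query)  # {sinode_id: sub_query}
    # 2. Send each sub-query to the Query Manager of the involved SINode.
    partials = [sinodes[sid].execute(q) for sid, q in sub_queries.items()]
    # 3. Integrate the partial answers (here: a duplicate-free union) into a
    #    result that can be stored in the result repository.
    result = []
    for rows in partials:
        for row in rows:
            if row not in result:
                result.append(row)
    return result
```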
For example, Monitoring Agents can be used to store a query result in a permanent repository. A Monitoring Agent will then execute the query repeatedly, and compare the new results with the previous ones. The user is notified if a document that fits his monitoring profile has changed. Furthermore, the Monitoring Agent can link multidimensional OLAP reports with ontology-based information by maintaining a mapping between OLAP models and ontologies. Finally, the Communication Tool provides the means for ontology-based negotiations. It uses query results, the ontologies of the BAs, and specific negotiation ontologies as the basis for a negotiation about a business contract.
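A single step of the monitoring behaviour might look as follows (a sketch under the same caveat: the API is invented):

```python
# One monitoring step: re-run a stored query, compare with the previous
# result, notify the user of changes, and keep the latest result.

def monitoring_step(run_query, query, repository, notify):
    new = set(run_query(query))
    old = repository.get(query, set())  # repository: dict of query -> results
    if new != old:
        notify(added=new - old, removed=old - new)
    repository[query] = new
```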
3 Building the SEWASIE System Ontology
We describe a two-level ontology architecture: the lower level, called the peer level, integrates data sources with semantically close content; the upper level, called the super-peer level, integrates peers with semantically close content. The architecture is shown in figure 2:
• a peer (called SINode) contains a mediator-based data integration system, which integrates heterogeneous data sources into an ontology composed of an annotated GVV, denoted by SINode-GVV, and mappings to the data source schemas;
• a super-peer (called BA, Brokering Agent) contains a mediator-based data integration system, which integrates the GVVs of its peers into an ontology composed of an annotated GVV, denoted by BA-GVV, and mappings to the GVVs of its SINodes.
The BA/SINodes architecture is realized with a two-level data integration system. From an organizational point of view, a two-level architecture gives greater flexibility, as we can integrate both data sources and data integration systems that were already developed in an independent way (on the basis, for example, of sectorial or regional objectives). The mappings m1 (m2) are GAV mappings, and then for each class of the BA-GVV (SINode-GVV) a query QN over the local classes of the sources will be defined.
Figure 2: The Brokering Agent/SINodes architecture
3.1 Ontology Creation with MOMIS
The GVV and the mapping assertions (mappings, for short) have to be defined at design time by the Ontology Designer. This is done by using the Ontology Builder graphical interface, built upon the MOMIS framework. The methodology to build the ontology of an SINode and that of a BA are similar; we first describe the methodology for an SINode, and then discuss the differences for a BA ontology. The methodology is composed of two steps:
1. Ontology Generation: the system detects semantic similarities among the involved source schemas, and automatically generates a GVV and the mappings among the GVV and the local schemata;
2. Mapping Refinement: the Ontology Designer interactively refines and completes the automatic integration result; in particular, the mappings, which have been automatically created by the system, can be fine-tuned, and the query associated with each global class can be defined.

Ontology Generation
The ontology generation process is outlined in figure 3.
1. Extraction of Local Source Schemata: wrappers acquire the schemas of the involved local sources and convert them into ODLI3. Schema descriptions of structured sources (e.g. relational and object-oriented databases) can be directly translated, while the extraction of schemata from semi-structured sources needs suitable techniques, as described in [3]. To perform information extraction and integration from HTML pages (see the paragraph Automatic Data and Metadata Extraction in section “The WISDOM Project”), research and commercial Web data extraction tools, such as ANDES [45], Lixto [8] and RoadRunner [25], have been evaluated and adopted.
Figure 3: The Ontology generation process for an SINode

2. Local Source Annotation: terms denoting schema elements in the data sources are semantically annotated. The Designer can manually choose the appropriate WordNet meaning(s) for each term, and/or perform an automatic annotation which associates to each term the first meaning in WordNet (a sketch of this baseline is given after this list).
3. Common Thesaurus Generation: starting from the annotated local schemata, MOMIS builds a Common Thesaurus that describes intra- and inter-schema knowledge in the form of synonym, broader-term/narrower-term, meronymy/holonymy, equivalence and generalization relationships. The Common Thesaurus is built incrementally by adding schema-derived relationships (automatic extraction of intra-schema relationships from each schema separately), lexicon-derived relationships (inter-schema lexical relationships derived from the annotated sources and the interaction with WordNet), designer-supplied relationships (capturing specific domain knowledge) and inferred relationships (via Description Logics equivalence and subsumption computation).
4. GVV Generation: starting from the Common Thesaurus and the local source schemata, MOMIS generates a GVV consisting of a set of global classes, plus mappings connecting the global attributes of each global class with the local sources' attributes. Going into detail, GVV generation is a process where ODLI3 classes describing the same or semantically related concepts in different sources are identified and clustered into the same global class. The Ontology Designer may interactively refine and complete the proposed integration results; in particular, the mappings which have been automatically created by the system can be fine-tuned, as discussed below (Mapping Refinement).
5. GVV Annotation: the GVV is semi-automatically annotated, i.e. each of its elements is associated with the meanings extracted from the annotated sources. The GVV annotation will be exploited in the BA Ontology building process; moreover, it can be useful to make the domain ontology available to external users and applications [9].
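The automatic baseline of step 2 (take the first WordNet meaning of each term) can be sketched as follows; the use of NLTK is our choice for illustration, not necessarily the library used in MOMIS:

```python
# First-sense annotation baseline: associate to each schema term its first
# WordNet meaning. Requires: pip install nltk, then nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def annotate(term: str):
    senses = wn.synsets(term)
    return senses[0] if senses else None  # None: left for manual annotation

print(annotate("fabric"))  # e.g. Synset('fabric.n.01')
```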
Mapping Refinement
The system automatically generates a Mapping Table (MT) for each global class C of a GVV; its columns represent the local classes L(C) belonging to C, and its rows represent the global attributes of C. An element MT[GA][LC] represents the set of local attributes of LC which are mapped onto the global attribute GA. The query associated with a global class C is implicitly defined by the Ontology Designer starting from the MT of C. The Ontology Designer can extend the MT by adding Data Conversion Functions, Join Conditions and Resolution Functions.
Data Conversion Functions. The Ontology Designer can define, for each non-null element MT[GA][L], a Data Conversion Function, denoted by MTF[GA][L], which represents the mapping of the local attributes of L into the global attribute GA.
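For illustration, a Mapping Table for a hypothetical global class Product over two local classes could be rendered as the following structure (all class and attribute names are invented; this is not MOMIS syntax):

```python
# MT[GA][LC] = set of local attributes of local class LC mapped onto the
# global attribute GA (invented names, for illustration only).
MT = {
    "name":  {"supplier_a.items": {"name"},  "supplier_b.catalog": {"title"}},
    "price": {"supplier_a.items": {"price"}, "supplier_b.catalog": {"cost"}},
    "code":  {"supplier_a.items": {"code"},  "supplier_b.catalog": {"sku"}},
}
# A Data Conversion Function could then be attached to a non-null entry,
# e.g. converting supplier_b's prices from another currency.
```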
Join Conditions. Merging data from different sources requires different instantiations of the same real-world object to be identified; this process is called object identification [46]. Object identification is currently a very active research area, with significant contributions from both the artificial intelligence [49] and database [5][24] communities. To identify instances of the same object and fuse them, we introduce Join Conditions among pairs of local classes belonging to the same global class.
Resolution Functions. The fusion of data coming from different sources, taking into account the problem of inconsistent information among sources, is a hot research topic [31][15][32][46][41]. In MOMIS the approach proposed in [46] has been adopted: a Resolution Function for solving data conflicts may be defined for each global attribute mapped onto local attributes coming from more than one local source. If the designer knows that there are no data conflicts for a global attribute mapped onto more than one source (that is, the instances of the same real object in different local classes have the same value for this common attribute), he can define this attribute as a Homogeneous Attribute.
On the basis of the resulting MT, the system automatically generates a query QN associated with C by extending the Full Disjunction (FD) operator [30], which has been recognized as providing a natural semantics for data-merging queries [47]. QN is defined in such a way that it contains a unique tuple resulting from the merge of all the different tuples representing the same real-world object. This problem is related to that of computing the natural outer-join of many relations in a way that preserves all possible connections among facts [47]; such a computation was termed Full Disjunction by Galindo-Legaria [30]. Finally, QN is obtained by applying the Resolution Functions to the attributes resulting from the FD expression FDExpr: for a global attribute GA, we apply the related Resolution Function to T(L1).GA, T(L2).GA, ..., T(Lk).GA; the resulting query QN is called an FDQuery.
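A toy rendering of this fusion step (our sketch; the attribute names and the choice of resolution policies are invented examples, not MOMIS defaults):

```python
# Fuse tuples identified as the same real-world object, resolving conflicts
# attribute by attribute via per-attribute resolution functions.
RESOLUTION = {
    "name":  lambda values: values[0],  # e.g. trust the first source
    "price": min,                       # e.g. keep the lowest quoted price
}

def fuse(tuples):
    fused = {}
    for attr, resolve in RESOLUTION.items():
        values = [t[attr] for t in tuples if t.get(attr) is not None]
        fused[attr] = resolve(values) if values else None
    return fused

print(fuse([{"name": "dye A", "price": 12.0}, {"name": "dye A1", "price": 11.5}]))
# -> {'name': 'dye A', 'price': 11.5}
```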
3.2 The Brokering Agent Ontology
A first version of the BA ontology is bootstrapped using the same techniques applied for an SINode. The BA-GVV generation process is performed starting from step 3, since extraction and annotation of the local schemata need not be done (see figure 3): in this case we integrate SINode-GVVs, which are already annotated ODLI3 schemata. The BA-GVV has to be refined and enriched, as it represents the interface of the BA: relationships with other BAs and other agents (see figure 1) are built through the BA ontology.
Also, we foresee that users will interact with the system (i.e. query and browse information) by means of the BA ontology. Therefore, a more sophisticated ontology design tool was developed, starting from the i.com tool [29]; the tool provides the usual features of ontology editors and, in addition, is connected to a reasoner which enables consistency checks and the deduction of implicit relationships. Finally, the translation of the BA-GVV into Semantic Web standards for ontologies, such as OWL, is a straightforward process (see APPENDIX I).
4 Querying the SEWASIE System
In this section we briefly describe the query tool interface that supports the user in formulating a query, and the query reformulation process for the two-level data integration system.

Query Formulation. The SEWASIE Query Interface assists users in formulating queries. A query can be composed interactively by browsing the ontology in a tree-like structure and selecting relevant items for the query [22]. The query interface is intelligent, in the sense that it contains an online reasoning functionality: it allows only combinations of items which are consistent w.r.t. the ontology. The underpinning technologies and techniques enabling the query user interface are described in [28].

Query Reformulation. Query reformulation takes into account the two different levels of mappings (figure 2): in [19][11] it is proved that, if m1 and m2 are GAV mappings, the overall mapping is the composition of m1 and m2; this implies that query answering can be carried out in two reformulation steps, 1) reformulation w.r.t. the BA Ontology and 2) reformulation w.r.t. the SINode Ontology, as sketched after the list below. These two reformulation steps are similar, and each consists of:
1. Query expansion: the query posed in terms of the GVV is expanded to take into account the explicit and implicit constraints: all constraints in the GVV are compiled into the expansion, so that the expanded query can be processed by ignoring constraints. Then the atoms (i.e. subqueries referring to a single global class) are extracted from the expanded query.
2. Query unfolding: the atoms in the expanded query are unfolded by taking into account the mappings M between the GVV and the local sources in N.
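A compact way to picture the two-level reformulation (all objects and methods are placeholders standing for the expansion and unfolding steps above; this is not SEWASIE code):

```python
# Two-level query reformulation: first w.r.t. the BA Ontology (mapping m1),
# then w.r.t. the ontology of each involved SINode (mapping m2).

def reformulate(ba_query, ba, sinodes):
    # Level 1: expansion + unfolding against the BA Ontology yields one
    # sub-query per useful SINode.
    sub_queries = ba.unfold(ba.expand(ba_query))  # {sinode_id: sub_query}
    # Level 2: each sub-query is expanded and unfolded against the SINode
    # ontology, down to queries over the local sources.
    return {sid: sinodes[sid].unfold(sinodes[sid].expand(q))
            for sid, q in sub_queries.items()}
```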
5 The WISDOM Project
The WISDOM (Web Intelligent Search based on DOMain ontologies) project aims at studying, developing and experimenting with methods and techniques for searching and querying information sources available on the Web. The main goal of the project is the definition of a software framework that allows computer applications to leverage the huge amount of information content offered by Web sources (typically, Web sites). The project assumes that the number of sources of interest might be extremely large, and that sources are independent and autonomous of each other. These factors raise significant issues, in particular because such an information space implies heterogeneities at different levels of abstraction (format, logical, semantic). Providing effective and efficient methods for answering queries in such a scenario is the challenging task of the project. WISDOM represents an architectural evolution w.r.t. SEWASIE, as:
• data extraction from large Web sites will be performed automatically, by generating wrappers [25];
• a loose integration among the information offered by a network of semantic peers is provided;
• semantic mappings among the ontologies of semantic peers will be discovered;
• peer-to-peer query processing will be performed on the basis of the inter-peer mappings, which are enriched by content summaries providing quantitative information;
• peer-to-peer query processing takes into account quantitative information to implement execution strategies able to quickly compute the “best” answers to a query.

WISDOM architecture
The WISDOM architecture follows a schema-based super-peer network architecture. An overlay network of semantic peers is built in order to allow information of interest to be retrieved even outside the semantic peer that received the query. Figure 4 illustrates the main architectural elements that frame such a network.
with every semantic peer an ontology Ont_i, which describes the information offered by the semantic peer itself. A network of semantic peers is thus built by defining mappings between the ontologies of a set of semantic peers.
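A minimal sketch of this structure is given below; the class layout, peer names, and concept names are hypothetical, meant only to make the roles of ontologies and mappings concrete:

# Hypothetical sketch of a WISDOM-style semantic peer network.
from dataclasses import dataclass, field

@dataclass
class SemanticPeer:
    name: str
    ontology: set                                  # concept names of Ont_i
    mappings: dict = field(default_factory=dict)   # peer name -> {concept_i: concept_j}

p1 = SemanticPeer("P1", {"Enterprise", "Product"})
p2 = SemanticPeer("P2", {"Company", "Item"})

# A peer-to-peer mapping M_1,2 relates concepts of Ont_1 to concepts of Ont_2,
# enabling a query over Ont_1 to be rewritten over Ont_2.
p1.mappings["P2"] = {"Enterprise": "Company", "Product": "Item"}

def rewrite(query_concepts, source, target):
    """Rewrite a list of source-ontology concepts over the target ontology."""
    m = source.mappings.get(target, {})
    return [m[c] for c in query_concepts if c in m]

print(rewrite(["Enterprise"], p1, "P2"))   # ['Company']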
Figure 4: WISDOM Semantic Peer Network

Semantic Peer

A Semantic Peer contains a data integration system, which integrates heterogeneous data sources into an ontology composed of an annotated GVV and mappings to the data source schemas. In particular, the WISDOM project faces the problem of wrapping large Web sites. A large number of Web sites contain highly structured regions. These sites represent rich and up-to-date information sources, which could be used to populate WISDOM semantic peers. However, since they mainly deliver data through intricate hypertext collections of HTML documents, it is not easy to access and compute over their data. To overcome this issue, several researchers have recently developed techniques to automatically infer Web wrappers [7][23][25][53], i.e., programs that extract data from HTML pages and transform them into a machine-processable format, typically XML. The developed techniques are based on the observation that many Web sites contain large collections of structurally similar pages: taking as input a small set of sample pages exhibiting a common template, it is now possible to generate as output a wrapper that extracts data from any page sharing the same structure as the input samples. These proposals represent an important step towards the automatic extraction of data from Web data sources. However, as argued in [7][25], intriguing issues arise when scaling up from the single collection
of pages to whole sites. The main problems, which significantly affect the scalability of the wrapper approach, are how to identify the structured regions of the target site and how to collect the sample pages to feed the wrapper generation process. Presently, these tasks are done manually. To overcome this limitation, the project investigates techniques that address these issues, making it feasible to automatically extract data from large data-intensive Web sites. [48] introduces an approach to generating ontologies based on table analysis. Based on conceptual-modeling extraction techniques, this approach attempts to (i) understand a table’s structure and conceptual content; (ii) discover the constraints that hold between concepts extracted from the table; (iii) match the recognized concepts with ones from a more general specification of related concepts; and (iv) merge the resulting structure with other similar knowledge representations.

Peer-to-Peer Mapping and Query Processing

To frame a network of semantic peers we adopt peer-to-peer mappings: a semantic peer-to-peer mapping, denoted M_i,j, is a relationship between the ontology Ont_i of the semantic peer P_i and the ontology Ont_j of the semantic peer P_j. Intuitively, a mapping M_i,j allows the rewriting of a query posed against the ontology Ont_i into a query over the ontology Ont_j. Mappings will be computed by extending the methodology for p2p semantic coordination presented in [17]. The main idea is that discovering mappings across ontologies requires a combination of lexical knowledge, world knowledge and structural information (how concepts are arranged in a specific ontology). This information is used in a methodology called semantic elicitation, which builds a formal representation of the information represented by a concept in an ontology (or, even more frequently, in “lightweight ontologies”, such as taxonomies or classification schemas). This formal representation is then used to infer relations with formal objects in other ontologies by automated reasoning systems (e.g., SAT solvers, description logic reasoners, and so on); the choice of the system depends on the expressiveness of the representation built during the elicitation phase. By means of peer-to-peer mappings, a query received by a given peer can ideally be extended to every peer for which a mapping is defined. However, it is not always convenient to propagate a query to every peer for which a mapping exists. For example, it can be inefficient to include in the query processing peers having a limited extension of the
concepts involved by the query. To overcome this issue, every peer-to-peer mapping has an associated content summary. Content summaries provide a “profile” of a data source by means of quantitative information; peer-to-peer query processing takes such quantitative information into account to implement execution strategies able to quickly compute the “best” answers to a query. Given a pair of semantic peers for which a peer-to-peer mapping exists, the content summary associated with the mapping provides quantitative information about the extension of the concepts in the source ontology that can be found through the mapping in the target semantic peer. A simple example of the information provided by a content summary is the cardinality, in the target peer, of the concepts of the source ontology. More details about wrapping large Web sites and about optimizing the distribution of queries by means of content summaries are provided in [12] (a minimal sketch of content-summary-driven peer selection is given at the end of this section).

Browsing the results

Visualizing the results of a WISDOM query faces a common problem, that is, guaranteeing a satisfactory compromise between expressivity and domain-independence when visualizing and navigating RDF-like graphs. Here expressivity is meant as the capability of delivering an intuitive representation of knowledge and some tailored navigation primitives to end-users working in a given application domain, while domain-independence aims at a high degree of reusability. Most existing tools, such as KAON [52] and WebOnto [27], favour domain-independence and represent entities in a way that is closer to the abstract form used to formally define them. This is familiar to knowledge engineers but not to domain experts. Indeed, though domain-specific formalisms have a lower degree of reusability, they provide graphically richer constructs allowing for a representation that is closer to how entities appear in the application domain. The approach developed in WISDOM to address this issue is to build a flexible framework in which reusable components realize the domain-independent tasks in generating a friendly presentation of a piece of knowledge [33].
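As anticipated above, here is a minimal sketch of content-summary-driven peer selection; the peers, cardinalities, and pruning threshold are invented for illustration:

# Hypothetical sketch: among the peers reachable through a mapping, the query
# is propagated first to those whose content summaries report the largest
# extensions of the queried concept; peers with negligible extensions are pruned.
content_summaries = {
    # (source peer, target peer) -> {concept in the source ontology:
    #   cardinality of its extension reachable in the target peer}
    ("P1", "P2"): {"Enterprise": 12000, "Product": 300},
    ("P1", "P3"): {"Enterprise": 40, "Product": 9500},
}

def rank_targets(source, concept, min_cardinality=100):
    """Return target peers for `source`, best first, skipping peers whose
    extension of `concept` is too small to be worth querying."""
    scored = [(summary.get(concept, 0), tgt)
              for (src, tgt), summary in content_summaries.items()
              if src == source]
    return [tgt for card, tgt in sorted(scored, reverse=True)
            if card >= min_cardinality]

print(rank_targets("P1", "Enterprise"))  # ['P2']  (P3 is pruned: only 40)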
6 Related Work
Several projects have been developed in the area of the semantic Web and semantic search engines.
Semantic Search Engines Exploiting Annotation of Web Pages

The SHOE (Simple HTML Ontology Extension) project, begun in 1995, was one of the first efforts to explore languages and tools that enable machine-understandable Web pages [38][37]. In the SHOE framework, HTML pages are annotated via ontologies to support information retrieval based on semantic information; the annotated pages are gathered into a knowledge base by a Web crawler; the knowledge base can be queried using the ontology as a schema for query formulation. The major challenge of this project is designing a query tool that can exploit the power of a knowledge base (SHOE uses description logics as its basic representation formalism) while still being simple enough for the casual user; when a query returns very few or no results, the tool can automatically convert the formal query into a suitable query string for a Web search engine to find relevant pages. A major problem of the SHOE search tool is that it limits search to one class, and it is hard to specify query conditions on multiple nodes at the same time. For example, a query such as ‘find people whose research group is in a specific department’ cannot be specified. OntoBroker [26] is in many ways similar to SHOE and allows the annotation of Web pages with ontological metadata. A broker architecture with three core elements has been implemented: a query interface for formulating queries, an inference engine used to derive answers, and a Web crawler used to collect the required knowledge from the Web. OntoBroker provides a more expressive framework for ontologies, using Frame Logic for the specification of ontologies, annotation data and queries. It includes a query interface and a hyperbolic viewer, which implements a visualization technique allowing quick navigation in a large ontology [26]; searching with complicated conditions is however limited.

P2P Architecture for Semantic Search Engines

Due to the scale of the ever-growing Web, classic centralized models and algorithms can no longer meet the requirements of a search system for the whole Web. Decentralization seems to be an attractive alternative; consequently, Web retrieval has received growing attention in the area of peer-to-peer systems. Decentralization of Web retrieval methods, in particular of text-based retrieval and link-based ranking as used in standard Web search engines, has become the subject of intensive research. This allows both to distribute the computational effort for more scalable solutions and to share different interpretations of the
Web content to support personalized and context-dependent search. In [2] a review of existing studies about the algorithmic feasibility of realizing peer-to-peer Web search using text- and link-based retrieval methods is presented, and a common framework is described, consisting of an architecture for peer-to-peer information retrieval and a logical framework for distributed ranking computation that enables the interoperability of peers using different peer-to-peer search methods. ALVIS [18] is an ongoing research project funded by the European Community under the IST programme. Its goal is the design, use and interoperability of topic-specific search engines, with the aim of developing an open-source prototype of a peer-to-peer, semantic-based search engine. Existing search engines provide poor foundations for semantic-based Web operations, and are becoming monopolies, distorting the entire information landscape. The ALVIS approach is not the traditional Semantic Web approach with coded metadata, but rather an engine that can build on content through semi-automatic analysis.
7 Conclusions
In this paper we discussed some ingredients for developing Semantic Search Engines based on Data Integration Systems and peer-to-peer architectures. With reference to Data Integration Systems, we refer to the list outlined in the invited tutorial on “Information Integration” by Lenzerini [40] to point out the topics covered by the paper. First of all, the strength of the proposed approach is a solution to the problems of “How to construct the global schema (GVV)” and “How to model the mappings between the sources and the global schema”. MOMIS follows a semantic approach to perform extraction and integration from both structured and semi-structured data sources; here “semantic” means semantic annotation w.r.t. a lexical ontology and a Description Logics background. MOMIS follows a GAV approach where the GVV and the mappings among the local sources and the GVV are defined in a semi-automatic way. Regarding the problem of “Data extraction, cleaning and reconciliation”, we adopted some ad-hoc solutions, such as Data Conversion Functions, Join Conditions and Resolution Functions. For more general solutions and a deeper discussion, the reader may refer to the bibliography given in section “Building the SEWASIE system Ontology”.
Regarding “The querying problem: How to answer queries expressed on the global schema”, we overviewed the major aspects involved in querying the system, i.e., the query building and the query reformulation process (for the two-level data integration system). Regarding the problem of “(Automatic) source wrapping”, we discussed the extraction of local structured sources, for which we implemented wrappers. In WISDOM, data extraction from large Web sites will be performed automatically by generating wrappers [25].
Bibliography

[1] Aberer, K., Cudré-Mauroux, P., & Hauswirth, M. [2003] The chatty Web: emergent semantics through gossiping. In ACM WWW Conference, pp. 197–206.
[2] Aberer, K., & Wu, J. [2005] Towards a common framework for peer-to-peer Web retrieval. Vol. 3379 of LNCS, Springer, pp. 138–151.
[3] Abiteboul, S., Buneman, P., & Suciu, D. [2000] Data on the Web: From Relations to Semistructured Data and XML. Data Management Systems, Morgan Kaufmann.
[4] Alesso, H.P. [2002] Semantic Search Technology. http://www.sigsemis.org/columns/swsearch/SSE1104
[5] Ananthakrishna, R., Chaudhuri, S., & Ganti, V. [2002] Eliminating fuzzy duplicates in data warehouses. In VLDB Conference, pp. 586–597.
[6] Androutsellis-Theotokis, S., & Spinellis, D. [2004] A survey of peer-to-peer content distribution technologies. ACM Computing Surveys, 36 (4), pp. 335–371.
[7] Arasu, A., & Garcia-Molina, H. [2003] Extracting structured data from Web pages. In ACM SIGMOD Conference, pp. 337–348.
[8] Baumgartner, R., Flesca, S., & Gottlob, G. [2001] Visual Web information extraction with Lixto. In VLDB Conference, pp. 119–128.
[9] Beneventano, D., Bergamaschi, S., Guerra, F., & Vincini, M. [2003] Synthesizing an integrated ontology. IEEE Internet Computing, 7 (5), pp. 42–51.
[10] Beneventano, D., Bergamaschi, S., Lodi, S., & Sartori, C. [1998] Consistency Checking in Complex Object Database Schemata with Integrity Constraints. IEEE Trans. Knowledge and Data Eng., 10 (4), pp. 576–598.
[11] Beneventano, D., & Lenzerini, M. [2005] Final release of the system prototype for query management. Sewasie Deliverable D.3.5, http://www.dbgroup.unimo.it/prototipo/paper/D3.5final.pdf
[12] Bergamaschi, S., Bouquet, P., Ciaccia, P., & Merialdo, P. [2005] Speaking words of wisdom: Web intelligent search based on domain ontologies. In Italian Semantic Web Workshop.
[13] Bergamaschi, S., Castano, S., Vincini, M., & Beneventano, D. [2001] Semantic integration of heterogeneous information sources. Data Knowl. Eng., 36 (3), pp. 215–249.
[14] Bernstein, P. A., Giunchiglia, F., Kementsietsidis, A., Mylopoulos, J., Serafini, L., & Zaihrayeu, I. [2002] Data management for peer-to-peer computing: A vision. In WebDB Workshop, pp. 89–94.
[15] Bertossi, L. E., & Chomicki, J. [2003] Query answering in inconsistent databases. In J. Chomicki, R. van der Meyden, & G. Saake (Eds.), Logics for Emerging Applications of Databases, pp. 43–83. Springer.
[16] Boley, H. [2002] The semantic Web in ten passages. http://www.dfki.uni-kl.de/~boley/sw10pass/sw10pass-en.htm
[17] Bouquet, P., Serafini, L., & Zanobini, S. [2005] Peer-to-peer semantic coordination. Journal of Web Semantics, 2 (1).
[18] Buntine, W. L., & Taylor, M. P. [2004] ALVIS: Superpeer semantic search engine. In EWIMT Workshop.
[19] Calì, A., Calvanese, D., De Giacomo, G., & Lenzerini, M. [2004] Data integration under integrity constraints. Inf. Syst., 29 (2), pp. 147–163.
[20] Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., & Rosati, R. [2004] What to ask to a peer: Ontology-based query reformulation. In KR Conference, pp. 469–478.
[21] Castano, S., Ferrara, A., Montanelli, S., Pagani, E., & Rossi, G. [2003] Ontology-addressable contents in p2p networks. In Semantics in Peer-to-Peer and Grid Computing Workshop, Budapest, Hungary.
[22] Catarci, T., Dongilli, P., Mascio, T. D., Franconi, E., Santucci, G., & Tessaris, S. [2004] An ontology based visual tool for query formulation support. In ECAI Conference, pp. 308–312.
[23] Chang, C.-H., & Lui, S.-C. [2001] IEPAD: Information extraction based on pattern discovery. In WWW Conference, pp. 681–688.
[24] Chaudhuri, S., Ganjam, K., Ganti, V., & Motwani, R. [2003] Robust and efficient fuzzy match for online data cleaning. In ACM SIGMOD Conference, pp. 313–324.
[25] Crescenzi, V., Mecca, G., & Merialdo, P. [2001] RoadRunner: Towards automatic data extraction from large Web sites. In VLDB Conference, pp. 109–118.
[26] Decker, S., Erdmann, M., Fensel, D., & Studer, R. [1999] Ontobroker: Ontology based access to distributed and semi-structured information. In IFIP Conference, pp. 351–369.
[27] Domingue, J., Motta, E., & Garcia, O. [1999] Knowledge Modelling in WebOnto and OCML: A User Guide. Knowledge Media Institute, Milton Keynes, UK.
[28] Dongilli, P., Fillottrani, P. R., Franconi, E., & Tessaris, S. [2005] A multi-agent system for querying heterogeneous data sources with ontologies. In SEBD Italian Symposium, pp. 75–86.
[29] Franconi, E., & Ng, G. [2000] The i.com tool for intelligent conceptual modeling. In KRDB Workshop, pp. 45–53.
[30] Galindo-Legaria, C. A. [1994] Outerjoins as disjunctions. In ACM SIGMOD Conference, pp. 348–358.
[31] De Giacomo, G., Lembo, D., Lenzerini, M., & Rosati, R. [2004] Tackling inconsistencies in data integration through source preferences. In ACM IQIS Workshop, pp. 27–34.
[32] Greco, G., Greco, S., & Zumpano, E. [2003] A logical framework for querying and repairing inconsistent databases. IEEE Trans. Knowl. Data Eng., 15 (6), pp. 1389–1408.
[33] Golfarelli, M., Proli, A., & Rizzi, S. [2006] M-FIRE: A Metaphor-based Framework for Information Representation and Exploration. In WEBIST Conference.
[34] Gong, L. [2001] Industry Report: JXTA: A Network Programming Environment. IEEE Internet Computing, 5 (3), pp. 88–95.
[35] Halevy, A. Y. [2001] Answering queries using views: A survey. VLDB Journal, 10 (4), pp. 270–294.
[36] Halevy, A. Y., Ives, Z. G., Madhavan, J., Mork, P., Suciu, D., & Tatarinov, I. [2004] The Piazza peer data management system. IEEE Trans. Knowl. Data Eng., 16 (7), pp. 787–798.
[37] Heflin, J., Hendler, J. A., & Luke, S. [2003] SHOE: A blueprint for the semantic Web. In Spinning the Semantic Web, pp. 29–63. MIT Press.
[38] Hendler, J., & Heflin, J. [2000] Searching the Web with SHOE. In AAAI Workshop on Artificial Intelligence for Web Search, pp. 35–40.
[39] Lenzerini, M. [2002] Data Integration: A Theoretical Perspective. In ACM PODS Conference, pp. 233–246.
[40] Lenzerini, M. [2003] Intelligent Information Integration. Tutorial at the IJCAI Conference.
[41] Lin, J., & Mendelzon, A. O. [1998] Merging databases under constraints. Int. Journal of Cooperative Inf. Syst., 7 (1), pp. 55–76.
[42] Löser, A., Naumann, F., Siberski, W., Nejdl, W., & Thaden, U. [2003a] Semantic overlay clusters within super-peer networks. Vol. 2944 of LNCS, Springer, pp. 33–47.
[43] Löser, A., Siberski, W., Wolpers, M., & Nejdl, W. [2003b] Information integration in schema-based peer-to-peer networks. Vol. 2681 of LNCS, Springer, pp. 258–272.
[44] Miller, G. A. [1995] WordNet: A Lexical Database for English. Communications of the ACM, 38 (11), pp. 39–41.
[45] Myllymaki, J. [2002] Effective Web data extraction with standard XML technologies. Computer Networks, 39 (5), pp. 635–644.
[46] Naumann, F., & Haussler, M. [2002] Declarative data merging with conflict resolution. In MIT-IQ Conference, pp. 212–224.
[47] Rajaraman, A., & Ullman, J. D. [1996] Integrating information by outerjoins and full disjunctions. In ACM PODS Conference, pp. 238–248.
[48] Tijerino, Y., Embley, D., Lonsdale, D., Ding, Y., & Nagy, G. [2005] Towards Ontology Generation from Tables. WWW Journal, 7 (3), pp. 261–285.
[49] Tejada, S., Knoblock, C. A., & Minton, S. [2001] Learning object identification rules for information integration. Inf. Syst., 26 (8), pp. 607–633.
[50] Tempich, C., van Harmelen, F., Broekstra, J., Ehrig, M., Sabou, M., Haase, P., Siebes, R., & Staab, S. [2003] A metadata model for semantics-based peer-to-peer systems. In SemPGRID Workshop.
[51] Ullman, J. D. [1997] Information integration using logical views. In ICDT Conference, pp. 19–40.
[52] Volz, R., Oberle, D., Staab, S., & Motik, B. [2003] KAON SERVER - A Semantic Web Management System. In WWW Conference.
[53] Wang, J., & Lochovsky, F. [2002] Data-rich section extraction from HTML pages. In WISE Conference, pp. 313–322.
[54] Wiederhold, G. [1993] Intelligent integration of information. In ACM SIGMOD Conference, pp. 434–437.
[55] Yang, B., & Garcia-Molina, H. [2003] Designing a super-peer network. In U. Dayal, K. Ramamritham, & T. M. Vijayaraman (Eds.), IEEE ICDE Conference, pp. 49–56.
APPENDIX I - The ODLI3 language
ODLI3 is an extension of the object-oriented language ODL (Object Definition Language; www.service-architecture.com/database/articles/odmg_3_0.html), a specification language used to define the interfaces to object types conforming to the object model of the ODMG (Object Database Management Group, a consortium of companies defining standards for object databases; www.odmg.org); it is introduced for information extraction. ODLI3 extends ODL with constructors, rules and relationships useful in the integration process, both to handle the heterogeneity of the sources and to represent the GVV. In particular, ODLI3 extends ODL with the following relationships that express intra- and inter-schema knowledge for source schemas:

Intensional relationships. They are terminological relationships defined between classes and attributes, and are specified by considering class/attribute names, called terms:
• Synonym of (SYN) relationships are defined between two terms ti and tj that are synonyms;
• Broader terms (BT) relationships are defined between two terms ti and tj, where ti has a broader, more general meaning than tj. BT relationships are not symmetric;
• Narrower terms (NT) relationships are the opposite of BT relationships;
• Related terms (RT) relationships are defined between two terms ti and tj that are generally used together in the same context in the considered sources.
An intensional relationship is only a terminological relationship, with no implications on the extension/compatibility of the structure (domain) of the two involved classes (attributes).

Extensional relationships. Intensional relationships SYN and BT (NT) between two class names may be “strengthened” by establishing that they are also extensional relationships:
• Ci SYN_ext Cj: this means that the instances of Ci are the same as the instances of Cj;
• Ci BT_ext Cj: this means that the instances of Ci are a superset of the instances of Cj.
The standard IS-A relationship Ci IS-A Cj of object-oriented languages implies the extensional relationship Cj BT_ext Ci. ODLI3 also extends ODL with the addition of integrity-constraint rules, which declaratively express if-then rules at both the intra- and inter-source level. ODLI3 descriptions are translated into the Description Logic OLCD (Object Language with Complements allowing Descriptive cycles) [10], in order to perform inferences that are useful for semantic integration. Because the ontology is composed of concepts (represented as global classes in ODLI3) and simple binary relationships, translating ODLI3 into a Semantic Web standard such as RDF, DAML+OIL, or OWL is a straightforward process. In fact, from a general perspective, an ODLI3 concept corresponds to a class of the Semantic Web standards, and ODLI3 attributes are translated into properties. In particular, the IS-A ODLI3 relationships are equivalent to subclass-of in the considered Semantic Web standards. Analyzing the syntax and semantics of each standard, further specific correspondences might be established. For example, there is a correlation between ODLI3’s simple domain attributes and the DAML+OIL DataTypeProperty concept. Complex domain attributes further correspond to the DAML+OIL ObjectProperty concept (http://www.w3.org/TR/daml+oil-reference).
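To make the translation concrete, the following sketch maps a few invented ODLI3-style class descriptions into RDF/OWL-like triples. The dictionary encoding and the translation rules shown are a simplification of the correspondences described above, not the actual MOMIS translator:

# Hypothetical sketch: translating ODLI3-style classes into RDF/OWL-like triples.
# A global class becomes a class, IS-A becomes subclass-of, simple-domain
# attributes become datatype properties, complex-domain attributes become
# object properties, following the correspondences described in the text.

odli3_classes = {
    "Enterprise": {"isa": None, "attributes": {"name": "string", "owner": "Person"}},
    "Firm":       {"isa": "Enterprise", "attributes": {}},
    "Person":     {"isa": None, "attributes": {"age": "int"}},
}
SIMPLE_DOMAINS = {"string", "int", "float", "date"}

def to_triples(classes):
    triples = []
    for cname, cdef in classes.items():
        triples.append((cname, "rdf:type", "owl:Class"))
        if cdef["isa"]:
            triples.append((cname, "rdfs:subClassOf", cdef["isa"]))
        for attr, domain in cdef["attributes"].items():
            kind = ("owl:DatatypeProperty" if domain in SIMPLE_DOMAINS
                    else "owl:ObjectProperty")
            triples.append((attr, "rdf:type", kind))
            triples.append((attr, "rdfs:domain", cname))
            triples.append((attr, "rdfs:range", domain))
    return triples

for t in to_triples(odli3_classes):
    print(t)   # e.g. ('Firm', 'rdfs:subClassOf', 'Enterprise')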
A Generalized Yule Stochastic Process for the Modelling of the Web Graph Growth
Giulio Concas, Mario Locci, Michele Marchesi, Sandro Pinna, Ivana Turnu
Department of Electrical and Electronic Engineering, Università di Cagliari, Cagliari, 09123, Italy
{concas, mario.locci, michele, pinnasandro, ivana.turnu}@diee.unica.it

Abstract

The Web is made up of a large number of well-known communities [14], groups of individuals who share a common interest. In this paper we present a dynamic model of the Web growth, based on communities and derived from the Yule stochastic process. The Web, modelled as a graph (the Webgraph), has been largely studied by researchers, who found three main properties of the Webgraph: scale-free, small world and fractal. The model proposed in this paper is dynamic because the Webgraph is continuously growing at a rapid pace, both in terms of number of nodes and of links. Networks generated by our model are both scale-free and small world. Our model is based on communities: a page belonging to a community has more links with pages of the same community than with pages of different communities, and the size of communities follows a power law. The model is also based on the Yule stochastic process, which is one of the most applicable mechanisms for generating power law distributions. We obtain distinct power-law distributions for in-degree and out-
degree, with exponents that are in good agreement with current data for the Web.
1 Introduction
Complex networks existing in nature and derived from human activities have attracted the attention of researchers in recent years. These networks may be modelled as a graph made up of nodes and links among them. A large number of real complex networks are characterised by common properties, the most important of them being “scale-free” and “small world”. The first property means that the network shows a power law distribution of the number of links per node (degree), p(x) ∝ x^(−γ) [7]. The scale-free property can be explained with the principle of preferential attachment: nodes of high degree attract new connections with higher probability. The small world property means that one can reach a node from another node, following the shortest path between the nodes, in a small number of steps, even in the case of large graphs [16]. Mathematically, the average shortest path length ℓ grows as the logarithm of the total number of nodes N, i.e., ℓ ∼ ln N [21]. The World-Wide Web may be modelled as a graph, where nodes represent Web pages and hyperlinks represent directed edges between Web pages; we refer to this as the Webgraph. Many papers reported scale-free and small world properties also for the Web [7], [21], [18]. Scale-free degree distributions [13], [2], [1], [3] have been found both for inlinks and outlinks (exponents γin ≈ 2.1 and γout ≈ 2.6, respectively). Moreover, it has been found that the average number of edges per vertex is about 7 [13], [15]. In this paper we present a dynamic model of the Web based on the Yule stochastic process with embedded communities. This model may be used to build a graph which exhibits the same properties observed in the literature for the Webgraph. For example, the model can be calibrated in order to get a graph having the same power law exponents of the Webgraph, both for inlinks and outlinks. The distribution of community sizes is modelled to have a power law form with α ≈ 2, as reported by Clauset et al. [5]. Developing a realistic model of the Web is important for several reasons:
1. Mining information on the Web. Data mining on the Web has
greatly benefited from the analysis of the link structure of the Web.
2. Specific algorithms could be proven to work within the model. In fact, many problems to address on the Web are computationally difficult for general graphs.
3. Predicting the evolution of new phenomena of the Web.
2 Related works
Kumar et al. [14] and Barabasi and Albert [1] found that both the in-degree and the out-degree distribution of the nodes of the Webgraph follow a power law. This property was confirmed by Broder et al. [2] as a basic Web property. Flake et al. [8] suggest that a Web community is a collection of Web pages in which the pages belonging to the community have more connections (inlinks or outlinks) to pages inside the community than to pages outside it. They devised an algorithm to discover the communities based on the “max flow-min cut” theorem of Ford and Fulkerson [11]. Clauset et al. [5] described a new algorithm for detecting community structure and found that the distribution of community sizes follows a power law with α ≈ 2; the ten largest communities account for about 87% of the entire network. A model to generate a graph with embedded communities has been proposed by Tawde et al. [20]. They generate M communities, each with a certain number of nodes, and each node with a certain number of outlinks and inlinks; all the outlinks and inlinks are initially dangling. They use the topic citation matrix [4] to model the interaction among communities: the element [i][j] of this matrix gives the fraction of the total outlinks of the community Ci pointing to nodes within Cj. The remaining dangling links are connected to nodes in the ROW (the rest of the Web).
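The role of the topic citation matrix can be illustrated with a toy example; the three communities and all numbers below are invented:

# Hypothetical example of a topic citation matrix for three communities.
# Row i gives the fractions of community Ci's outlinks that point into each
# community Cj; the remainder of each row corresponds to dangling links to be
# attached to the ROW (the rest of the Web).
citation = [
    # C0    C1    C2
    [0.70, 0.10, 0.05],   # C0: 70% of its outlinks stay inside C0
    [0.05, 0.80, 0.05],   # C1
    [0.10, 0.10, 0.60],   # C2
]

for i, row in enumerate(citation):
    to_row = 1.0 - sum(row)   # fraction of outlinks going to the ROW
    print(f"C{i}: {row[i]:.0%} internal, {to_row:.0%} to the rest of the Web")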
3 The model
The model is based on the Yule stochastic process, which was originally used to explain the number of species of a particular genus [17], [19],
[22]. In a Yule process, entities get random increments of a given property in proportion to their present value of that property. From time to time, new entities are created with an initial value of their property equal to k0, and in between the appearance of one entity and the next, m existing entities are chosen in proportion to their current value of the property, plus a constant c. The Yule process leads to a power law distribution of the number of species per genus. The exponent of the power law was calculated by Newman [17] as reported in equation 1:

α = 2 + (k0 + c)/m    (1)

We have extended the Yule process by considering Web pages as “genus” and the links as “species”. Thus, the probability of a Web page getting a new link is proportional to the number of links already pointing to that page [17], [6]. For example, a Web page that already has many links is more likely to be discovered during Web navigation and hence more likely to be linked again. In particular, for the Web model we have defined two kinds of species, inlinks and outlinks, because both the in-degree and the out-degree distribution of the nodes of the Webgraph follow a power law. In addition, each node belongs to a community, and we suppose that all the community nodes have at least fifty percent of their links inside the community [8]. The dynamics of the model may be easily described as follows (a simulation sketch is given after the list):
1. The graph starts with a number of nodes ≥ k0 belonging to different communities. As we will explain in point 4 below, k0 is the number of initial outlinks of a new node.
2. A community is chosen from a power law with exponent γc. Let NC be the number of communities; we choose from the power law a number between 1 and NC.
3. A new node (startNode) is added to the graph. This node belongs to the community chosen in the previous point 2.
4. The new startNode arises with k0 outlinks, and each outlink ends in a node (endNode) that is chosen as follows. First we decide whether endNode belongs to the same community as startNode (with probability p) or to another community (with probability 1 − p); then endNode is chosen with a probability proportional to the number of its inlinks plus a constant Cin.
5. m links are added between the existing nodes. Each link starts in a node that is chosen with a probability proportional to the number of its outlinks plus a constant Cout. The end node of the link is chosen as in point 4: first we decide whether the node belongs to the same community (probability p), then we choose the node with a probability proportional to the number of its inlinks plus Cin.
6. Steps 2 to 5 are repeated N times in order to generate a graph with exactly N nodes.
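The generative process just listed can be sketched as follows. This is our own minimal simulation, not the authors' code; parameter names mirror those of the model (k0, m, Cin, Cout, p, γc), while the helper function and data layout are invented:

# Minimal sketch of the community-based generalized Yule process.
import random

def pick_weighted(nodes, weight):
    """Choose a node with probability proportional to weight(node)."""
    return random.choices(nodes, weights=[weight(n) for n in nodes], k=1)[0]

def generate(N, NC, p, k0, m, c_in, c_out, gamma_c):
    nodes = []   # each node: {"community": int, "in": int, "out": int}

    def end_node(community):
        # Steps 4/5: same community with probability p, otherwise another one;
        # within the chosen pool, preferential attachment on inlinks + Cin.
        same = random.random() < p
        pool = [n for n in nodes if (n["community"] == community) == same] or nodes
        return pick_weighted(pool, lambda n: n["in"] + c_in)

    for _ in range(N):
        # Step 2: community index drawn from a power law with exponent gamma_c.
        comm = min(NC, int(random.paretovariate(gamma_c - 1)))
        # Steps 3-4: new node with k0 outlinks (the very first node's
        # outlinks dangle in this sketch).
        new = {"community": comm, "in": 0, "out": k0}
        if nodes:
            for _ in range(k0):
                end_node(comm)["in"] += 1
            # Step 5: m extra links between existing nodes; the source is
            # chosen by preferential attachment on outlinks + Cout.
            for _ in range(m):
                src = pick_weighted(nodes, lambda n: n["out"] + c_out)
                src["out"] += 1
                end_node(src["community"])["in"] += 1
        nodes.append(new)
    return nodes

graph = generate(N=2000, NC=800, p=0.6, k0=3, m=4, c_in=1.3, c_out=0.3, gamma_c=2.1)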
4 Results
The simulation results show that the graph obtained with our model respects the most important properties of the Webgraph. Two plots illustrating the complementary cumulative distribution of the in-degree and out-degree of a graph generated by our model are reported in Figure 1, together with the parameter values used for the simulation. We can notice that the in-degree and out-degree follow power laws with exponents 2.1 and 2.7 respectively, which are very close to those reported in several observations and models of the Webgraph [15], [12], [13]. These values can be derived from the slope of the line providing the best fit to the data in the figure, but this method could introduce systematic biases in the exponent value [9]. An alternative, simple and reliable method for extracting the exponent of a power law is to use the Hill estimator [10]. In order to avoid the noise in the tail due to sampling errors, we draw a plot of the cumulative distribution. We also found an average number of edges per vertex (≈ 7) in accordance with several studies [15], [13]. Equations 2 and 3 show the correspondence between the parameters of the model and the exponents of the in-degree and out-degree distributions:

γin ≈ 2 + Cin/(m + k0)    (2)

γout ≈ 2 + (Cout + k0)/m    (3)
We obtained these equations by analogy with equation 1 for the original Yule process, suitably adapted to our process. Thus, these equations are not demonstrated analytically, but they are supported by empirical results.
[Figure 1 shows two log-log plots of the complementary cumulative distribution P(X > x) versus X: panel (a) for the in-degree, panel (b) for the out-degree.]

Figure 1: The complementary cumulative distribution of in-degree and out-degree for a simulation with the following parameters: N = 2·10^5 nodes, NC = 800, p = 0.6, Cout = 0.3, Cin = 1.3, m = 4, k0 = 3, γc = 2.1.
The generated graph has links between nodes of the same community (intra-links) or of different communities (extra-links). The ratio between the number of intra-links and the total number of links (intra-links + extra-links) is very close to the probability p, which is one of the model's parameters. In order to verify the small-world property, we calculated the average shortest path length ℓ for graphs with a growing number of nodes N, and we observed that ℓ ∼ ln N with a very good fit.
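For completeness, the Hill estimator mentioned above can be written in a few lines. The formula is the standard one (see [10] and [17]); the sample data below are synthetic:

# Hill estimator for the tail exponent of a power law: given the observations
# above a threshold x_min, alpha_hat = 1 + n / sum(ln(x_i / x_min)).
import math, random

def hill_exponent(samples, x_min):
    tail = [x for x in samples if x >= x_min]
    n = len(tail)
    return 1.0 + n / sum(math.log(x / x_min) for x in tail)

# Quick check on synthetic power-law data with exponent 2.1:
data = [random.paretovariate(1.1) for _ in range(100_000)]  # p(x) ~ x^-2.1
print(hill_exponent(data, x_min=1.0))  # close to 2.1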
5 Conclusions
In this work we have presented a model for the generation of graphs with embedded communities. The generated graphs are both scale-free and small-world. The model, derived from the Yule stochastic process, can generate graphs that exhibit the main properties of the Webgraph. We have seen that, by using equations 2 and 3, it is possible to set the parameters of the model in order to have in-degree and out-degree distributed with power laws of exponents 2.1 and 2.6 respectively. Moreover, the average number of edges per vertex is about 7, and the community size is distributed as a power law. All these figures are very close to those found in the literature for real Web data.
Acknowledgements

This work was supported by the MAPS (Agile Methodologies for Software Production) research project; contract/grant sponsor: FIRB research fund of MIUR; contract/grant number: RBNE01JRK8.
Bibliography

[1] A.-L. Barabasi and R. Albert, Emergence of scaling in random networks, Science 286 (1999), 509.
[2] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, Graph structure in the web, Computer Networks 33 (2000), 309–320.
[3] G. Caldarelli, R. Marchetti, and L. Pietronero, Europhys. Lett. 52 (2000), 386.
[4] Soumen Chakrabarti, Mukul M. Joshi, Kunal Punera, and David M. Pennock, The structure of broad topics on the web, 2002.
[5] Aaron Clauset, M. E. J. Newman, and Cristopher Moore, Finding community structure in very large networks, Physical Review E 70 (2004), 066111.
[6] G. Concas, M. Marchesi, S. Pinna, and N. Serra, On the suitability of the Yule process to stochastically model some properties of object-oriented systems, Physica A, in press (2006).
[7] S.N. Dorogovtsev, J.F.F. Mendes, and A.N. Samukhin, Principles of statistical mechanics of random networks, Nuclear Physics B 666 (2003), 396.
[8] Gary William Flake, Steve Lawrence, C. Lee Giles, and Frans Coetzee, Self-organization of the web and identification of communities, IEEE Computer 35 (2002), no. 3, 66–71.
[9] X. Gabaix and Y. M. Ioannides, The evolution of city size distributions, Handbook of Regional and Urban Economics, vol. 4 (J. V. Henderson and J. F. Thisse, eds.), North-Holland, 2004, pp. 2341–2378.
[10] B.M. Hill, A simple general approach to inference about the tail of a distribution, The Annals of Statistics 3 (1975), 1163–1174.
[11] L.R. Ford Jr. and D.R. Fulkerson, Maximal flow through a network, Canadian Journal of Mathematics 8 (1956), no. 3, 399–404.
[12] P.L. Krapivsky and S. Redner, A statistical physics perspective on web growth, Computer Networks 39 (2002), 261.
[13] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, and Eli Upfal, The Web as a graph, Proc. 19th ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems, PODS, ACM Press, 15–17 2000, pp. 1–10.
[14] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, Trawling the Web for emerging cyber-communities, Computer Networks (Amsterdam, Netherlands: 1999) 31 (1999), no. 11–16, 1481–1493.
[15] L. Laura, S. Leonardi, G. Caldarelli, and P. De Los Rios, A multi-layer model for the web graph, 2002.
[16] S. Milgram, The small world problem, Psychol. Today 2 (1967), 61–67.
[17] M.E.J. Newman, Power laws, Pareto distributions and Zipf's law, Contemporary Physics 46 (2005), 323.
[18] R. Albert, H. Jeong, and A.-L. Barabasi, Diameter of the World-Wide Web, Nature 401 (1999), 130–131.
[19] H. Simon, On a class of skew distribution functions, Biometrika 42 (1955), 425–440.
[20] Vivek Tawde, Tim Oates, and Eric J. Glover, Generating web graphs with embedded communities, WAW, 2004, pp. 80–91.
[21] D. Watts and S.H. Strogatz, Collective dynamics of 'small-world' networks, Nature 393 (1998), 440–442.
[22] G. Yule, A mathematical theory of evolution based on the conclusions of Dr. J.C. Willis, F.R.S., Philosophical Transactions of the Royal Society of London 213 (1925), 21–87.
Enhancing JSR-179 for Positioning System Integration and Management
Paolo Bellavista, Antonio Corradi, Carlo Giannelli Dip. Elettronica, Informatica e Sistemistica - Università di Bologna Viale Risorgimento, 2 - 40136 Bologna - ITALY {pbellavista, acorradi, cgiannelli}@deis.unibo.it
Abstract Several heterogeneous positioning systems are becoming widespread on client wireless terminals, thus increasing the market relevance of Location Based Services (LBSs). Positioning techniques are highly differentiated, e.g., in terms of precision, accuracy, and battery/bandwidth consumption, and several of them are often simultaneously available at clients. That motivates novel middleware solutions capable of integrating the dynamically accessible positioning techniques, of controlling them in a synergic way, and of switching from one positioning system to another at service provisioning time by choosing the most suitable solution depending on application-level LBS context. In this perspective, the paper proposes the original PoSIM solution, which significantly extends the emerging JSR-179 standard specification to allow differentiated forms of visibility/control of low-level positioning characteristics, greater flexibility in location change-driven event triggering, and the simultaneous management of multiple and dynamically introduced location techniques.
1 Introduction
The growing availability of powerful mobile devices with relatively high wireless bandwidth, e.g., via UMTS, IEEE 802.11, and Bluetooth 2.0 connectivity, is fostering the widespread diffusion of Location Based Services (LBSs). LBSs can provide service contents depending on the current user position, on the mutual location of clients and accessed server resources, and on the mutual position of users in a group [1]. Positioning techniques are crucial to enable LBSs. Many research activities have thoroughly evaluated mechanisms and technologies for positioning: some solutions have been specifically designed for determining location, e.g., the well-known Global Positioning System (GPS); other proposals estimate positioning information by monitoring characteristics of general-purpose communication channels, such as the IEEE 802.11-based Ekahau [2]. For a more exhaustive survey of positioning systems, please refer to [3, 4]. Currently available positioning solutions differ greatly in capabilities and provided facilities. For instance, they diverge in:
• the representation model of the provided location information, which can be physical (a longitude, latitude, and altitude triple), symbolic (e.g., room X in building Y), or both;
• the applicable deployment environment: for instance, GPS works only outdoors, Ekahau primarily indoors;
• accuracy and precision of the positioning information: accuracy is the location data error range (10 meters for GPS), while precision is the confidence in that error range (95% for GPS);
• power consumption, which usually depends on location update frequency;
• user privacy: for instance, client nodes that exploit IEEE 802.11-based positioning have to disclose their location, to some extent and at a certain granularity, to be able to communicate (they must associate with an AP for communication purposes);
• additional supported features, which can be peculiar to specific positioning systems: for instance, some positioning solutions can provide location data as a probability distribution function.
That heterogeneity among the available positioning systems, together with the fact that current wireless clients tend to simultaneously host several wireless technologies useful for positioning (e.g., terminals with Wi-Fi and/or Bluetooth connectivity and/or equipped with GPS), motivates the need for novel middleware solutions capable of integrating the available positioning techniques, of controlling them in a synergic way, and of dynamically selecting the most suitable solution depending on context. First of all, such middleware should make it possible to switch seamlessly from one positioning system to another based on availability, e.g., GPS outdoors and Ekahau indoors. Then, it should suggest exploiting, at any time, the positioning technique that best fits user preferences, application requirements, and device resource constraints: for instance, the positioning system with the lowest power consumption when priority is given to battery preservation, or the one with the greatest accuracy and precision, or the one with the most frequent updates, or the one providing physical/symbolic location information. Moreover, when several positioning systems can work concurrently, the middleware could perform fusion operations on location data, e.g., to increase accuracy and/or confidence. For all these purposes, there is also the need to make low-level characteristics of positioning systems easily accessible to the upper layers (middleware and/or application levels), thus enabling application-specific control of positioning techniques, possibly without complicating LBS development and deployment. The paper extensively discusses the integration and management of heterogeneous positioning systems. Section 2 describes related work about middleware solutions for positioning integration, while Section 3 focuses on JSR-179, an emerging standard API for positioning. Section 4 rapidly sketches our original PoSIM middleware and its main components, while Section 5 compares the proposed PoSIM API with the JSR-179 one. Conclusions and ongoing work end the paper.
2 Related Work
Several academic research activities have recently addressed the area of dynamically fusing positioning information from different sources. Here, we present only a few of them to point out the primary solutions related to context/location information and positioning system integration. Some positioning middleware proposals mainly aim to support the easy development and deployment of LBSs. The main idea is to hide the complexity stemming from the adoption of several heterogeneous positioning systems, by providing integrated positioning services in a transparent manner. All low-level information and details are hidden, and there is no way to control positioning system behavior. Most of these proposals make positioning system integration completely transparent from the LBS point of view. [5] focuses on the integration of several positioning systems through appropriate wrappers exploited to provide a uniform API to heterogeneous positioning systems. The goal is to exploit the positioning system that is currently available or that best fits accuracy requirements, possibly performing location data fusion. Moreover, [5] actively supports user-controlled privacy, by requesting explicit user permission before disclosing location information. [6] has the primary goal of seamless navigation, i.e., providing location information regardless of the positioning systems and maps actually exploited. Its main solution guideline is to exploit middleware components called mediator-wrappers to abstract from the peculiarities of each exploited positioning system and map implementation. In addition, it allows the exploited positioning system and map to be changed dynamically, transparently from the user/application point of view. [7] supplies a specific interface to develop new LBSs. Moreover, it supports the introduction of new positioning systems through a plug-in architecture; the middleware kernel interacts with positioning systems in a standardized manner, via OSA/Parlay. Similarly, [8] tries to abstract from the adoption of several positioning systems: it performs information abstraction through a multi-step architecture for location data fusion, generation of geometric relationships, and event-based information disclosure. [9] goes further by proposing different abstraction steps to provide high-level location data: positioning, modeling, fusion, query tracking, and intelligent notification. Moreover, it ensures privacy and security management by controlling information disclosure, similarly to [5]; positioning system integration is achieved by the Common Adapter Framework, which provides standard APIs to fetch the location information of mobile devices. The previously described middlewares integrate several positioning systems with the goal of facilitating LBS development. They tend to propose transparent approaches that shield applications from positioning complexity, but do not support any application-specific form of control over the currently available positioning techniques. A few proposals start to delineate cross-layer supports that provide visibility of low-level details
and control features at the application level. In the remainder of this section, we focus on middleware solutions that, to some extent, propagate the visibility of low-level details to the application level. MiddleWhere [10] offers some abstraction facilities, like the previously described solutions, but also provides applications with low-level details. In particular, it can provide requesting clients with additional data about location resolution, confidence, and freshness. An adapter component acts as a device driver allowing MiddleWhere to communicate with positioning system implementations: the adapter makes the location description uniform by hiding any positioning implementation peculiarity. [11] supports the integration and control of several positioning systems while providing low-level details at the application layer. However, it performs integration and control in a hard-coded, inflexible manner. In addition, the visibility of data/features peculiar to a specific positioning system requires full static knowledge of that system, thus significantly increasing LBS development complexity. Location Stack [12] represents a state-of-the-art model for location/context fusion: it identifies several sequential components, deployed in a layered manner, which provide increasing abstraction: Sensors, Measurements, Fusion, Arrangements, Contextual Fusion, Activities, Intentions. However, the Universal Location Framework [13], a possible implementation of Location Stack, demonstrates that such a layered system does not easily propagate the visibility of useful low-level data such as accuracy and precision. In fact, [13] points out that cross-layering is required both to supply low-level details to LBSs and to control positioning systems from the application level. In conclusion, most proposed middlewares mainly address the location fusion issue and tend to hide any low-level detail depending on positioning technique and implementation. [10], [11] and [13] offer some low-level details, but these have to be statically pre-determined. To the best of our knowledge, no middleware solution in the literature addresses the challenge of dynamic and integrated control of available positioning systems by considering application-level requirements in a flexible and extensible way.
3 The JSR-179 Location API
In recent years, industrial research activity has primarily focused on the development of standards to address the wide heterogeneity of
available positioning systems. The JSR-179 API [14], also known as the Location API for J2ME, represents the most notable result of that standardization effort for Java-based LBSs on mobile phones. JSR-179, inspired by the usual and widespread interface of the GPS solution, provides a standardized API to perform coarse-grained integration and control of positioning systems (location providers, according to the JSR-179 terminology). To better understand how JSR-179 provides location information, here we rapidly report its main characteristics and offered functions. The LocationProvider class is the JSR-179 API entry point. Applications invoke the getInstance() method of LocationProvider to retrieve an actual location provider implementation among the currently available ones. The actual location provider is the selected positioning system that returns location information to applications. When invoking the getInstance() method, an application optionally specifies particular criteria (Criteria class) that the actual location provider must satisfy. If several actual location providers are compatible with the passed criteria, LocationProvider selects the one that best fits the requirements according to a pre-determined strategy. Criteria can specify that the actual location provider must supply speed and altitude, and/or that the provided horizontal/vertical coordinates have to respect a minimum accuracy level. Moreover, it is possible to specify the desired power consumption (low, medium, or high). Let us note that the passed criteria are exploited only at the moment of the selection of the actual location provider; they are completely neglected at provisioning time. Figure 1 depicts an example of an application that requests an actual location provider implementation by specifying the desired selection criteria. The result is the activation of the positioning system best fitting the criteria among the currently available ones (Location Provider 2 in the figure). Location Provider 2 remains associated with the application until a new explicit request of location provider selection to the JSR-179 API.
Figure 1: The JSR-179 API for criteria-based selection of an actual LocationProvider implementation.
Location providers return location data in three different ways:
• on demand, via the getLastKnownLocation() and getLocation(timeout) methods, which respectively provide cached and just-updated location information, the latter actively requesting new data from the underlying positioning system;
• periodically at fixed time intervals, via the setLocationListener(listener, interval, timeout, maxAge) method; only one periodic listener at a time can be registered with each location provider instance;
• in an event-driven fashion, via the addProximityListener(listener, coordinates, proximityRadius) method; the only triggering event that can be exploited in JSR-179 is the proximity of the located client to specified coordinates, and several proximity listeners may simultaneously register multiple coordinates close to which a location provider triggers the events.
The provided location information specifies qualified coordinates (physical location), address info (symbolic location), or both. Moreover, it may include additional data such as speed, timestamp, and the technology of the actual location provider. JSR-179 is a good example of a standardization effort in the industrial research area to leverage the adoption of positioning systems and LBSs. Its architecture and API have the goal of representing a standardized model for every developer willing to provide new positioning systems or LBSs. However, we claim that JSR-179 does not provide a sufficiently expressive API to perform efficient integration and control of positioning systems. In particular, it supports neither the dynamic management of multiple location providers nor the provisioning of low-level system-specific details to the application level as required by many LBSs.
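For concreteness, the fragment below sketches a typical client-side use of the three delivery modes just described, based on the javax.microedition.location package defined by JSR-179. The criteria values (10 m accuracy, low power) and the target coordinates are arbitrary examples, and error handling is kept to a minimum.

```java
import javax.microedition.location.*;

public class Jsr179Sketch {
    public void configurePositioning() throws LocationException, InterruptedException {
        // Criteria-based selection of the actual location provider; note that
        // getInstance() may return null if no available provider meets the criteria.
        Criteria criteria = new Criteria();
        criteria.setHorizontalAccuracy(10);                              // meters
        criteria.setPreferredPowerConsumption(Criteria.POWER_USAGE_LOW);
        LocationProvider provider = LocationProvider.getInstance(criteria);

        // 1) On demand: block (up to 30 s) waiting for a just-updated fix.
        Location loc = provider.getLocation(30);
        QualifiedCoordinates qc = loc.getQualifiedCoordinates();
        System.out.println(qc.getLatitude() + ", " + qc.getLongitude());

        // 2) Periodic: at most one listener per provider instance, every 10 s.
        provider.setLocationListener(new LocationListener() {
            public void locationUpdated(LocationProvider p, Location l) {
                // consume the periodic fix here
            }
            public void providerStateChanged(LocationProvider p, int state) {
                // e.g., react to TEMPORARILY_UNAVAILABLE
            }
        }, 10, -1, -1);

        // 3) Event-driven: fire when within 100 m of the given coordinates
        // (altitude left unknown via Float.NaN).
        LocationProvider.addProximityListener(new ProximityListener() {
            public void proximityEvent(Coordinates c, Location l) {
                // the client entered the proximity radius
            }
            public void monitoringStateChanged(boolean active) { }
        }, new Coordinates(44.49, 11.34, Float.NaN), 100f);
    }
}
```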
First of all, JSR-179 does not support the dynamic and flexible management of dynamically retrieved location provider implementations. On the one hand, it permits exploiting only one location provider at a time among the ones currently available at a client, even if several of them satisfy the specified criteria. On the other hand, according to the JSR-179 specification, LBSs bear the full burden of monitoring the performance of the selected location provider and of consequently taking suitable management actions, e.g., requesting a new location provider selection in response to accuracy degradation. In other words, once JSR-179 has selected a location provider, the specified criteria are no longer considered, even if the capabilities of the actual location provider no longer satisfy the LBS requirements or if a new, more suitable location provider becomes available at the client. In addition, the JSR-179 API assumes that the characteristics of location providers are statically identified and do not change considerably over time: that is partially true for static features, e.g., the ability to provide speed/altitude or not, but not applicable to dynamic characteristics such as horizontal/vertical accuracy. For example, GPS accuracy may abruptly decrease when the user moves from an outdoor to an indoor environment. Moreover, JSR-179 has dynamicity and flexibility limitations also because it cannot accommodate positioning systems newly introduced at service provisioning time. The actual location provider implementation is determined only once, at the moment of location provider instantiation; JSR-179 does not consider any context change after that instantiation, until a new LBS request for actual location provider determination. Another limitation of JSR-179 is that selection criteria are limited to a few statically pre-determined elements: it is possible to specify as requirements only the features defined in the Criteria class before service provisioning. Moreover, the event handling functions of JSR-179 also exhibit non-negligible limitations, as already pointed out: only one type of triggering event is supported, the one related to proximity to a fixed location. However, in our opinion, corroborated by our experience in developing and prototyping LBSs, the most relevant shortcoming of JSR-179 is its limited capability to propagate the visibility of low-level details of the underlying location providers when needed. In fact, the only state information available about location providers is their availability status (available, temporarily unavailable, or out of order). This full and uniform transparency of low-level positioning system features does not always fit the requirements of application-level visibility typical of LBSs.
For example, an LBS may need to access and control peculiar positioning system functions, e.g., to get and possibly change the location provider privacy level. Academic research on extending JSR-179 capabilities to achieve greater flexibility and dynamicity is still at its very beginning, also due to the novelty of the standardization effort. [15] proposes the integration and management of multiple positioning systems via a fully JSR-179-compliant API. It tries to increase dynamicity by transparently switching among available positioning systems: in particular, it alternately exploits GPS- or Bluetooth-based positioning depending on the client's outdoor/indoor location. However, the proposal supports neither the dynamic change of positioning selection criteria (only system availability is considered), nor the integration of new positioning systems at provisioning time. Moreover, it does not provide any function at all to control the integrated positioning systems from the application layer.
4 The PoSIM Middleware
Our goal is to go further than just hiding positioning system integration behind the JSR-179 API. The objective is to provide a middleware solution that significantly extends JSR-179 with new and more powerful features to support positioning system integration and management in a flexible, dynamic, and extensible way. At the same time, our middleware proposal should adopt an API similar to the JSR-179 one, at least where possible, to facilitate its adoption by developers of both positioning systems and LBSs. This section describes our Positioning System Integration and Management (PoSIM) middleware for the efficient and flexible integrated management of different positioning systems. In particular, PoSIM focuses on three aspects. First of all, it is capable of integrating positioning systems at service provisioning time in a plug-in fashion, by exploiting their possibly synergic capabilities and by actively controlling their features. Secondly, PoSIM allows positioning systems to flexibly expose their capabilities and location data at runtime, without requiring any static knowledge of positioning-specific data/functions. Third, it can perform location data fusion depending on the applicable context, e.g., application-specific requirements about accuracy or client requirements
about device battery consumption. Furthermore, PoSIM enables differentiated visibility levels to flexibly meet the application requirements stemming from different LBS deployment scenarios and application domains. On the one hand, PoSIM enables LBSs to access and control the whole set of available location providers in a transparent way, at a high level of abstraction: LBSs can simply specify the behavior positioning systems must comply with via declarative policies; PoSIM is in charge of actually and transparently enforcing the selected policies. On the other hand, PoSIM allows LBSs to have full visibility of the characteristics of the underlying positioning systems via PoSIM-mediated simplified access to them. In this case, PoSIM provides LBSs with a uniform API, independent of the specific positioning solution, that permits accessing/configuring heterogeneous location providers homogeneously and in an aggregated way. We call translucent the original PoSIM approach that supports LBSs with both transparent and visible integrated access to the available positioning solutions. Thanks to the translucent approach, two different classes of PoSIM-based LBSs can properly manage heterogeneous positioning systems: simple LBSs and smart ones. Via PoSIM, simple LBSs can interact transparently with location providers perceived as a single service exposing a JSR-179-like API. They can control positioning systems easily, just specifying the required behaviors via declarative policies or simply selecting the policies to enforce among pre-defined ones, e.g., by privileging low energy consumption or high location accuracy. Instead, smart LBSs, i.e., applications willing to have direct visibility of and to manage location information or peculiar capabilities of positioning systems, interact in a middleware-mediated aware fashion: they have PoSIM-based uniform access to all functions of the underlying positioning solutions, even the system-specific ones, e.g., the possibility of limiting Ekahau accuracy to reduce network overhead. Let us stress that we distinguish between positioning features and infos. Features describe positioning system characteristics and capabilities, possibly with settable values, useful for positioning system control, e.g., power consumption or ensured privacy level. Infos are location-related information, e.g., actual positioning data and their accuracy, not modifiable from outside the positioning systems. Infos are the only data provided to simple LBSs. In the following, the section briefly presents all the PoSIM components, in order to point out how the translucent approach is implemented
to provide a more dynamic, extensible, and powerful version of the JSR-179 API. For further implementation details about the PoSIM components, please refer to http://lia.deis.unibo.it/Research/PoSIM.
Figure 2: The PoSIM architecture (white arrows represent data flows, grey arrows control flows).
To interact with positioning systems in a transparent manner, simple LBSs can exploit the Policy Manager (PM) and Data Manager (DM) components depicted in Figure 2. Via those APIs, simple LBSs can ask for pre-defined behaviors implemented as declarative policies, without any knowledge of how the integrated positioning systems are actually exploited. For example, the POWER USAGE LOW policy turns off all the positioning systems with high energy consumption while preserving application-specific requirements about precision and accuracy. PM is in charge of maintaining pre-defined policies and enforcing active ones; it is implemented on top of the Java-based rule engine Jess [16]. Via the DM component, PoSIM provides integrated positioning system info in an aggregated way, as a single XML document whose tags specify the content semantics, thus permitting a significantly higher level of dynamicity. In addition, PoSIM can offer location data for any integrated and currently active positioning system. Location data retrieval is possible on request, at a specified time period, or via event notification. LBSs can easily specify the conditions that trigger XML document delivery. For instance, the pre-defined atLocation condition triggers location data notification only when the current physical
location of the user is close to a known location, similarly to the only possibility available in JSR-179 via the proximity listener. In addition, LBSs may request DM to work as a filter: e.g., the pre-defined highAccuracy data filter discards location information with accuracy below a given threshold. Note that the proper exploitation of filtering rules permits reducing the network overhead due to non-relevant changes of location data. PoSIM implements triggering events and filtering rules as Java classes, which can easily be sub-classed to specify specialized triggers and filters. Let us stress that expert users, such as PoSIM administrators, can develop and deploy new policies, i.e., selection/fusion criteria, triggering events, and filtering rules. The PoSIM behavior can thus be specialized and extended without any impact on either its implementation code or the application logic code. This makes it easy to extend and personalize the PoSIM middleware. For instance, it is possible to dynamically extend PoSIM capabilities by introducing the atChanges condition, which triggers location notification only when the current and previous physical locations differ by more than a specified distance. In any case, simple LBSs and novice developers can also work, simply and rapidly, by selecting among the existing set of common policies, events, and filters. Smart LBSs and PM/DM can directly control positioning systems by exploiting the lower-level API of the Positioning System Access Facility (PSAF). PSAF provides APIs to dynamically handle the insertion/removal of positioning systems and to retrieve/control data/features of all the currently available positioning systems. The only requirement is that positioning systems provide their data/features via a specified interface; that interface results from wrapping each positioning system with another PoSIM middleware component, the Positioning System Wrapper (PSW). PSAF exploits Java introspection to dynamically determine and access the set of data/features actually implemented by the underlying positioning solutions currently available in its deployment environment.
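To give a flavor of the sub-classing mechanism, the sketch below shows how an atChanges-like trigger could be coded. Since the paper does not report the actual PoSIM class hierarchy, the names LocationTrigger, PositionSample, and shouldFire() are illustrative assumptions.

```java
// Illustrative sketch: PoSIM's real trigger interface is not shown in the
// paper, so LocationTrigger, PositionSample, and shouldFire() are assumed names.
abstract class LocationTrigger {
    /** Returns true when a new location notification must be delivered. */
    abstract boolean shouldFire(PositionSample previous, PositionSample current);
}

/** Minimal physical-location sample (latitude/longitude in degrees). */
class PositionSample {
    final double lat, lon;
    PositionSample(double lat, double lon) { this.lat = lat; this.lon = lon; }
}

/** atChanges-like trigger: fires only when the client moved farther than a threshold. */
class AtChangesTrigger extends LocationTrigger {
    private final double thresholdMeters;
    AtChangesTrigger(double thresholdMeters) { this.thresholdMeters = thresholdMeters; }

    boolean shouldFire(PositionSample prev, PositionSample curr) {
        if (prev == null) return true;                 // first fix: always notify
        return haversineMeters(prev, curr) > thresholdMeters;
    }

    // Great-circle distance between two samples, in meters.
    private static double haversineMeters(PositionSample a, PositionSample b) {
        final double R = 6371000.0;                    // mean Earth radius
        double dLat = Math.toRadians(b.lat - a.lat);
        double dLon = Math.toRadians(b.lon - a.lon);
        double h = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(a.lat)) * Math.cos(Math.toRadians(b.lat))
                   * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * R * Math.asin(Math.sqrt(h));
    }
}
```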
5 Comparing PoSIM and JSR-179 API
This section aims at pointing out and discussing the main differences between the PoSIM API and the JSR-179 one. As depicted in Figure 2, PoSIM offers two levels of visibility to LBS developers: a transparent API, which is similar to the JSR-179 one, and a visible API, which extends traditional JSR-179 by providing the capability to directly interact
with the integrated positioning systems. The transparent part of the PoSIM API, provided by PM and DM, is similar to the JSR-179 one. However, since PoSIM offers extended and richer integration functions, there are necessarily a few API differences also in the transparent part. Delving into finer details, PM supplies many capabilities that JSR-179 does not provide. As already described, the JSR-179 API only exploits the location provider that best fits the criteria specified once at request time. On the contrary, PoSIM permits specifying and modifying criteria at service provisioning time. In fact, PM accepts declarative criteria similarly to JSR-179, but it actively and dynamically controls positioning system behaviors instead of simply selecting the system that best fulfills the specified requirements. Furthermore, since PoSIM criteria are implemented as Jess rules, it is possible to create new criteria and provide them at runtime, without either recompiling or restarting the system. DM also exposes many capabilities that the standard JSR-179 API cannot provide. First of all, since PoSIM can exploit several positioning systems at a time, it can also perform location fusion, for instance to increase location information accuracy. Then, LBSs may take advantage of every available positioning system suitable to their requirements. For this reason, PoSIM provides location information as an XML document, not as a single Location class like JSR-179 does. Both JSR-179 and PoSIM may perform data delivery in an event-driven fashion. However, while the JSR-179 API only supports statically determined triggering events, i.e., proximity-based event notification, PoSIM also provides the capability to exploit new events, specified and deployed at service provisioning time, thus significantly increasing system flexibility and extensibility. For instance, it is possible to specify the aforementioned atLocation and atChanges triggering events, the former similar to the only supported JSR-179 proximity event, the latter not available through the standard JSR-179 API. Capabilities provided by PSAF are completely absent from JSR-179. First of all, PSAF offers the possibility of integrating new positioning systems at service provisioning time. Newly integrated positioning systems can be immediately exploited by PoSIM: active criteria are dynamically applied to new positioning systems and their location information is automatically inserted in the provided XML document. On the contrary, to exploit a new positioning system through JSR-179,
LBSs must explicitly request another location provider instance, by actively providing their specific selection criteria again. Finally, the JSR-179 API tends to hide low-level positioning system details from application developers. On the contrary, if necessary, the PoSIM API provides full visibility of and fine-grained control over the integrated positioning systems, by permitting access to both standard and system-specific features/infos.
Table 1: Relationships between the primary functions of the JSR-179 and PoSIM APIs.
Table 1 reports and compares the main functions for info delivery or positioning system control available in the JSR-179 and PoSIM APIs, categorized as either transparent or visible. For each JSR-179 API method, the table reports the corresponding PoSIM one, by indicating which PoSIM component provides it and by pointing out possible differences between the JSR-179 and PoSIM implementations. In particular, a PoSIM method is classified as i) equivalent to the corresponding JSR-179 one if and only if they offer exactly the same capability, ii) extended if it provides more expressive and powerful features, and iii) additional if it introduces completely new behaviors not available in JSR-179. Most PoSIM methods offer the capability to control and interact with the integrated positioning systems in a transparent manner. Considering these transparent functions in finer detail:
• both PoSIM and JSR-179 offer a getInstance() method, but with significantly different expressiveness. While JSR-179 selects only one location provider among the currently available ones depending on the given criteria, PoSIM just returns a middleware-mediated interface instance. Let us stress that the PoSIM getInstance()
method provides LBS developers with the capability to get multiple simultaneous location data from any integrated positioning system, while JSR-179 allows access only to the actual location provider.
• The PoSIM onDemand(...) and JSR-179 getLastKnownLocation() methods behave similarly. Both immediately provide the last known location information, even if PoSIM returns data obtained by possibly fusing information from every integrated positioning system, while JSR-179 returns the data from the previously selected actual location provider.
• addEvent(...) significantly extends the expressive power of the corresponding addProximityListener(...). In fact, the former provides the capability to specify which kinds of event trigger location information delivery, while the latter can exploit only proximity events.
• setLocationListener(...) and periodical(...) are almost equivalent. Both periodically provide location information at a given time interval. However, while the JSR-179 API specifies that only one location listener can be registered at a time, periodical(...) permits registering several listeners, possibly serving multiple applications with the same location data at a time.
• The PoSIM addFilter(...) method permits defining new filters for location information (possibly after fusion). That capability is not supported at all by the JSR-179 API.
• The PoSIM activateCriteria(...) could seem similar to the JSR-179 getInstance(...), since both permit specifying selection criteria. However, they are greatly different, since the former activates a management policy exploited to control the integrated positioning systems at service provisioning time, while the latter simply selects the actual location provider at invocation time.
In addition to the above transparent functions, the following methods provide LBS developers with full but middleware-mediated visibility of the integrated positioning systems.
• insertPosSys(...) is available only in PoSIM. JSR-179 does not provide any method to add new positioning systems at service provisioning time.
• The JSR-179 getState() simply provides coarse-grained information about the availability of the actual location provider. PoSIM extends this function by providing a method to get all the available features of a given positioning system, getFeature(...), and a method to configure their values, setFeature(...), if allowed by the underlying positioning systems. Features are described in a portable and interoperable way according to the representation described at the PoSIM Web site lia.deis.unibo.it/Research/PoSIM.
Let us also rapidly observe that the PSAF getInfo(...) method is equivalent to the JSR-179 getLocation() method for getting just-updated info. The only difference is that the former provides information in a visible manner, the latter transparently.
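To make the comparison concrete, the fragment below sketches how a smart LBS might combine the transparent and visible parts of the PoSIM API. Only the method names come from Table 1 and the text above; the facade class, signatures, parameter types, and stub helper types are assumptions made purely for illustration.

```java
// Illustrative facade: PoSIM's real signatures are not given in the paper,
// so everything except the method names is assumed.
interface Trigger { }                       // stands for a triggering-event class
interface Filter { }                        // stands for a filtering-rule class
interface PositioningSystemWrapper { }      // stands for a PSW implementation

class PoSimFacade {
    static PoSimFacade getInstance() { return new PoSimFacade(); }
    void activateCriteria(String policy) { /* enforce a Jess-backed policy */ }
    void addEvent(Trigger t) { /* register a triggering event */ }
    void addFilter(Filter f) { /* register a filtering rule */ }
    String onDemand() { return "<locations/>"; }   // fused location data as XML
    void insertPosSys(PositioningSystemWrapper w) { /* plug in at runtime */ }
    Object getFeature(String system, String name) { return null; }
    void setFeature(String system, String name, Object value) { }
}

public class SmartLbsSketch {
    public static void main(String[] args) {
        PoSimFacade posim = PoSimFacade.getInstance();

        // Transparent usage: declare the desired behavior, let PoSIM enforce it.
        posim.activateCriteria("POWER USAGE LOW");   // pre-defined policy (Section 4)
        posim.addFilter(new Filter() { });           // e.g., a highAccuracy-like filter
        String xml = posim.onDemand();               // aggregated XML location document

        // Visible usage: plug in a new positioning system and tune its features.
        posim.insertPosSys(new PositioningSystemWrapper() { });
        posim.setFeature("Ekahau", "accuracy", Integer.valueOf(15)); // cut overhead
        System.out.println(xml);
    }
}
```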
6 Conclusions
The widespread diffusion of several heterogeneous positioning systems pushes towards the adoption of widely accepted standards to provide location information. The already proposed JSR-179 tries to address the issues raised by positioning system heterogeneity, by hiding positioning systems behind a standardized API. However, it does not address the crucial issue of dynamically and flexibly integrating several positioning systems at a time, with full access to fine-grained control features. The paper proposes the original translucent PoSIM approach: our middleware permits controlling integrated positioning systems in both a transparent and a non-transparent way, respectively fitting simple and smart LBS requirements. In particular, the paper focuses on similarities and differences between the PoSIM API and the standard JSR-179 one, by pointing out how PoSIM significantly extends JSR-179 capabilities, while mimicking its API to facilitate and accelerate adoption. The encouraging results already obtained in the PoSIM project are stimulating further related research activities. We are extending the middleware openness by including an additional wrapper for our original Bluetooth-based positioning system (at the moment, the PoSIM prototype includes wrappers for GPS and Ekahau). Moreover, we are extending the set of pre-defined criteria, filtering rules, and triggering events, to fit the personalization requirements of most common LBSs by simply requesting developers to select the integration/control
strategies to apply.
Acknowledgements
Work supported by the MIUR FIRB WEB-MINDS and the CNR Strategic IS-MANET projects.
Bibliography
[1] G. Chen, D. Kotz [2000] A Survey of Context-Aware Mobile Computing Research. Dartmouth College Technical Report TR2000-381, http://www.cs.dartmouth.edu/reports/, 2000.
[2] Ekahau. http://www.ekahau.com.
[3] J. Hightower, G. Borriello [2001] Location Systems for Ubiquitous Computing. Computer, Vol. 34, No. 8, Aug. 2001, pp. 57-66.
[4] J. Hightower, G. Borriello [2001] Location Sensing Techniques. UW CSE 01-07-01, University of Washington, Department of Computer Science and Engineering, Seattle, WA, July 2001.
[5] J. Nord, K. Synnes, P. Parnes [2002] An Architecture for Location Aware Applications. 35th Hawaii Int. Conf. on System Sciences, Hawaii, USA, Jan. 2002.
[6] Y. Hosokawa, N. Takahashi, H. Taga [2004] A System Architecture for Seamless Navigation. Int. Conf. on Distributed Computing Systems Workshops (MDC), Tokyo, Japan, Mar. 2004.
[7] M. Spanoudakis, A. Batistakis, I. Priggouris, A. Ioannidis, S. Hadjiefthymiades, L. Merakos [2003] Extensible Platform for Location Based Services Provisioning. Int. Conf. on Web Information Systems Engineering Workshops, Rome, Italy, Dec. 2003.
[8] G. Coulouris, H. Naguib, K. Samugalingam [2002] FLAME: An Open Framework for Location-Aware Systems. Int. Conf. on Ubiquitous Computing, Goteborg, Sweden, Sept.-Oct. 2002.
[9] Y. Chen, X.Y. Chen, F.Y. Rao, X.L. Yu, Y. Li, D. Liu [2004] LORE: An Infrastructure to Support Location-Aware Services. IBM Journal of Research & Development, Vol. 48, No. 5/6, Sept./Nov. 2004.
[10] A. Ranganathan, J. Al-Muhtadi, S. Chetan, R. Campbell, M. D. Mickunas [2004] MiddleWhere: A Middleware for Location Awareness in Ubiquitous Computing Applications. ACM/IFIP/USENIX Int. Conf. on Middleware, Toronto, Ontario, Canada, Oct. 2004.
[11] J. Agre, D. Akenyemi, L. Ji, R. Masuoka, P. Thakkar [2002] A Layered Architecture for Location-based Services in Wireless Ad Hoc Networks. IEEE Aerospace Conf., Big Sky, Montana, USA, 2002.
[12] J. Hightower, B. Brumitt, G. Borriello [2003] The Location Stack: A Layered Model for Location in Ubiquitous Computing. IEEE Work. on Mobile Computing Systems and Applications, Callicoon, NY, USA, 2003.
[13] D. Graumann, W. Lara, J. Hightower, G. Borriello [2003] Real-world Implementation of the Location Stack: The Universal Location Framework. IEEE Work. on Mobile Computing Systems and Applications, Monterey, CA, USA, 2003.
[14] JSR-179. http://www.jcp.org/aboutJava/communityprocess/final/jsr179.
[15] C. di Flora, M. Ficco, S. Russo, V. Vecchio [2005] Indoor and Outdoor Location Based Services for Portable Wireless Devices. Int. Conf. on Distributed Computing Systems Workshops (SIUMI), Columbus, Ohio, USA, 2005.
[16] Jess. http://herzberg.ca.sandia.gov/jess/.
A MultiAgent System for Personalized Press Reviews
Andrea Addis, Giancarlo Cherchi, Andrea Manconi, Eloisa Vargiu University of Cagliari Piazza d’Armi, I-09123, Cagliari, Italy {addis,cherchi,manconi,vargiu}@diee.unica.it
Abstract The continuous growth of Internet information sources, together with the corresponding volume of daily-updated contents, makes the problem of finding news and articles a challenging task. This paper presents a multiagent system aimed at creating personalized press reviews from online newspapers by progressively filtering information that flows from sources to the end user, so that only relevant articles are retained. First, newspaper articles are classified according to a high-level taxonomy that does not depend on a specific user. Then, a personalized classification is performed according to user needs and preferences. Moreover, an optional feedback provided by the user is exploited to improve the system precision and recall. The system is built upon a generic multiagent architecture that supports the implementation of personalized, adaptive and cooperative multiagent systems aimed at retrieving, filtering and reorganizing information in a web-based environment. Experimental results show that the proposed approach is effective in the given application task.
1 Introduction
Nowadays, the World Wide Web offers a growing amount of information and data coming from different and heterogeneous sources. Unfortunately, it is very difficult for Internet users to select contents according to their personal interests, especially when contents are frequently updated (e.g., news, newspaper articles, newswires, RSS feeds, and blogs). Supporting users in handling the enormous and widespread amount of web information is becoming a primary issue. To this aim, several online services have been proposed (let us cite, for example, Google News, http://news.google.com/, and PRESSToday, http://www.presstoday.com/). Unfortunately, they provide a personalization mechanism based on keywords, which is often inadequate to express what the user is really searching for. Moreover, users must often refine the results provided by the system by hand. In the literature, several approaches have been proposed to separately address information extraction and text categorization. As for information extraction, several tools have been proposed to better address the issue of generating wrappers for web data extraction [16]. Such tools are based on several distinct techniques, like declarative languages [6, 12], HTML structure analysis [7, 22], natural language processing [10, 24], machine learning [13, 15], data modeling [1, 21], and ontologies [9]. As for text categorization, several machine learning techniques have been applied [27]; among them, let us cite multivariate regression models [28], k-Nearest Neighbor (k-NN) classification [29], Bayesian probabilistic approaches [25], decision trees [17], artificial neural networks (ANNs) [26], symbolic rule learning [18], and inductive learning algorithms [4]. In this paper, we focus on the problem of generating press reviews by (i) extracting articles from Italian online newspapers, (ii) classifying them via text categorization according to user preferences, and (iii) providing suitable feedback mechanisms. In particular, we propose a multiagent system tailored to this specific task. The motivation for adopting a multiagent system lies in the fact that a centralized classification system may be quickly overwhelmed by a large and dynamic document stream, such as daily-updated online news [11]. Furthermore, the Internet is intrinsically a distributed system and offers the opportunity to take advantage of distributed computing paradigms and distributed
2 http://www.presstoday.com/
A. Soro, G. Armano and G. Paddeu (eds.) Distributed Agent-Based Retrieval Tools c 2006 Polimetrica International Scientific Publisher Monza/Italy
A MultiAgent System for Personalized Press Reviews
69
The electronic edition of this book is not sold and is made available in free access. Every contribution is published according to the terms of “Polimetrica License B”. “Polimetrica License B” gives anyone the possibility to distribute the contents of the work, provided that the authors of the work and the publisher are always recognised and mentioned. It does not allow use of the contents of the work for commercial purposes or for profit. Polimetrica Publisher has the exclusive right to publish and sell the contents of the work in paper and electronic format and by any other means of publication. Additional rights on the contents of the work are the author’s property.
knowledge resources for classification. ¿From a conceptual point of view, the proposed system creates press reviews in separate steps at different levels of granularity, whereas from a technological point of view, the system has been built upon the PACMAS architecture [2], giving rise to a personalized, adaptive, and cooperative multiagent system. The remainder of the paper is organized as follows: Section 2 illustrates the multiagent system from both the abstract and the concrete perspectives. Section 3 discusses preliminary experimental results. Section 4 draws conclusions and points to future work.
2 The Proposed MultiAgent System
In this section, we present a multiagent system suitably tailored to the creation of personalized press reviews. From a conceptual point of view, the system is organized into three logical layers, each devoted to a specific task; from a technological point of view, the system has been built upon the PACMAS architecture [2], giving rise to a personalized, adaptive, and cooperative multiagent system.
2.1 The Abstract Architecture
Automatically generating personalized press reviews involves three main activities: (i) extracting the required information, (ii) classifying it according to user preferences, and (iii) providing suitable feedback mechanisms to improve the overall performance. Furthermore, personalization and adaptation should be taken into account, so that users can set their preferences in advance and provide feedback while the system is running. Figure 1 shows a generic architecture able to perform these activities. In the following, each activity is illustrated, with particular emphasis on text categorization.

Figure 1: The Proposed Architecture.

Information Extraction. The information extraction module extracts data from web sources through specialized wrappers. In general, given a web source S, a specific wrapper $W_S$ must be implemented, able to map each web page $P_S$, designed according to the constraints imposed by S, to a suitable description O, which contains the relevant data in a structured form, such as title, author(s), text content, and figure captions. Thus, $W_S$ actually represents a mapping function $W_S \colon P_S \to O$. Note that O is not specifically tailored to S since, at least in principle, all articles extracted from the web by any $W_S$ must have the same structure. Currently, two kinds of wrappers have been implemented, depending on the supported Internet sources: the HTML/XHTML wrapper and the RSS wrapper. The former extracts information by directly parsing Internet pages in HTML format. Let us point out that HTML is often badly formed and thus needs ad-hoc algorithms to be parsed correctly. In particular, the process of extracting data from HTML pages typically involves two steps: (i) learning the page structure and (ii) performing structured data extraction. The first step, currently supervised, allows the wrapper to detect the tags containing objects in the set O. The second step consists of applying the mapping function to populate the corresponding data repository. The latter kind of wrapper extracts information from online newspaper articles in RSS (Really Simple Syndication) format. It is worth pointing out that the RSS format makes the mapping function $W_S \colon P_S \to O$ easy to implement since, for each field in O, a corresponding RSS tag exists, which makes processing the pages very simple.
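To make the mapping $W_S \colon P_S \to O$ concrete, here is a minimal sketch of how an RSS wrapper might populate the source-independent description O. It is our illustration rather than the authors' code: the field names of Article and the exact tag-to-field correspondence are assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Article:
    """Source-independent description O: every wrapper emits this same structure."""
    title: str
    authors: str
    content: str
    source: str

def rss_wrapper(rss_text: str, source: str) -> list:
    """Mapping function W_S: each field of O is read from a corresponding RSS tag."""
    root = ET.fromstring(rss_text)
    articles = []
    for item in root.iter("item"):  # one <item> element per article
        articles.append(Article(
            title=item.findtext("title", default=""),
            authors=item.findtext("author", default=""),
            content=item.findtext("description", default=""),
            source=source,
        ))
    return articles
```

An HTML/XHTML wrapper would expose the same output structure, but would first locate the learned tags in the parsed page before applying the mapping.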
Text Categorization. The text categorization module progressively filters the information that flows from external sources (i.e., online newspapers) to the end user by retaining only the relevant articles. First, newspaper articles are classified according to a high-level taxonomy, which is independent of the specific user. Then, a personalized classification is performed according to user needs and preferences. Being interested in classifying newspaper articles, we adopted the taxonomy proposed by the International Press Telecommunications Council (http://www.iptc.org/); a fragment is depicted in Figure 2.
Figure 2: A fragment of the adopted taxonomy (in Italian and English).

For each entry of the taxonomy, a corresponding classifier, devised to perform text categorization, has been trained by resorting to state-of-the-art algorithms that implement the k-NN technique (in its "weighted" variant [5]). The choice of this particular technique stems from the fact that it is very robust with respect to noisy data. Given a taxonomy, in our view, there are two ways of combining classifiers: "horizontal" and "vertical". The former (i.e., "horizontal" composition) occurs in accordance with the typical linguistic interpretation of the logical connectives "and", "or", and "not". In fact, the user is typically not directly concerned with "generic" topics that coincide with classes of the given taxonomy. Rather, a set of arguments of interest can be obtained by composing such generic topics with suitable logical operators (i.e., and, or, and not). For instance, a user might be interested in being kept informed about all articles that involve both defense and government. This "compound" topic can be dealt with by composing the defense and the government classifiers. (The possibility of resorting to other, more specific, solutions is left to the knowledge engineer in charge of maintaining the taxonomy: if s/he deems that the user's interests are too difficult to obtain through composition, the alternative is to train a specific classifier.) It is clear that, in text categorization, the most important connective is "and", since the remaining ones can easily be dealt with after giving a suitable semantics to it. Hence, let us concentrate on how to cope with an "and-based" composition of classifier outputs. In the proposed system, we adopted a rather general soft boolean perspective, in which the combination is evaluated using P-norms (with the p parameter set to 5). Besides, a representation based on P-norms appears to be adequate to implement basic feedback techniques such as Rocchio's [23], which can be used to progressively modify the parameters of the P-norm "query" according to the feedback provided by the user.
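The paper does not spell out the P-norm formulas. The sketch below follows the standard extended boolean (soft boolean) model, which we assume is the intended semantics, with the stated p = 5:

```python
def pnorm_and(scores, p=5.0):
    """Soft-boolean AND over classifier outputs in [0, 1]:
    close to 1 only if all component scores are high."""
    n = len(scores)
    return 1.0 - (sum((1.0 - s) ** p for s in scores) / n) ** (1.0 / p)

def pnorm_or(scores, p=5.0):
    """Soft-boolean OR: high if at least one component score is high."""
    n = len(scores)
    return (sum(s ** p for s in scores) / n) ** (1.0 / p)

def pnorm_not(score):
    return 1.0 - score

# A "defense AND government" compound topic over two classifier outputs:
# relevance = pnorm_and([defense_score, government_score])
```

Rocchio-style feedback can then adjust the parameters of such a query incrementally as the user accepts or rejects articles.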
The latter (i.e., "vertical" composition) exploits the ability of a pipeline of classifiers to progressively filter out non-relevant information. In the proposed system, particular care has been taken in using pipelines to limit the number of false negatives (FN), as a user may accept being notified about an article that is actually not relevant but, on the other hand, would be disappointed if the system disregarded an input that is actually relevant. This behavior can be imposed in different ways, the simplest being to lower the threshold used to decide whether an input is relevant or not (recall that wk-NN classifiers are used, which return a value in [0, 1]). In a typical text categorization system this operation may have a negative impact on false positives (FP), i.e., augment their presence. This phenomenon can be depicted by resorting to ROC-curve analysis, in which the behavior of a classifier is reported in terms of precision (P) and recall (R) according to a moving threshold that discriminates between positive and negative instances. The ROC curve gives us information about the discrimination capability of a classifier with respect to a given set of inputs. If the selected test set is statistically significant, the ROC curve actually provides information about the overall discrimination ability of the classifier and/or about the separability of the input space by means of the features selected for representing inputs. Since the given domain is typically affected by noise (i.e., it is very difficult to come up with a description able to enforce a good separation between relevant and non-relevant inputs), moving the decision threshold in either direction typically affects both FN and FP. In particular, the attempt to lower FN produces the effect of augmenting FP. To reduce the impact of this latter, unwanted, effect, we exploited the presence of a taxonomy of classifiers, which requires, for an input to be acknowledged as relevant, that it be in fact acknowledged by all classifiers in the corresponding pipeline(s). Furthermore, we expect most articles to be non-relevant to the user, the ratio between negative and positive examples being very high (a typical order of magnitude is $10^2$ to $10^3$). Unfortunately, this aspect has a very negative impact on the precision of the system. On the other hand, combining classifiers allows this negative effect to be reduced, in the best case exponentially with respect to the number of classifiers that occur in the combination. Experimental results confirm this hypothesis, although the actual impact of combination is not as high as the theoretical one, due to the existing correlation between the classifiers actually involved in the combination. To assess how much a pipeline of classifiers can counteract the lack of equilibrium between non-relevant and relevant articles, let us denote the normalized confusion matrix associated with a classifier as

$$\begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix}$$

where the first and the second index represent the actual "nature" of the input and the way it has been classified, respectively. For instance, $c_{01}$ represents (an estimate of) the probability of classifying as relevant (1) an input that is in fact non-relevant (0).
If the set of inputs submitted to a classifier contains $n_0$ non-relevant inputs and $n_1$ relevant inputs, the (expected) confusion matrix that characterizes the process is

$$\begin{pmatrix} n_0 \cdot c_{00} & n_0 \cdot c_{01} \\ n_1 \cdot c_{10} & n_1 \cdot c_{11} \end{pmatrix}$$

Hence:

$$P = \frac{TP}{TP + FP} = \left(1 + \frac{FP}{TP}\right)^{-1} = \left(1 + \frac{c_{01}}{c_{11}} \cdot \frac{n_0}{n_1}\right)^{-1} \qquad (1)$$

$$R = \frac{TP}{TP + FN} = \left(1 + \frac{FN}{TP}\right)^{-1} = \left(1 + \frac{c_{10}}{c_{11}}\right)^{-1} \qquad (2)$$
To study how a taxonomy of classifiers affects the overall capability of classifying inputs, some simplifying assumptions have been made, which do not properly represent the real world but help to understand the basic underlying mechanisms. In particular, having to deal with a pipeline of k classifiers linked by an is-a relationship, let us assume, for the sake of simplicity, that (i) each classifier in the pipeline has the same (normalized) confusion matrix and (ii) the classifiers are independent. Under these simplifying assumptions, a classification can be seen as a pseudo-random process, which takes as input a set of article descriptions and outputs their classification, which necessarily fulfills, on average, the requirements imposed by the corresponding confusion matrix. It can easily be verified that the effect of using a pipeline of k classifiers on precision and recall is:

$$P^{(k)} = \frac{TP^{(k)}}{TP^{(k)} + FP^{(k)}} = \left(1 + \frac{FP^{(k)}}{TP^{(k)}}\right)^{-1} = \left(1 + \frac{c_{01}^{k}}{c_{11}^{k}} \cdot \frac{n_0}{n_1}\right)^{-1} \qquad (3)$$

$$R^{(k)} = \frac{TP^{(k)}}{TP^{(k)} + FN^{(k)}} = \left(1 + \frac{FN^{(k)}}{TP^{(k)}}\right)^{-1} = \left(1 + \frac{c_{10}}{1 - c_{11}} \cdot \frac{1 - c_{11}^{k}}{c_{11}^{k}}\right)^{-1} \qquad (4)$$
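To see the filtering effect numerically, the following sketch simply evaluates Equations (3) and (4). The confusion-matrix values are illustrative assumptions of ours, while the ratio $n_0/n_1 = 10^2$ mirrors the operating conditions reported in Section 3:

```python
def pipeline_precision(c01, c11, k, ratio):
    """Eq. (3): precision of a pipeline of k identical, independent classifiers,
    where ratio = n0 / n1 (non-relevant over relevant inputs)."""
    return 1.0 / (1.0 + (c01 ** k / c11 ** k) * ratio)

def pipeline_recall(c10, c11, k):
    """Eq. (4): recall of the same pipeline."""
    return 1.0 / (1.0 + (c10 / (1.0 - c11)) * (1.0 - c11 ** k) / c11 ** k)

# Illustrative classifier: c01 = 0.10 (FP rate), c10 = 0.05, hence c11 = 0.95.
for k in (1, 2, 3):
    print(k, pipeline_precision(0.10, 0.95, k, ratio=100),
             pipeline_recall(0.05, 0.95, k))
```

With these numbers, precision climbs from roughly 0.09 at k = 1 to roughly 0.90 at k = 3, while recall only drops from 0.95 to about 0.86, which is exactly the trade-off the equations predict.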
The equations above show that an imbalance between positive and negative examples (which is the usual case in text categorization problems) can be suitably dealt with by keeping FN (i.e., $c_{10}$) low and by exploiting the filtering effect of classifiers in a pipeline. The former aspect affects the recall, whereas the latter augments the precision according to the number of levels of the given taxonomy. As already pointed out, the above relations have been obtained under simplifying assumptions. Nevertheless, our preliminary results show that their validity is maintained in practice as well, provided that a low degree of correlation holds among the classifiers in a pipeline. As a final comment on "vertical" composition, let us point out that the strength assigned by a pipeline of classifiers to a relevant input (i.e., one deemed relevant by all classifiers in the pipeline) is the minimum value in [0, 1] received by the input along the pipeline.

User's Feedback. The user's feedback module deals with any feedback optionally provided by the end user. So far, two simple though effective solutions have been implemented and evaluated, based on neural and k-NN technology. The former solution consists of training an ANN with a set of examples classified as "of interest to the user" by the second layer. When the amount of feedback provided by the user exceeds a given threshold, the ANN is trained again, after updating the previous training set with the information provided by the user. The latter solution consists of a k-NN classifier. When a non-interesting article is flagged by the user, it is immediately embedded in the training set of the k-NN classifier. A suitable check performed on this training set after inserting the negative example triggers a procedure entrusted with keeping the number of negative and positive examples balanced. In particular, when the ratio between negative and positive examples exceeds a given threshold, some examples are randomly extracted from the set of "true" positive examples and embedded in the training set. The solution based on k-NN technology has proven slightly better than the one based on ANNs, although this result should be validated by further and more detailed experiments.
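A minimal sketch of the k-NN feedback policy just described, under our assumptions about the data representation and the balancing threshold:

```python
import random

class FeedbackKNN:
    """Maintains the training set of the feedback k-NN classifier,
    keeping negative and positive examples balanced."""

    def __init__(self, true_positives, max_neg_pos_ratio=1.0):
        self.true_positives = list(true_positives)  # pool of "true" positive examples
        self.training_set = []                      # (example, label) pairs
        self.max_ratio = max_neg_pos_ratio          # assumed threshold value

    def add_negative_feedback(self, article):
        # An article flagged as non-interesting is embedded immediately.
        self.training_set.append((article, 0))
        self._rebalance()

    def _rebalance(self):
        negatives = sum(1 for _, label in self.training_set if label == 0)
        positives = sum(1 for _, label in self.training_set if label == 1)
        # If negatives outweigh positives beyond the threshold, randomly draw
        # examples (possibly with repetition) from the "true" positive pool.
        while self.true_positives and negatives > self.max_ratio * positives:
            self.training_set.append((random.choice(self.true_positives), 1))
            positives += 1
```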
2.2 The Concrete Architecture
The functionalities of the abstract architecture, described in the previous section, have been implemented by exploiting the PACMAS architecture. This section briefly recalls the PACMAS architecture and describes the customizations made to create press reviews.

The PACMAS Architecture. PACMAS, which stands for Personalized Adaptive and Cooperative MultiAgent System, is a generic multiagent architecture aimed at retrieving, filtering, and reorganizing information according to the users' interests. The PACMAS architecture (depicted in Figure 3) encompasses four main levels (i.e., information, filter, task, and interface), each associated with a specific role. The communication between adjacent levels is achieved through suitable middle agents, which form a corresponding mid-span level.

Figure 3: The PACMAS Architecture.

At the information level, agents are entrusted with extracting data from the information sources. Each information agent is associated with one information source, playing the role of wrapper. At the filter level, agents select the information deemed relevant to the users and cooperate to prevent information overload and redundancy. Two filtering strategies can be adopted: generic and personal. The former applies the same rules to all users, whereas the latter is customized for a specific user. At the task level, agents arrange data according to users' personal
needs and preferences. In a sense, they can be considered the core of the architecture. In fact, they are devoted to achieving users' goals by cooperating with each other and adapting themselves to changes in the underlying environment. At the interface level, a suitable interface agent is associated with each different user interface. In fact, a user can generally interact with an application through several interfaces and devices (e.g., PC, PDA, mobile phone). At the mid-span levels, agents establish communication between requesters and providers. Agents at these architectural levels can be implemented as matchmakers or brokers, depending on the specific application [8]. PACMAS agents can be personalized, adaptive, and cooperative, depending on their specific role. As for personalization, an initial user profile is provided to represent the user's interests. The information about the user profile is stored by the interface agents and flows up from the interface level to the other levels through the mid-span levels. In particular, agents belonging to the mid-span levels (i.e., middle agents) take care of handling synchronization and avoiding potential inconsistencies in the user profile. As for adaptation, different techniques may be employed depending on the application to be developed. In particular, the user's behavior is tracked during the execution of the application to support explicit feedback, in order to improve her/his profile as well as the system performance. As for cooperation, agents at the same level exchange messages and/or data to achieve common goals, according to the requests made by the user. Cooperation is implemented in accordance with the following modes: centralized composition, pipeline, and distributed composition (see Figure 4). In particular: (i) centralized compositions can be used to integrate different capabilities, so that the resulting behavior actually depends on the combination activity; (ii) pipelines can be used to distribute information at different levels of abstraction, so that data can be increasingly refined and adapted to the user's needs; and (iii) distributed compositions can be used to model cooperation among the involved components aimed at processing interlaced information.
Figure 4: Agents Connections.

PACMAS for Text Categorization. In this section, we describe how the generic architecture has been customized to implement a prototype of the system devoted to creating personalized press reviews. In particular, we illustrate how each level of PACMAS supports the implementation of the proposed application. The prototype has been implemented using JADE [3] as the underlying framework.

Information Level. The agents at this architectural level perform information extraction. In particular, in the current implementation, a set of agents wraps Italian online newspapers containing articles in RSS and HTML format. Information agents are not personalized, not adaptive, and not cooperative (shortly, $\bar{P}\bar{A}\bar{C}$). Personalization is not supported at this level, since the information sources are the same for every user. Adaptation is also not supported, since we assume that the information sources are invariant for the system. Cooperation is not supported either, since each agent retrieves information from a different source, and each information source has a specific role in the chosen application.

Filter Level. At the filter level, a population of agents processes the information coming from the information level through suitable filtering strategies, preparing it for the text categorization phase. First, a set of filter agents removes all non-informative words
such as prepositions, conjunctions, pronouns, and very common verbs by using a standard stop-word list. After removing the stop words, a set of filter agents applies a stemming algorithm [20] to remove the most common morphological and inflectional suffixes from all the words. Then, for each class, a set of filter agents selects the features relevant to the classification task according to the information gain method [19]. Filter agents are not personalized, not adaptive, and cooperative (shortly, $\bar{P}\bar{A}C$). Personalization is not supported at this level, since all the adopted filtering strategies are user-independent. Adaptation is also not supported, since the adopted strategies do not change during the agents' activities. Cooperation is supported, since filter agents cooperate continuously to perform the filtering activity according to the pipeline mode.
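The following sketch mirrors this filtering pipeline. The stop-word list and the suffix stripper are tiny stand-ins of ours (the system uses a standard list and the stemmer of [20]), while the information gain function follows the usual definition for a binary class:

```python
import math
from collections import Counter

STOP_WORDS = {"il", "la", "di", "e", "che", "un", "per", "in"}  # stand-in list

def stem(word):
    # Stand-in for the stemming algorithm [20]: strip a few common suffixes.
    for suffix in ("mente", "zione", "are", "ere", "ire", "i", "e", "a", "o"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def filter_terms(text):
    """Stop-word removal followed by stemming, as in the filter-agent pipeline."""
    return [stem(t) for t in text.lower().split() if t not in STOP_WORDS]

def information_gain(term, docs, labels):
    """IG(t) = H(C) - P(t)*H(C|t) - P(not t)*H(C|not t), binary labels [19]."""
    def entropy(ys):
        h = 0.0
        for count in Counter(ys).values():
            p = count / len(ys)
            h -= p * math.log2(p)
        return h
    ig = entropy(labels)
    for subset in ([y for d, y in zip(docs, labels) if term in d],
                   [y for d, y in zip(docs, labels) if term not in d]):
        if subset:
            ig -= len(subset) / len(labels) * entropy(subset)
    return ig
```

For each class, the features kept are then simply the terms with the highest information gain.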
Task Level. At the task level, a population of agents has been developed to perform the text categorization activities. The high-level text categorization is performed by a k-NN classifier embedded in a corresponding task agent. All the involved agents have been trained to recognize a specific class. Given a document in the test set, each agent, through its embedded classifier, ranks its nearest neighbors among the training documents according to a distance measure, and uses the most frequent category of the k top-ranking neighbors to predict the categories of the input document. Each task agent is also devoted to measuring the classification accuracy according to the confusion matrix [14]. To perform the personalized text categorization activity, some agents at this architectural level take users' preferences into account by automatically composing topics. Composition is performed through the cooperation of the involved task agents. For instance, the "compound topic" defense and government is obtained through the cooperation of the task agent expert in recognizing defense and the task agent expert in recognizing government. Task agents are personalized, not adaptive, and cooperative (shortly, $P\bar{A}C$). Personalization is supported, since task agents perform the classification taking into account users' needs and preferences. Adaptation is not supported, since the adopted strategies do not change during the agents' activities. Cooperation is supported, since the agents have to interact with each other to achieve the classification task.
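A sketch of one common weighted k-NN variant that could sit inside a task agent; the vector-space representation and cosine similarity are our assumptions, but the output is, as stated in Section 2.1, a relevance value in [0, 1] suitable for thresholding and pipeline combination:

```python
import math

def cosine(u, v):
    """Similarity between two term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def wknn_score(doc, training, k=15):
    """training: (term-weight dict, label) pairs, label in {0, 1}.
    Returns the similarity-weighted vote of the k nearest neighbours."""
    sims = sorted(((cosine(doc, x), y) for x, y in training), reverse=True)[:k]
    total = sum(s for s, _ in sims)
    return sum(s for s, y in sims if y == 1) / total if total else 0.0
```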
Figure 5: Interface for the news classifying system.
Interface Level. At the interface level, agents interact with the user. In the current implementation, agents and users interact through a suitable graphical interface that runs on a PC (see Figure 5). Interface agents are also devoted to handling the user profile and propagating it through the intervention of middle agents. By interacting with the interface agent, the user sets her/his preferences. In particular, s/he can set preferences regarding the information sources and the topics of the required press review. Moreover, the interface agent also deals with the feedback provided by the user. Interface agents are personalized, adaptive, and not cooperative (shortly, $PA\bar{C}$). Personalization is supported to allow each user to customize her/his interface. Adaptation is supported to take the user's feedback into account. Cooperation is not supported by agents belonging to this architectural level.
3 Experimental Results
To evaluate the effectiveness of the system, several tests have been conducted using articles from the following online newspapers: www.repubblica.it, www.corriere.it, and www.espressonline.it. First, experiments have been performed to set the optimal parameters for the training activity. In particular, the experiments have been conducted by changing the number of documents forming the dataset, the percentage of positive examples, and the number of features to be considered. As for the training activity (for k-NN classifiers, training typically consists of storing known examples), task agents have been provided with a set of newspaper articles previously classified by domain experts. For each item of the taxonomy, a set of 200 documents has been selected to train the corresponding classifier. Subsequently, to validate the training procedure of the first classification step, the system has been fed with the same dataset used in the training phase, showing an accuracy (i.e., $(TP+TN)/(TP+TN+FP+FN)$) between 96% and 100%.

Then, random datasets for each category have been generated to test the performance of the system. The global accuracy for fourteen categories is summarized in Figure 6. On average, the accuracy of the system is 80.05%.

Figure 6: Accuracy of the system.

It is worth pointing out that, in this specific task, the accuracy should not be directly considered a measure of the system performance. On the other hand, it becomes important since the accuracy of a classifier (evaluated on a balanced test set, i.e., with a number of negative examples that does not differ much from the number of positive ones) indirectly affects the recall, under the hypothesis that classifiers are (dynamically) combined using logical operators and/or (statically) combined according to the given taxonomy (in the latter case, they in fact form a pipeline). As for the composition of taxonomy items, once the task agents had been trained, several experiments were performed to test the performance of the system. The existence of a taxonomy of classifiers and the ability to resort to combinations of classifiers made it possible to reach a recall comparable with state-of-the-art systems by composing three classifiers (on average) within a taxonomy of depth three. This result has been obtained by imposing a ratio between negative and positive instances of $10^2$, with an accuracy between 90% and 95% (on average) measured on single classifiers tested with an equal number of negative and positive examples. The categorization capability has been evaluated using several newspaper articles previously classified by hand by domain experts. The results are very encouraging and show that the proposed approach is effective in the given application task, also taking into account that the system can be improved in several important respects.
4 Conclusions and Future Work
In this paper, we presented a multiagent system aimed at creating personalized press reviews from online newspapers by progressively filtering the information that flows from sources to the end user, so that only relevant articles are retained. From an abstract point of view, the proposed system is organized into three logical layers, each devoted to a specific task (i.e., information extraction, text categorization, and user's feedback). From a concrete point of view, the proposed system has been built upon the PACMAS architecture, which supports the implementation of personalized, adaptive, and cooperative multiagent systems in a web-based environment. Experimental results are encouraging and show that the proposed approach is effective in the task of creating personalized press reviews. As for future work, we are implementing a new release of the user's feedback module within a framework based on evolutionary computation. Furthermore, a graphical interface to compose items of the taxonomy is under study.
Acknowledgements
We would like to thank Ivan Manca for participating in the development of the system.
Bibliography

[1] B. Adelberg, NoDoSE: a tool for semi-automatically extracting structured and semistructured data from text documents, Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 1998, pp. 283–294.

[2] Giuliano Armano, Giancarlo Cherchi, Andrea Manconi, and Eloisa Vargiu, PACMAS: a personalized, adaptive, and cooperative multiagent system architecture, Workshop dagli Oggetti agli Agenti, Simulazione e Analisi Formale di Sistemi Complessi (WOA 2005), November 2005.

[3] Fabio Bellifemine, Agostino Poggi, and Giovanni Rimassa, Developing multi-agent systems with JADE, Seventh International Workshop on Agent Theories, Architectures, and Languages (ATAL 2000), 2000.

[4] W.W. Cohen and Y. Singer, Context-sensitive learning methods for text categorization, Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Hans-Peter Frei, Donna Harman, Peter Schäuble, and Ross Wilkinson, eds.), ACM Press, New York, US, 1996, pp. 307–315.

[5] S. Cost and S. Salzberg, A weighted nearest neighbor algorithm for learning with symbolic features, Machine Learning 10 (1993), 57–78.

[6] V. Crescenzi and G. Mecca, Grammars have exceptions, Information Systems 23(8) (1998), 539–565.

[7] V. Crescenzi, G. Mecca, and P. Merialdo, RoadRunner: towards automatic data extraction from large web sites, Proceedings of the 27th International Conference on Very Large Data Bases, 2001, pp. 109–118.

[8] K. Decker, K. Sycara, and M. Williamson, Middle-agents for the Internet, Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI 97), 1997, pp. 578–583.

[9] D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle, Y.K. Ng, D. Quass, and R.D. Smith, Conceptual-model-based data extraction from multiple-record web pages, Data & Knowledge Engineering 31(3) (1999), 227–251.

[10] Dayne Freitag, Machine learning for information extraction in informal domains, Ph.D. thesis, Carnegie Mellon University, 1998.

[11] Y. Fu, W. Ke, and J. Mostafa, Automated text classification using a multi-agent framework, Proceedings of JCDL, 2005, pp. 157–158.

[12] J. Hammer, H. García-Molina, S. Nestorov, R. Yerneni, M. Breunig, and V. Vassalos, Template-based wrappers in the TSIMMIS system, 1997.

[13] C.N. Hsu and M.T. Dung, Generating finite-state transducers for semi-structured data extraction from the web, Information Systems 23(8) (1998), 521–538.

[14] R. Kohavi and F. Provost, Glossary of terms, Special issue on applications of machine learning and the knowledge discovery process, Machine Learning 30 (1998), no. 2/3, 271–274.

[15] Nicholas Kushmerick, Wrapper induction: efficiency and expressiveness, Artificial Intelligence 118(1-2) (2000), 15–68.

[16] A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, and J.S. Teixeira, A brief survey of web data extraction tools, SIGMOD Record 31(2) (2002), 84–93.

[17] David D. Lewis and Marc Ringuette, A comparison of two learning algorithms for text categorization, Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US), 1994, pp. 81–93.

[18] I. Moulinier, G. Raskinis, and J.-G. Ganascia, Text categorization: a symbolic approach, Proceedings of the 5th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US), 1996, pp. 87–99.

[19] Y. Yang and J.O. Pedersen, Feature selection in statistical learning of text categorization, Proceedings of the 14th International Conference on Machine Learning (ICML-97), 1997.

[20] M.F. Porter, An algorithm for suffix stripping, Program 14 (1980), no. 3, 130–137.

[21] B.A. Ribeiro-Neto, A.H.F. Laender, and A.S. da Silva, Extracting semi-structured data through examples, Proceedings of CIKM, 1999, pp. 94–101.

[22] Arnaud Sahuguet and Fabien Azavant, Building intelligent web applications using lightweight wrappers, Data & Knowledge Engineering 36(3) (2001), 283–316.

[23] Fabrizio Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys 34 (2002), no. 1, 1–47.

[24] Stephen Soderland, Learning information extraction rules for semi-structured and free text, Machine Learning 34(1-3) (1999), 233–272.

[25] Konstadinos Tzeras and Stephan Hartmann, Automatic indexing based on Bayesian inference networks, Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval (Pittsburgh, US) (Robert Korfhage, Edie Rasmussen, and Peter Willett, eds.), ACM Press, New York, US, 1993, pp. 22–34.

[26] E. Wiener, J.O. Pedersen, and A.S. Weigend, A neural network approach to topic spotting, Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US), 1995, pp. 317–332.

[27] Y. Yang, An evaluation of statistical approaches to text categorization, Information Retrieval 1 (1999), no. 1/2, 69–90.

[28] Y. Yang and C.G. Chute, An example-based mapping method for text categorization and retrieval, ACM Transactions on Information Systems 12 (1994), no. 3, 252–277.

[29] Yiming Yang and Xin Liu, A re-examination of text categorization methods, Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, US) (Marti A. Hearst, Fredric Gey, and Richard Tong, eds.), ACM Press, New York, US, 1999, pp. 42–49.
On the Design of a Transcoding Component for a Dynamic Adaptation of Multimedia Resources
Enrico Lodolo†, Fabrizio Marcolini†, Silvia Mirri∗
† Dipartimento di Elettronica, Informatica e Sistemistica - Università di Bologna, Viale Risorgimento 2 - 40126 Bologna, Italia
∗ Dipartimento di Scienze dell'Informazione - Università di Bologna, Mura Anteo Zamboni 7 - 40127 Bologna, Italia
[email protected], [email protected], [email protected]

Abstract Nowadays we face the need to serve users who adopt multiple and different devices in various locations. This environment is implicitly heterogeneous and poses new challenges on how the information and the services of interest are delivered to the users. To cope with this problem, the common approaches identify a limited number of device classes, based on the level of computational power and memory resources. Following this approach, modern organizations usually develop different contents separately for what is, in the end, the same information or service. Novel middleware components should provide content adaptation at finer-grained levels, so that users may access information and services using any device, from anywhere, at any time. Proper abstraction of the multimedia resources is needed in order to transform the contents suitably and provide different presentations to the users. Apart from the heterogeneity of devices, the proposals should also take into account the preferences and requirements of the users, including their capabilities. In this paper we claim that a unified approach to the adaptation of multimedia resources has to consider all these different aspects together with the dynamic nature of the context. We therefore present the design of the resulting Transcoding Component and its first implementation prototype.
1 Introduction
The increasing number of different available devices, such as Personal Digital Assistants (PDAs), handheld computers, and smart phones (among other mobile and multimodal devices), is giving users the opportunity to access the Internet in a new dimension. As a result, the need to provide users with heterogeneous content delivery is emerging [8], as well as a new generation of pervasive computing devices. Typically, offering suitable multimedia services to such mobile and multimodal devices requires re-evaluating multimedia provision in an anytime, anywhere, anyone and any device dimension. In this context, the most common and useful approach is content transcoding and adaptation. Different solutions are available and, from an architectural point of view, content adaptation approaches are mainly client-side, server-side, proxy-based, and the emerging service-based ones. Client-side solutions can be classified into two main categories with different behaviors: either the client receives several formats and selects the most appropriate one, or the client computes an optimized version from a standard one. In the server-side approach, the server receives the user profile and sends back to the client the right content format, which can be computed dynamically or selected from a set of available versions. In the proxy-based approach, a proxy acts as a Transcoding Component between the clients and the server. Basically, the proxy selects and performs the appropriate content adaptation before sending the server's reply to the client. The service-based approach is the newest one, and its main aim is to better distribute roles and computational load [1] [2] [6], in order to obtain a modular architecture that allows new transcoding services to be added as needed. User and device profiles play a main role in most of these transcoding solutions. Different solutions can be used to profile both device capabilities and user preferences [7] [9], but these standards currently lack adequate support; hence, it is necessary to develop specific solutions to overcome this limitation. In this context, this paper presents the design and the implementation of a Transcoding Component which adapts multimedia resources in a service-based, suitable, and dynamic way. Our work is part of a wider project concerning the design and development of a middleware whose main aim is to provide suitable services and contents for new user needs arising from emerging mobile and ubiquitous scenarios, where users require access to traditional services and contents in an anywhere, anytime, anyone and any device way. Hence, the entire project proposes solutions to address and support mobility, content adaptation and context awareness, multimodality, and service composition. In particular, the Transcoding Component is devoted to content adaptation, considering multimedia resources, several different transformations, and context identification needs in order to profile users and devices, as described in the following section. The remainder of the paper is organized as follows. Section 2 outlines the main design issues, while Section 3 presents the transcoder architecture. Section 4 shows technical implementation details and Section 5 describes some experimental scenarios. Finally, Section 6 concludes the paper and outlines plans for future work.
2 Design Issues
The design of the Transcoding Component has taken into account different research issues, which converge towards our objectives. We can summarize them as follows: content description, transformations, and user and device profiling. Regarding content description, first of all it is necessary to divide multimedia resources into primary types: image, video, audio, and text. A descriptor is assigned to every resource in order to describe its characteristics. In particular, some examples of the information described are (but are not limited to):

• format, language, font family, and number of characters for textual resources;
• format and bit-rate for audio resources;
• format, spatial resolution, and color resolution for image resources;
• format and frame resolution for video resources.

Moreover, semantic information (such as the subject shown in an image) and transformation constraints (such as the minimum resolution accepted for a video) can be included in these descriptors. Different resource description standards are available (such as the MIME type description), and certainly one of the most significant is MPEG-7 [5]. In this initial phase of our work we decided to adopt a simpler but effective manner of describing the multimedia resources, using MIME type descriptions. Multimedia transformation is obviously one of the main issues of our Transcoding Component design. We can assume several transcoding services, operating both in the intramedia and in the intermedia dimension. Intramedia transcodings transform the same medium into a different format, size, etc., while intermedia transcodings convert a media type into a different one. Transcoding service descriptors are needed to define transcoding capabilities. In order to describe the user's operating context it is necessary to create and consider an appropriate allocation profile. This profile is composed of a set of properties describing users, their preferences, their requirements, and the capabilities of the devices used. Some of the information included in the allocation profile is (but is not limited to): display spatial and color resolution, supported audio formats (and related bit-rates), available bandwidth, and output user preferences. In device profiling, the most significant standards are User Agent Profile (UAProf [7]), Composite Capabilities/Preferences Profile (CC/PP [9]), and the FIPA (Foundation for Intelligent Physical Agents) Device Ontology [3]. In order to manage the description of device capabilities in an effective way we have used WURFL (the Wireless Universal Resource File Library DB [10]). WURFL is a free and open source project that provides information about device capabilities; it currently describes about 5,000 different devices, identifying 300 characteristics. WURFL represents one of the most complete and up-to-date XML-based open databases of mobile devices. At the moment no standards are available to describe user preferences and requirements, hence we have defined a preferences table (Table 1) in order to identify which content fruition modality users prefer. In particular, through this table we point out the equivalent alternative media for a specific type of media, in order to transform multimedia resources on the basis of user preferences.
Input   1st Output   2nd Output   3rd Output
Text    Audio        Text         -
Image   Image        Text         Audio
Audio   Audio        Text         -
Video   Image        Video        Text

Table 1: Equivalent alternative media and preferred output types
Each user can manually modify his/her preferences, while the Transcoding Component can change them in some limited and particular cases (e.g. when a user is accessing the system while driving a car, the first output preference becomes audio). The information contained in Table 1 is a first example of the user preferences considered; we are similarly working on the definition of more detailed user preferences and requirements.
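To make the role of Table 1 concrete, the following minimal Java sketch (hypothetical names, not the authors' actual code) encodes the table as a lookup structure mapping each primary input type to its ordered list of preferred equivalent outputs:

```java
import java.util.List;
import java.util.Map;

// Hypothetical encoding of Table 1: for each primary input type, the
// ordered list of preferred equivalent output types.
public class MediaPreferences {

    public enum MediaType { TEXT, IMAGE, AUDIO, VIDEO }

    // Default preference table, as in Table 1; a per-user copy could be
    // reordered when the operating context changes (e.g. promoting AUDIO
    // while the user is driving).
    private static final Map<MediaType, List<MediaType>> DEFAULTS = Map.of(
        MediaType.TEXT,  List.of(MediaType.AUDIO, MediaType.TEXT),
        MediaType.IMAGE, List.of(MediaType.IMAGE, MediaType.TEXT, MediaType.AUDIO),
        MediaType.AUDIO, List.of(MediaType.AUDIO, MediaType.TEXT),
        MediaType.VIDEO, List.of(MediaType.IMAGE, MediaType.VIDEO, MediaType.TEXT)
    );

    // Preferred output types for a given input type, in order of preference.
    public static List<MediaType> preferredOutputs(MediaType input) {
        return DEFAULTS.get(input);
    }
}
```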
3 Architecture
In this section we describe the Transcoder architecture, its subcomponents and the way they operate. Since the component is placed inside a middleware architecture, we also give a brief description of the other components, namely the Service Repository, the Profile Manager and Plans Warehouse, and the Plan Execution Engine and Deliverer.
Figure 1: Transcoder Architecture
The system architecture is depicted in Figure 1. The Transcoder is composed of three subcomponents:
• the Descriptors Extractor and Scanner;
• the Adaptation Evaluator;
• the Core Orchestrator.
The goal of the Transcoder is to produce the Transcoding Plan, which is a description of the steps necessary to transform the input multimedia resources so that they are compatible with the current profile.
Descriptors Extractor and Scanner. This subcomponent receives the resources that have to be delivered to the user devices. The resources may arrive either accompanied by descriptors or without any explicit added information. In the first case the Scanner part of the subcomponent reads the resource descriptors, validates them and passes them to the Adaptation Evaluator. In the second case the Extractor part of the subcomponent builds descriptors from the resources alone and passes its results to the Scanner, which then forwards the validated descriptors to the Evaluator. We point out that the Extractor can operate on a resource even when it is accompanied by a descriptor, so that the provided information can be verified. In any case, the results of this subcomponent are given as input to the Adaptation Evaluator.
Adaptation Evaluator. This subcomponent executes the transcoding algorithm, which is at the base of the whole transcoding process. To perform its work, the Adaptation Evaluator relies heavily on the last subcomponent, the Core Orchestrator. The Evaluator takes as input the results of the Descriptors Extractor and Scanner. Using the provided descriptors, it evaluates whether the corresponding resources need transcoding, based on the profile properties. The component maintains data on the possible relations between resource descriptor attributes and profile properties. In the transcoding algorithm, resource descriptor attributes are compared with the corresponding profile properties. If they match, no transcoding operation is needed; otherwise the Core Orchestrator is invoked with the original resource descriptor as input and the wanted resource descriptor as output. The latter is the original resource descriptor with the non-matching attributes substituted by the corresponding profile properties (a minimal sketch of this step is given after the component list below).
Based on the responses from the Core Orchestrator, the Adaptation Evaluator returns the Transcoding Plan.
Core Orchestrator. This subcomponent is in charge of finding suitable transformations for the resource descriptors given by the Adaptation Evaluator. It accomplishes its task by querying the repository, looking for services that allow passing from the input resource descriptor to the output resource descriptor given by the Evaluator. If the Orchestrator can find a suitable composition of services, it builds the corresponding subplan and returns it to the Evaluator; otherwise it returns an indication that no transformation was found. In the following section we describe an implementation of this subcomponent.
We briefly introduce the other components of the middleware:
• Service Repository. Transcoding services are registered and catalogued in the Service Repository. Every service is described by a set of properties and is searchable through the specification of property-value pairs. As described above, it is the Core Orchestrator subcomponent of the Transcoder that deals with the Service Repository.
• Profile Manager and Plans Warehouse. This component is in charge of managing the user and device profiles and of storing the plans currently in execution or already executed. Every user and device profile is described by a set of properties and is accessible by the Adaptation Evaluator subcomponent of the Transcoder. When the latter produces a Transcoding Plan, the plan is registered in the Plans Warehouse. This component also contains the logic necessary to trigger new requests to the Transcoder when the modifications occurring in a user or device profile exceed predefined thresholds.
• Plan Execution Engine and Deliverer. This component takes the Transcoding Plan as input, executes it on the original multimedia resources and delivers the adapted multimedia resources to the device or devices currently adopted by a user.
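As an illustration of the matching step just described, the following minimal sketch reduces descriptors and profiles to flat attribute maps (an assumption made purely for illustration; names are hypothetical and this is not the actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical rendering of the Adaptation Evaluator's matching step:
// attributes shared between the resource descriptor and the profile are
// compared; mismatches are substituted with the profile values, yielding
// the "wanted" descriptor to be handed to the Core Orchestrator.
public class AdaptationEvaluatorSketch {

    // Returns null when the resource already matches the profile
    // (no transcoding needed); otherwise returns the wanted descriptor.
    public static Map<String, String> wantedDescriptor(
            Map<String, String> resource, Map<String, String> profile) {
        Map<String, String> wanted = new HashMap<>(resource);
        boolean needsTranscoding = false;
        for (Map.Entry<String, String> attr : resource.entrySet()) {
            String required = profile.get(attr.getKey());
            if (required != null && !required.equals(attr.getValue())) {
                wanted.put(attr.getKey(), required); // substitute the mismatching attribute
                needsTranscoding = true;
            }
        }
        return needsTranscoding ? wanted : null;
    }
}
```

For example, a resource described by {format=image/bmp} evaluated against a profile requiring {format=image/jpeg} yields the wanted descriptor {format=image/jpeg}, which the Core Orchestrator then tries to reach through a composition of services.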
4 Implementation
In this initial phase we have concentrated our development efforts on the central part of the Transcoding Component, the Core Orchestrator,
so in this section we present it together with the implementation details of the main related issues.
4.1 Multimedia resource descriptions
We decided to use the MIME standard [4] for the resource descriptions. A MIME type is represented by a simple text string composed of a primary type and a secondary type (or subtype). Well-known examples of MIME types are text/xml, text/plain, audio/wav and image/jpeg. The primary type operates at the highest abstraction level and states whether the corresponding resource is text, audio, image or video. The subtype describes the specific format of the multimedia resource: for example, a text resource can be in XML or plain format. Multimedia resources can, in general, have a much richer description. We have already analyzed the MPEG-7 standard [5] and will soon use the derived resource descriptions in the implementation prototype. In any case, MIME types describe the primary features of the resources, which are part of every other resource description system, including MPEG-7, and so represent a good starting option for the prototype development.
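The two-level structure of a MIME string can be handled with a few lines of code; the helper below is our own illustration and not part of the prototype:

```java
// Illustrative helper splitting a MIME string such as "text/xml" into
// its primary type and subtype, as described above.
public final class MimeType {
    public final String primary;  // e.g. "text", "audio", "image", "video"
    public final String subtype;  // e.g. "xml", "plain", "wav", "jpeg"

    public MimeType(String mime) {
        int slash = mime.indexOf('/');
        if (slash <= 0 || slash == mime.length() - 1) {
            throw new IllegalArgumentException("Not a MIME type: " + mime);
        }
        this.primary = mime.substring(0, slash);
        this.subtype = mime.substring(slash + 1);
    }

    @Override
    public String toString() {
        return primary + "/" + subtype;
    }
}
```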
4.2 Transcoding service descriptions
We used XML descriptors in which, among other things, figure the input and output MIME types, i.e. the information necessary for the search of compositions. As described above, the storage and cataloguing of the transcoding services take place in the Service Repository.
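As an illustration, a transcoding service descriptor of this kind might look like the following fragment; the element names are our own guess, since the paper does not give the actual schema:

```xml
<!-- Hypothetical descriptor of a transcoding service registered in the
     Service Repository; only the input/output MIME types are prescribed
     by the text, the remaining elements are illustrative. -->
<transcodingService id="xml2plain">
  <inputMimeType>text/xml</inputMimeType>
  <outputMimeType>text/plain</outputMimeType>
  <qos>0.9</qos> <!-- used as the arc weight in the transcoding graph -->
</transcodingService>
```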
4.3 Transcoding Graph
The component maintains a transcoding graph. The transcoding graph is a structure that represents the possible transformations of all the various multimedia contents, based on the transcoding services present in the repository. The elements that compose the graph are the following:
• the graph vertices, which correspond to the MIME types that identify the multimedia resources;
• the graph arcs, which correspond to the transcoding services and transform the original MIME type into a different destination MIME type.
The transcoding graph contains only directed and weighted arcs. More precisely, it is a multigraph, because this structure allows more than one arc in the same direction between the same origin and destination vertices. Each of these arcs is weighted, and the weight measures the Quality of Service (QoS) of the corresponding transcoding operation. The transcoding graph is initially empty and is expanded every time the Transcoder identifies an unknown configuration, not previously solved.
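A minimal sketch of such a multigraph is given below; names are hypothetical and MIME types are kept as plain strings (the prototype, as described in the next subsections, persists the graph through an entity bean):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the transcoding multigraph: vertices are MIME type
// strings; arcs are directed, weighted by QoS, and several arcs may
// connect the same pair of vertices.
public class TranscodingGraph {

    public static class Arc {
        final String from, to;   // origin and destination MIME types
        final String serviceId;  // the transcoding service this arc represents
        final double qos;        // arc weight: quality of the transcoding
        Arc(String from, String to, String serviceId, double qos) {
            this.from = from; this.to = to; this.serviceId = serviceId; this.qos = qos;
        }
    }

    // Adjacency lists keyed by origin MIME type; the graph starts empty.
    private final Map<String, List<Arc>> adjacency = new HashMap<>();

    public void addVertex(String mimeType) {
        adjacency.putIfAbsent(mimeType, new ArrayList<>());
    }

    public void addArc(String from, String to, String serviceId, double qos) {
        addVertex(from);
        addVertex(to);
        adjacency.get(from).add(new Arc(from, to, serviceId, qos));
    }

    public boolean containsVertex(String mimeType) {
        return adjacency.containsKey(mimeType);
    }

    public List<Arc> arcsFrom(String mimeType) {
        return adjacency.getOrDefault(mimeType, List.of());
    }
}
```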
4.4 Core Orchestrator implementation details
The Core Orchestrator component is implemented in a J2EE environment, in particular using the open source application server JBoss. The prototype is composed of three main parts, which are three different beans:
• a stateless session bean (EJB), which determines the most suitable transcoding service composition;
• an entity bean (EJB), through which the transcoding graph is stored persistently in the database;
• an MBean, which initializes the graph as an empty graph when JBoss starts up, and which is also used to inspect the graph state and modify it at run-time through the JMX console provided by JBoss.
In addition to these three beans, other utility classes were implemented, among them those necessary for handling the XML service descriptors. When the EAR package is deployed, the beans start up and the MBean creates the empty transcoding graph, which is stored in the database. The active component of the Transcoder is the stateless session bean, which receives and manages the requests for the transcoding of the multimedia resources.
4.5 Transcoding operations
When the component receives a transcoding request, it gets the input MIME type (that of the resource to transcode) and the output MIME type (that of the wanted transcoded resource). The bean retrieves the transcoding graph from the database and first checks whether the two MIME types forming the request are present among its vertices. If this verification is successful, the bean checks whether a transformation path from the origin MIME type to the destination MIME type exists. If such a path is present in the graph, it is returned as the output of the bean. If the path is not present in the transcoding graph, the bean queries the repository, with the aim of finding possible compositions of services that allow the desired transformation from the starting MIME type to the destination MIME type. New vertices and new arcs are thus added to the transcoding graph, which is initially empty but grows over time on the basis of the repository responses, which allow adding vertices corresponding to the MIME types and transcoding arcs to the graph. So, when the graph does not contain the vertices corresponding to the two MIME types involved or the arcs corresponding to the necessary transformations, a transcoding request causes the following steps (sketched in code after this list):
1. first, the destination MIME type is explored by querying the repository, possibly with a corresponding expansion of the graph. For this step we chose a breadth-first exploration, because it privileges a small number of transcoding blocks;
2. then, the shortest path between the graph vertex corresponding to the starting MIME type and the graph vertex corresponding to the destination MIME type is computed. The algorithm chosen for this task is Dijkstra's algorithm.
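The two steps can be rendered as follows, reusing the TranscodingGraph sketch above; the repository interface and the mapping of QoS to an arc cost (here 1 - qos, assuming qos in [0, 1]) are our own assumptions for illustration:

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

// Sketch of the request handling described above (illustrative names).
public class CoreOrchestratorSketch {

    // Assumed repository operation: all arcs (services) whose output
    // is the given MIME type.
    public interface ServiceRepository {
        List<TranscodingGraph.Arc> servicesWithOutput(String outputMime);
    }

    private final TranscodingGraph graph;
    private final ServiceRepository repository;

    public CoreOrchestratorSketch(TranscodingGraph graph, ServiceRepository repository) {
        this.graph = graph;
        this.repository = repository;
    }

    // Returns the ordered list of transcoding arcs forming the plan,
    // or null when no composition can be found.
    public List<TranscodingGraph.Arc> findPlan(String inputMime, String outputMime) {
        List<TranscodingGraph.Arc> path = shortestPath(inputMime, outputMime);
        if (path == null) {                 // unknown configuration:
            exploreBackwardsFrom(outputMime);            // step 1: graph expansion
            path = shortestPath(inputMime, outputMime);  // step 2: Dijkstra
        }
        return path;
    }

    // Breadth-first exploration starting from the destination MIME type;
    // every repository answer adds vertices and arcs to the graph.
    private void exploreBackwardsFrom(String outputMime) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add(outputMime);
        while (!frontier.isEmpty()) {
            String target = frontier.poll();
            if (!visited.add(target)) continue;
            for (TranscodingGraph.Arc arc : repository.servicesWithOutput(target)) {
                graph.addArc(arc.from, arc.to, arc.serviceId, arc.qos);
                frontier.add(arc.from);
            }
        }
    }

    // Dijkstra's algorithm; (1.0 - qos) is used as a non-negative arc cost
    // so that higher-QoS services are preferred.
    private List<TranscodingGraph.Arc> shortestPath(String source, String target) {
        if (!graph.containsVertex(source) || !graph.containsVertex(target)) return null;
        Map<String, Double> dist = new HashMap<>();
        Map<String, TranscodingGraph.Arc> prev = new HashMap<>();
        PriorityQueue<Map.Entry<String, Double>> queue =
                new PriorityQueue<>(Comparator.comparingDouble(entry -> entry.getValue()));
        dist.put(source, 0.0);
        queue.add(Map.entry(source, 0.0));
        while (!queue.isEmpty()) {
            Map.Entry<String, Double> current = queue.poll();
            if (current.getValue() > dist.getOrDefault(current.getKey(), Double.MAX_VALUE)) {
                continue; // stale queue entry
            }
            for (TranscodingGraph.Arc arc : graph.arcsFrom(current.getKey())) {
                double alt = current.getValue() + (1.0 - arc.qos);
                if (alt < dist.getOrDefault(arc.to, Double.MAX_VALUE)) {
                    dist.put(arc.to, alt);
                    prev.put(arc.to, arc);
                    queue.add(Map.entry(arc.to, alt));
                }
            }
        }
        if (!prev.containsKey(target)) return null;
        LinkedList<TranscodingGraph.Arc> path = new LinkedList<>();
        for (String v = target; prev.containsKey(v); v = prev.get(v).from) {
            path.addFirst(prev.get(v));
        }
        return path;
    }
}
```

With the initially empty graph of the scenario in the next section, findPlan("text/xml", "text/plain") triggers the repository exploration, while a subsequent findPlan("audio/wav", "text/plain") is answered from the cached graph alone.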
5 Experimental Scenario
In this section we describe the Transcoder functionalities through the presentation of some use cases. Let us first consider the situation in which a user wants to get the news from an RSS feed on his/her mobile phone in SMS format. As noted above, the Transcoder uses MIME types for the description of the resources. The information we need to
deliver to the user resides in the RSS feed, and we know that the MIME type of such a resource is text/xml. We also know that the user wants to get the news in SMS format, which corresponds to the text/plain MIME type. Thus, the Transcoder needs to look for a transcoding service that has text/xml as its input and text/plain as its output. In the following we illustrate how the Transcoder performs its work (based on the algorithms described in the previous section). The Transcoder is deployed on the JBoss application server as an EAR package. During its initialization the component creates the empty graph and stores it in the database. The Transcoder receives a request for the transformation of a resource with MIME type text/xml into a resource with MIME type text/plain. The graph is currently empty and the Transcoder cannot serve the request, therefore it starts the breadth-first search, querying the repository. The repository receives the following query: find all the services whose output MIME type is text/plain. Let us suppose that the Repository contains two such services: one with input MIME type text/xml and the other with input MIME type audio/wav. In its response to the Transcoder, the Repository provides the descriptors of the two services.
Figure 2: Transcoding Graph
The Transcoder creates three vertices in the transcoding graph and two transcoding arcs (as depicted in Figure 2):
• the vertices corresponding to the MIME types text/plain, text/xml and audio/wav;
• the following transcoding arcs: text/xml to text/plain and audio/wav to text/plain, with the related QoS.
The Transcoder stores the descriptors received from the Repository for the two services found. The Dijkstra algorithm is then executed with source MIME type text/xml and destination MIME type text/plain. The algorithm returns the path between the two MIME types, which is composed of one transcoding arc, and the component then creates the Transcoding Plan.
The Transcoder then receives another request, which needs the transformation from audio/wav to text/plain. This request could derive, for example, from a situation in which a user wants to chat with another user, the first using a mobile phone microphone, the second using a text chat client. The current graph is the one shown in Figure 2. In this situation the Transcoder executes the following steps:
1. The Transcoder finds in the graph the two MIME types corresponding to the request and then executes the Dijkstra algorithm between the source vertex audio/wav and the destination vertex text/plain.
2. A path is returned by the algorithm and the Transcoder creates the corresponding Transcoding Plan. The transcoding graph remains unchanged and the Repository is not queried.
Let us now suppose that the Transcoder receives another request, with input MIME type text/plain and output MIME type audio/wav. The request could be the complement of the previous one, where the user with the text chat client wants to send his/her messages to the user with the mobile phone. Let us see how the Transcoder works in this situation:
1. the component finds both MIME types in the graph, but executing the Dijkstra algorithm results in no path found;
2. thus the Transcoder queries the Repository, exploring the output MIME type audio/wav through the breadth-first search.
Figure 3: Transcoding Graph
3. Let us suppose that the Repository returns two service descriptors, one with input MIME type audio/mp3 and the other with input MIME type text/plain.
4. The Transcoder updates the transcoding graph, adding one vertex, corresponding to the MIME type audio/mp3, and two transcoding arcs: audio/mp3 to audio/wav and text/plain to audio/wav. The resulting graph is shown in Figure 3.
5. The Transcoder then executes the Dijkstra algorithm between the two vertices text/plain and audio/wav.
6. The algorithm returns the path and the Transcoder creates the corresponding Transcoding Plan.
We point out that the Transcoder could then receive a request for the transformation of an audio/mp3 resource into a text/plain resource. Without querying the Repository, the Transcoder would find a possible transformation composed of two services, the first from audio/mp3 to audio/wav and the second from audio/wav to text/plain. We conclude this experimental scenario by noting that similar use cases apply to situations in which other resource types, such as images and videos, are also present.
6 Conclusions and Future Work
In this paper we presented our solution to the problem of content presentation. Starting from the consideration of three main aspects, namely contents, transformations and context, as first-class design issues, we designed and developed an architecture that has the potential to greatly increase the accessibility of the information and services of interest to users. The solution exploits two key considerations. The first is the need for fine-grained adaptations, accomplished by means of detailed descriptions and a newly designed transcoding algorithm. The second is the necessity for the adaptation engine to dynamically follow the changes in the context, accomplished by means of thresholds on profile property modifications, which trigger new requests to the Transcoding Component only when needed. Through the implementation prototype we demonstrated the feasibility of our approach, presenting the central part of the component, the Core Orchestrator. We described the algorithms used by this subcomponent and an experimental scenario that exercises its capabilities. Based on the results obtained, we will proceed further with the prototype implementation. First, we will use resource descriptors in MPEG-7 format. Then, we will exploit the functionalities of the Profile Manager. These important extensions to the prototype will allow us to fully reach the stated goals. Afterwards, performance and scalability will be evaluated in a clustered J2EE environment.
Bibliography
[1] Berhe G., Brunie L., Pierson J. [2004] Modeling service-based multimedia content adaptation in pervasive computing. In Proceedings of the First Conference on Computing Frontiers, 2004, pp. 60-69.
[2] Colajanni M., Lancellotti R. [2004] System architectures for Web content adaptation services. IEEE Distributed Systems Online.
[3] FIPA [2002] FIPA Device Ontology Specification. Available at http://www.fipa.org/specs/fipa00091/index.html.
[4] IETF [1996] RFC 2046, MIME Part Two: Media Types. November 1996.
[5] ISO/IEC JTC 1/SC 29/WG 11 (MPEG) [2004] MPEG-7 (Multimedia Content Description Interface).
[6] Jannach D., Leopold K., Timmerer C., Hellwagner H. [2004] Toward Semantic Web Services for Multimedia Adaptation. Lecture Notes in Computer Science, Springer-Verlag, 2004, pp. 641-652.
[7] Open Mobile Alliance (OMA) [2003] User Agent Profile (UAProf). Available at http://www.openmobilealliance.org/tech/profiles/index.html.
[8] Smith J. R., Mohan R., Li C. [1998] Transcoding Internet Content for Heterogeneous Client Devices. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), May 1998.
[9] W3C Recommendation [2004] Composite Capability/Preference Profiles (CC/PP): Structure and Vocabularies 1.0. Available at http://www.w3.org/TR/2004/REC-CCPP-struct-vocab-20040115.
[10] WURFL [2006] Wireless Universal Resource File Library. Available at http://wurfl.sourceforge.net/.
RoamBlog: Outdoor and Indoor Geo-blogging Enhanced with Contextual Service Provisioning for Mobile Internet Users
Vincenzo Pallotta, Amos Brocco, Dominique Guinard, Pascal Bruegger, Pedro de Almeida
Department of Computer Science, University of Fribourg, Switzerland
{[email protected]}
Abstract
RoamBlog is a mobile ubiquitous computing architecture that enables roaming Internet users equipped with mobile networked devices, such as Smartphones, PDAs and JavaPhones, to continuously interact with geo-contextualized services. Within the RoamBlog architecture we propose a novel user interface for mobile ubiquitous computing systems, the Kinetic User Interface (KUI), in which physical motion along geographical locations is recognized as an "embodied" pointing device. We present a few usage scenarios for which applications have been designed on top of the RoamBlog architecture.
1 Introduction
Internet and mobile computing technology are changing the way users retrieve information and interact with media and services. Personal computing in its original form is disappearing and is taking the shape
of a global distributed system made of several interconnected heterogeneous computational components with different degrees of mobility and computing power. This paradigm shift supports the vision of what is commonly called ubiquitous computing [24]. As stated in [23], there are two important issues concerning ubiquitous computing: context and scale. Context typically includes information about the actual usage situation of a computing device by a specific type of user in a distributed system, such as the spatio-temporal location of the device, the type of device, the user and his/her preferences, and essentially all additional information that can be captured by environmental sensors physically or logically connected to the device and its user. Scale refers to how suitable a given type of device is for carrying out certain tasks or, better, to the computational requirements and constraints for executing a given task. These types of information can be used to change the behaviour of an application running on a mobile device in order to adapt its use to different situations. Context-awareness is also crucial in mobile service provisioning and orchestration [14]. Service providers and service composition platforms can take advantage of additional information to provide flexible and adaptive services [4]. There are many ways of capturing and representing context, as well as techniques for reasoning about it. This topic is certainly important, but it lies outside the scope of this paper; the interested reader may refer to [3].
An important dimension of context is the user's actual spatio-temporal information, which includes time-stamped geographical coordinates of the user's location and motion parameters (e.g. speed, acceleration, direction). This type of information plays an essential role in mobile ubiquitous computing applications, as it provides not only a context but also an additional input modality, which can be used alone or in combination with other ordinary, possibly unobtrusive and natural input modalities (e.g. voice, gesture recognition). In the simplest cases, users can provide as input their current geographical location, obtained either from an outdoor localization system (e.g. GPS or wireless cell triangulation) or from a physical presence/motion detection system in indoor situations (e.g. "smart buildings" equipped with RFID readers). Mobile ubiquitous applications usually exploit geographical information to let users access geo-contextualized services, publish multimedia data captured during their roaming (e.g. geo-tagging and geo-blogging) and interact with remote applications by motion.
In this paper we describe the basic features of the RoamBlog architecture, on top of which a wide class of mobile ubiquitous computing applications can be designed and implemented. The types of applications we are targeting are those requiring full user interaction with a number of integrated, geographically contextualized services. Interaction can be achieved either by using ordinary user interfaces on mobile computing devices or through the recognition of the user's goals and intentions from physical motion, by means of what we call the Kinetic User Interface (KUI). Three motivating usage scenarios are also described, which demand the particular features of the architecture we propose.
1.1 Motivations and Goals
Our involvement in the research area of "Mobile and Ubiquitous Services and Applications" is motivated by the increasing availability of wireless Internet connectivity (e.g. GPRS/UMTS, WiFi, EDGE) and by the upcoming availability of enhanced satellite navigation systems, such as Galileo1. The success of Internet "blogs" and the increasing availability of Internet-enabled mobile phones suggest applicative scenarios where mobile Internet users can not only access contextualized content and services but also produce them while roaming, without having to take care of all the details of the process (i.e. manually establishing an Internet connection, authenticating, typing text, and uploading media). In fact, RoamBlog allows the continuous tracking of mobile Internet users and the transfer of multimedia data, enriched with timestamps and geographical coordinates, to a remote server which might host, for instance, the users' blog pages. The RoamBlog architecture thus enables scenarios in which the easy authoring of spatio-temporally annotated web content becomes possible. For instance, by combining RoamBlog applications with "mashup" techniques2 on blog clients, readers can enjoy a truly immersive experience by suitably retrieving the author's data scattered over time and places, and replaying the author's roaming experience. Moreover, if real-time geographical tracking is enabled, the blog's readers can follow the authors and interact with them during their roaming directly from a web/desktop/mobile interface, either by following their paths on a three-dimensional map (e.g. Google Earth3) containing the geo-located
1 http://www.esa.int/esaNA/galileo.html
2 http://en.wikipedia.org/wiki/Mashup (web application hybrid)
3 http://www.earth.google.com
multimedia data, or by suggesting directions and targets, which are delivered to the roaming users in a suitable way (e.g. SMS, WAP, instant messaging, Voice over IP, ordinary phone calls).
Another type of scenario is geo-contextualized service provisioning. Services delivered on mobile devices can be either automatically invoked or geo-contextualized when explicitly requested. In the first case, when user motion or location is detected, such as entering a building or a room or passing by a given landmark, the motion is interpreted by the ubiquitous application as a context shift. In such a case, subscribed context-aware services are automatically invoked and delivered. Additionally, the application can perform inferences on the context change history and recognize implicit user goals and intentions. This type of scenario includes situations like guided tours, remote assistance in health institutions, smart home/building environments, etc.
Our goal is to elaborate the concept of the Kinetic User Interface as well as a middleware architecture to support it. The KUI is a user interface for ubiquitous computing applications where physical motion is used as the main input device. Motion is recognized by local or global positioning devices, and location information is made available to pervasive applications, which react as in a usual mouse-based desktop interface by selecting geo-located items, showing pop-up information and, more generally, triggering geo-contextualized actions.
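As a concrete illustration of the geo-annotated content described above, a blog entry produced while roaming could be modelled by a data structure like the following (hypothetical; the paper does not specify RoamBlog's actual schema):

```java
// Hypothetical geo-blog entry: a media item enriched with a timestamp
// and geographical coordinates, as produced by a roaming user.
public class GeoBlogEntry {
    private final String authorId;       // the roaming blog author
    private final long timestampMillis;  // when the media was captured
    private final double latitude;       // WGS84 coordinates, e.g. from GPS
    private final double longitude;
    private final String mediaUrl;       // the uploaded photo/audio/video

    public GeoBlogEntry(String authorId, long timestampMillis,
                        double latitude, double longitude, String mediaUrl) {
        this.authorId = authorId;
        this.timestampMillis = timestampMillis;
        this.latitude = latitude;
        this.longitude = longitude;
        this.mediaUrl = mediaUrl;
    }

    // Sorting entries by timestamp lets a reader "replay" the author's
    // roaming experience on a map client.
    public long getTimestampMillis() { return timestampMillis; }
    public double getLatitude()      { return latitude; }
    public double getLongitude()     { return longitude; }
    public String getMediaUrl()      { return mediaUrl; }
}
```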
1.2 Related Work
Outdoor and indoor motion tracking is an essential aspect of our work. Many commercial GPS tracking applications are already on the market, such as Trimble Outdoors4 or ESRI ArcPad5. Unfortunately, very few of them are integrated with Internet service provisioning or serve as user interfaces. An interesting example of a commercial indoor tracking system is provided by Ubisense6, which proposes both a hardware and a software solution. Ubisense allows users to wear a light smart-badge with high-accuracy indoor motion tracking, and provides APIs and visual tools for developing customized applications. On the application side, with the advent of Web 2.07, a great deal of applied research and development has recently been devoted to embedding
4 http://www.trimbleoutdoors.com/TrimbleOutdoors.aspx
5 http://www.esri.com/software/arcgis/about/arcpad.html
6 http://www.ubisense.net/
7 http://en.wikipedia.org/wiki/Web_2.0
Internet technology in everyday life, ranging from pure entertainment to critical applications such as healthcare, national security and the military (Cáceres et al., 2006). The rationale behind these efforts is to provide the mobile Internet user with great flexibility in authoring, publishing and retrieving information, as well as in accessing services that are relevant in a given situation. A remarkable example of a Web 2.0 mobile application for geo-tagging is SocialLight8, which allows the tagging of geographical locations with multimedia tags (i.e. shadow-tags). Roaming users can geo-tag a place either when they are physically present, using a mobile phone, or by attaching the tag on the SocialLight web page through a Google Maps mash-up. Compared to RoamBlog, SocialLight has similar design goals but fulfils them only partially. We therefore aim at a more general framework, such as the one recently proposed for Blogjects [5], which includes SocialLight-like applications as special cases. Although context-aware mobile (or nomadic) computing is still in its infancy, a number of interesting ad-hoc applications have been developed in recent years, mostly in the academic research domain (see [8] for a survey). Academic research projects such as Mobile Media Metadata9 [9] are among the first attempts at developing this new trend of pervasive mobile Internet technology, and we expect more similar projects to be developed in the near future, relying on pervasive positioning technology [20][2]. We observe, however, that the main focus in context-aware mobile computing is on adapting mobile applications to one particular context at a time, rather than on taking into account context shifts and context history. An interesting work in this alternative direction is Cyberguide [1], one of the few attempts at taking the user's motion into account (see also [21]). A tourist equipped with indoor (IR beacons) and outdoor (GPS) localization devices can receive relevant information on a PDA and feed a trip journal.
8 http://www.socialight.com
9 http://garage.sims.berkeley.edu/research.cfm#MMM
2 The Kinetic User Interface
We observe nowadays a great effort in porting ordinary (i.e. PC-based) user interfaces to mobile computing devices (e.g. PDAs, SmartPhones, media players). It is clear, however, that new types of interaction are
required in order to unleash the usability of ubiquitous computing applications. Most current context-aware mobile applications are based on small computer devices, but there is not much difference between a desktop PC and a PDA except for the size and responsiveness of the graphical interface. With the exception of sophisticated wearable interfaces equipped with multiple sensors and virtual reality rendering devices, input modalities for hand-held computers are limited to standard GUIs with additional handwriting and voice recognition. We believe that physical motion over the geographical dimension can itself be used as an additional input modality in context-aware mobile applications. We claim that geographical motion can be used not only as a source of contextual information but also as an input modality that reflects the user's goals and intentions, and we describe three scenarios where spatial motion plays an essential role. In other words, we would like to push the user to intentionally cause events by moving from one place to another, by following a certain path or by performing a given motion pattern.
We elaborate here the concept of the "Kinetic User Interface" (KUI) as a way of endorsing the ubiquitous computing vision and the new vision of Continuous Computing [19]. The KUI is neither a Graphical User Interface (GUI) nor a Tangible User Interface [13], but rather an interface through which motion, in terms of physical displacement within the environment, determines the execution of actions (e.g. service requests, database updates). In other words, motion is considered as an input for an application. Similarly to "hovering" the mouse over a desktop, the user can trigger events from the environment by just moving from one location to another, or by just "passing by" a given object or landmark. The user can "click" the physical space by executing actions/operations on physical objects, such as grabbing, touching, moving, juxtaposing, and operating portable devices through their own interfaces. Following a path or executing a pre-defined motion pattern can be related to mouse "gestures" and consequently trigger reactions by the system. In this way motion becomes a full-fledged interaction modality that can be afforded by the user, alone or in combination with other modalities.
Our approach builds on and extends the notion of context-awareness as defined in [10], where the focus is shifted from on-demand contextualized services only, to include context-triggered ones. Context-aware applications traditionally make use of a static snapshot of context with
the aim of contextualizing the service offered. In contrast, motion in a KUI represents the principal way of (embodied) interaction: context changes trigger the system's reactions directly, rather than only providing a source of complementary information used to deliver a service adapted upon the user's request. In particular, we focus on explicit and deliberate changes of context; in other words, the user is aware of what the current context is and decides to act in order to cause a reaction from the computational environment. For instance, a typical context change can be detected when a person moves from an indoor to an outdoor environment. To make a parallel with ordinary mouse-controlled user interfaces, moving the pointer on the screen is an intentional act recognized by the system that can trigger reactions specified by the application currently running on the computer. Additionally, the system might be able to recognize and make use of several other parameters of motion, like speed, acceleration, pauses, direction, etc. Although users might not be aware of which reactions will be caused by their moving, they are indeed aware that motion will be taken into account by the system. Furthermore, a mouse-based user interface usually shows predictable behaviour (i.e. hovering the mouse does not usually cause unexpected reactions from the system); being predictable is an essential feature of user interfaces, because it allows the user to feel progressively comfortable in using the device: a suitable amount of feedback is required to ease the user's understanding of the system (i.e. the pointer arrow is displayed according to mouse motion). In the case of body motion in a physical space, feedback cannot be given in the same way as for pointing devices on computer displays. Generally, we need to give back only the minimal amount of information required to inform users that their body motion has been recognized by the system, because the system should avoid interfering too much with the user's current activity. Nevertheless, a feedback control mechanism is necessary for other reasons, such as privacy: to grant a certain level of protection, users must be notified somehow when their presence and motion are being tracked. Indeed, they must always be given the possibility to stop the tracking, more or less in the same way as a user could stop using the mouse and keep using only the keyboard (which might in turn make the interaction with the computer very difficult, if not impossible).
Motion alone is not sufficient for carrying out a complete interaction with an "ambient" computer, in a similar way as it might just be
insufficient and unreliable to use a mouse without its buttons. Furthermore, relying only on motion can make the interface difficult to interact with, and misinterpretations of movements can lead to unexpected results. For this reason, the KUI should also support other interaction modes, such as voice, tangible objects or gesture recognition, in order to execute complete actions recognized by the system. For instance, it is conceivable that in a Smart Home environment the user asks the system to move a media stream being played in the room where he/she is currently located to another room to which he/she is moving (provided that the media can be played on suitable devices in both rooms), by just carrying a meaningful object (e.g. the remote control).
We distinguish two main types of physical motion detection:
Discrete motion detection: a single step of motion is detected, no matter what trajectory is followed between the start point and the endpoint. This is the case when moving from one room to another or passing by a landmark.
Continuous motion detection: the motion trajectory is reconstructed by the system by interpolating points obtained from periodic probes. This is the case, for instance, of a travelling vehicle tracked by GPS, or of a moving person spotted on a video camera stream. Several parameters can be recorded, such as speed, acceleration, direction, height, etc.
Services offered by the back-end application in reaction to events produced by the KUI can be divided into two classes:
Subscribed: the user has previously subscribed to a service and the service is activated when a change in the context is detected.
On-demand: the user makes a request and the system provides the service taking into account the current context.
Subscribed services are normally delivered as reactions to notifications of context change (callbacks), whereas on-demand services only require the ability to access context information (current location, time, etc.). As motion detection itself is not sufficient, and sometimes impractical, to interact with the application, an additional source of control can be provided by physical sensors (e.g. phidgets10) or system-recognized "smart" objects. Coupling different input sources with body motion gives the user additional control over the system: in the previous Smart Home example, carrying the remote control is interpreted as "dragging" the media being shown from one room and "dropping" it into another.
10 http://www.phidgets.com/
Moving without the remote control should not trigger any media-related "Drag&Drop" action. It is important to note that the "Drag&Drop" feature has to be programmed as part of the application's KUI, which then generates an event in response to the context change event (e.g. moving from one room to another with the remote control). The remote control object is what we call a "cursor", which is linked to objects that can be moved but cannot move by themselves. The remote control example also shows how our notion of context can describe the physical or logical relations between objects (e.g. containment, proximity, adjacency, ownership), and the fact that it is possible to conceive nested or linked context views (e.g. nested containers). The fact that a person is moving from one room to another constitutes a context; if the person is also carrying an object, this can be considered a particular sub-context of the former rather than an unrelated new context. Also, the co-presence of two or more objects in the same location can be used to logically bind the objects together. For instance, if a person (a cursor object) and a remote control (a dragged object) move into the same location, the system can infer a link between them. Conversely, the link can be broken if the cursor object moves out without the dragged object. As pointed out before, carrying a detectable object while moving can lead to a system reaction (corresponding to the notion of "dragging" in a mouse-based GUI).
It is worth noting that the KUI supports the "situated plan" view [22]. In fact, the application constantly monitors the user's behaviour (at least for the part of context that is submitted to the KUI manager) and does not follow a "rigid" plan. Rather, it is "flexible" and "adaptive" in the sense that it might assume an explicit overall goal communicated by the user, and it "collaborates" with the user towards its achievement. Actions that are recognized as not being part of the workflow related to the stipulated goal are simply ignored. Applications are responsible for selecting "relevant" actions. Specifically, actions are interpreted by the application and filtered according to contextual information (e.g. the action of stopping when there are no relevant nearby landmarks is filtered out).
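To make the cursor-based "Drag&Drop" interaction described above concrete, the following minimal Java sketch (a hypothetical API, not the actual KUI implementation) shows a listener that reacts to discrete motion events and triggers the media transfer only when the designated cursor object moves:

```java
import java.util.UUID;

// Hypothetical KUI callback for discrete motion events: fired when a
// tracked entity moves from one logical place to another.
interface MotionListener {
    void onPlaceChanged(UUID entity, String fromPlace, String toPlace);
}

// Reacts to motion of the "cursor" object (the remote control): moving
// it from one room to another "drags" the media being played; moving
// without it triggers no media-related action, as described above.
class MediaFollower implements MotionListener {
    private final UUID remoteControl;

    MediaFollower(UUID remoteControl) {
        this.remoteControl = remoteControl;
    }

    @Override
    public void onPlaceChanged(UUID entity, String fromPlace, String toPlace) {
        if (entity.equals(remoteControl)) {
            moveMediaPlayback(fromPlace, toPlace);
        }
    }

    private void moveMediaPlayback(String from, String to) {
        // Placeholder for the actual media hand-off between rooms.
        System.out.println("Dragging media from " + from + " to " + to);
    }
}
```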
3
RoamBlog Architecture
The RoamBlog architecture is a layered middleware for mobile ubiquitous computing applications based on standard enterprise application
server architectures, and it implements the KUI interface. RoamBlog allows the geographical contextualization of mobile services (e.g. a service requested by a user at a given time and place) as well as the integration of geographical information into back-end applications (e.g. database updates triggered by motion detection). Mobile devices carried by the roaming user can be connected to the RoamBlog architecture as either clients or servers, depending on the above two situations. Corresponding to the two types of usage, there are two mechanisms for exploiting and producing geographical information (coordinates):
• Pull: the coordinates are obtained from the proximity of an electronically tagged object (e.g. a badge) to a geo-referenced device (e.g. an RFID reader). The information read from the object is used to create a logical link between the user carrying the tagged object and the application that the user is currently using (provided that the reader is part of the application's infrastructure and the badge is bound to the user who is currently requesting the service). For instance, if a service is requested on a PDA just after an RFID badge has been detected, the service will be aware of the current location of the PDA and will be contextualized accordingly.
• Push: the coordinates are generated by the user's device (which knows its own location) and sent to the application. In this case, the geographical information is locally gathered (e.g. from a GPS receiver or a fixed-location RFID tag) and either sent as a parameter of a service invocation or used in a KUI interaction.
The KUI interface is integrated in the RoamBlog three-layered architecture (as shown in Figure 1). The Topology Layer adds an overlay structure to the physical world made of references (geographical points) or aggregate references (areas); this information is linked with localization objects in the real world such as fixed RFID antennas, GPS receivers, etc. The KUIdget Space Layer is an object-oriented framework that contains the set of context-carrying objects (KUIdgets) of the real world that the service provider is interested in; a KUIdget can reflect a real entity (i.e. persons, places, and physical objects) or be completely virtual (e.g. a geo-tagged media item). Each KUIdget is identified by a universally unique identifier (UUID), and can be linked
with a data structure in the Topology Layer to provide direct localization; indirect localization is obtained from other KUIdgets through chains of KUI relations (e.g. containment or proximity): for example, to find the location of the mobile phone of a user driving in a GPS-equipped car, it is possible to inherit the position from the containing cursor KUIdgets (in this case, the user inherits the position of the car, and the mobile phone that of the user). We also propose the use of weights so that the relationship and dependence between objects can be better managed and exploited: a high value means that there is a very strong location relationship between two objects, whereas a low value expresses a weaker relationship (the objects are loosely coupled), meaning that indirect positioning cannot be inferred reliably.

Figure 1: The three-layered RoamBlog architecture. Applications receive high-level events from the Activity layer, which interprets histories and patterns of events; the KUIdget Space layer manages KUIdget state, relations, and event filtering/aggregation (IKUIdget, IFixedKUIdget, IMobileKUIdget, IKUIdgetRelation); the Topology layer provides location primitives (points and areas, ILocationPrimitive) observed through agents such as GPS receivers and RFID antennas.

Changes in the relations and state (represented as object properties) of a KUIdget generate events that can be captured by related KUIdgets or by the upper layer, the Activity Layer, which manages higher-level semantic contexts and emits context change signals that are sent to the applications. Relations can be created either by the KUIdget's internal logic in response to events (or by fetching information from the lower level) or by upper layers (to reflect an inferred situation). For example, in the first case a containment relation is created when a user carrying an RFID badge enters a room equipped with an RFID antenna at the door. In the second case the user explicitly communicates his/her position to the application (for instance, using a mobile communication device), and the
system can relate him/her with the KUIdgets in range. The importance of keeping relations in the KUIdget Space Layer derives from the fact that context changes within a container KUIdget are projected onto all the contained KUIdgets, which, in turn, might themselves generate a context change event. This is important because an application may be aware only of the contained KUIdget's context dynamics and not of the containing one's, which is nonetheless responsible for providing context information to the inner one. RoamBlog also enables social interaction. When two users, both carrying a KUI-enabled device, find themselves close to each other or in the same place (e.g. the same room, the same train car, or attending the same event), and they have both subscribed to the same RoamBlog-based social network service, they can be alerted of their co-presence. Depending on the subscribed service, they can be prompted with relevant information and possibly get to know each other. The user might, for instance, specify filtering criteria, temporarily disable the subscription (e.g. do not disturb), or even request the server to locate other users who match certain criteria (e.g. find any Italian-speaking users subscribed to a specific service within a range of 1 km).
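As an illustration of indirect localization through weighted relations, the following sketch resolves a KUIdget's position by walking its containment chain until a directly localized KUIdget is found. The names, the 0-1 weight scale, and the threshold are our assumptions, not the RoamBlog implementation.

import java.util.*;

// Sketch of indirect localization through weighted KUI relations: a KUIdget
// without a direct position inherits it from its container, but only while
// every link in the chain is strong enough to be trusted.
public class IndirectLocalization {

    static final class Kuidget {
        final String id;
        double[] position;          // direct fix (e.g. from GPS), or null
        Kuidget container;          // e.g. phone -> user -> car
        double containmentWeight;   // strength of the location relationship
        Kuidget(String id) { this.id = id; }
    }

    static final double MIN_WEIGHT = 0.5; // below this, do not infer positions

    // Walk the containment chain until a direct position is found.
    static double[] locate(Kuidget k) {
        if (k.position != null) return k.position;
        if (k.container == null || k.containmentWeight < MIN_WEIGHT)
            return null; // loosely coupled: cannot be inferred reliably
        return locate(k.container);
    }

    public static void main(String[] args) {
        Kuidget car = new Kuidget("car");
        car.position = new double[] {46.80, 7.15}; // GPS-equipped

        Kuidget user = new Kuidget("user");
        user.container = car; user.containmentWeight = 0.9;

        Kuidget phone = new Kuidget("phone");
        phone.container = user; phone.containmentWeight = 0.8;

        System.out.println(Arrays.toString(locate(phone))); // inherits the car's fix
    }
}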
4
Pilot Projects
In this section we present two projects that demonstrate the use of KUI and contextualized service provisioning. While they both exploit the concepts outlined in the RoamBlog architecture, each illustrates them only in part. In particular, the UbiBadge project adopts the indoor, discrete-motion KUI interface with contextual service provisioning, while GlideTrack illustrates a case for the outdoor, continuous-motion KUI.
4.1
UbiBadge
We outline the main aspects of a pilot project conducted at the University of Fribourg and based on the RoamBlog architecture. The idea behind this project is to offer contextualized services to students carrying an RFID badge, based on their wandering in the University's buildings. Events can be triggered, for example, by stopping at the student association billboard, or by entering or leaving a room. These events cause a series of contextualized reactions, for example the delivery of today's menu on the student's mobile phone when entering the university's canteen, or the sending of news when the student stops in front of a notice board.
Figure 2: UbiBadge architecture. Raw antenna data from the RFID hardware (Sensing Layer, based on the RIED RFID Locator) feed the RoamBlog middleware (Topology Layer and KUIdget Space, with MySQL persistent storage and JMS-based event management), which contextualizes services and delivers them to the application through a Web server.
The UbiBadge architecture, shown in Figure 2, sits on top of the RoamBlog middleware, where the University's topology (rooms, landmarks) and mobile entities (students) are abstracted to KUIdgets. The localization service is provided by the RFID Locator detailed in [11], also developed at the University of Fribourg. Raw events gathered from the RFID hardware are sent to the corresponding KUIdgets, which then spawn higher-level events (e.g. entering or leaving a room) that are captured by the Activity Layer. Based on this information, the latter updates the current context and informs the application of the context changes, in order to deliver the corresponding service.
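The event chain just described, from raw RFID readings to enter/leave events that trigger contextualized services, could look roughly as follows. Class structure and service names are illustrative, not the pilot's actual code.

import java.util.*;

// Hypothetical sketch of the UbiBadge event chain: raw RFID readings are
// turned into higher-level enter/leave events, which are then mapped to a
// contextualized service (e.g. today's menu when entering the canteen).
public class UbiBadgeSketch {

    // last known room per badge, maintained by the KUIdget space
    private final Map<String, String> currentRoom = new HashMap<>();
    private final Map<String, String> serviceByRoom = Map.of(
            "canteen", "today's menu",
            "billboard", "student association news");

    // Raw reading: antennaRoom is the geo-referenced location of the reader.
    public void onRfidReading(String badgeId, String antennaRoom) {
        String previous = currentRoom.put(badgeId, antennaRoom);
        if (previous != null && !previous.equals(antennaRoom)) {
            onHighLevelEvent(badgeId, "leave", previous);
        }
        if (!antennaRoom.equals(previous)) {
            onHighLevelEvent(badgeId, "enter", antennaRoom);
        }
    }

    // Activity layer: update the context and trigger the subscribed service.
    private void onHighLevelEvent(String badgeId, String kind, String room) {
        System.out.println(badgeId + " " + kind + "s " + room);
        String service = serviceByRoom.get(room);
        if (kind.equals("enter") && service != null) {
            System.out.println("deliver to " + badgeId + "'s phone: " + service);
        }
    }

    public static void main(String[] args) {
        UbiBadgeSketch s = new UbiBadgeSketch();
        s.onRfidReading("badge-42", "billboard"); // enter billboard: send news
        s.onRfidReading("badge-42", "canteen");   // leave billboard, enter canteen
    }
}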
4.2
GlideTrack
The GlideTrack project, whose architecture is shown in Figure 3, is an example of the use of KUI-based interaction in the case of continuous motion. The project is meant to provide a simple and portable solution for people to be tracked during a trip. The users of the GlideTrack application have an associated KUIdget whose context can be updated by simply receiving the current coordinates from a GPS-enabled cell phone through a GPRS Internet connection. Context changes are stored in a database, and can be retrieved and mashed up onto a geographical map (e.g. Google Earth).

Figure 3: GlideTrack architecture. A Bluetooth GPS receiver in the Sensing Layer pushes coordinates over GPRS to the RoamBlog middleware (Topology Layer and KUIdget Space, with MySQL persistent storage), which serves them through a Web server to a visualization client such as Google Earth.
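A minimal sketch of this flow, under stated assumptions: an in-memory list stands in for the database, and a bare KML coordinate fragment stands in for the full Google Earth mash-up.

import java.util.*;

// Sketch of the GlideTrack flow: GPS fixes pushed by the phone over GPRS
// update the traveller's KUIdget context; the stored trace can then be
// exported for a visualization client. Not the project's actual code.
public class GlideTrackSketch {

    record Fix(long timestamp, double lat, double lon) {}

    private final List<Fix> trace = new ArrayList<>(); // stands in for the DB

    // Invoked whenever the device pushes its current coordinates.
    public void onFix(double lat, double lon) {
        trace.add(new Fix(System.currentTimeMillis(), lat, lon));
    }

    // Interpolating the stored probes yields the continuous trajectory.
    public String toKmlCoordinates() {
        StringBuilder sb = new StringBuilder("<coordinates>\n");
        for (Fix f : trace)
            sb.append(f.lon).append(',').append(f.lat).append(",0\n"); // lon,lat,alt
        return sb.append("</coordinates>").toString();
    }

    public static void main(String[] args) {
        GlideTrackSketch g = new GlideTrackSketch();
        g.onFix(46.516, 6.629);
        g.onFix(46.803, 7.151);
        System.out.println(g.toKmlCoordinates());
    }
}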
5
UbiTour Scenario
We present an overview of an ongoing project on e-Tourism, combining indoor-outdoor KUI, geo-blogging and geo-contextualized service provisioning.
5.1
Scenario description
Discovery and surprise are among the main motivations of tourists. Tourists like to see new things and to find themselves in unexpected, but still controlled, situations. On the one hand, if tourists fully plan their vacations, they are likely to lose the surprise dimension and to find their vacation less satisfying than expected. On the other hand, tourists would like to have their trips organized by a tour operator in order to minimize the risk of finding themselves in unexpectedly bad situations. As an optimal trade-off between fully planned and completely blind vacations, tourists might like to have a certain degree of flexibility, and thus the possibility to cancel some already planned travel decisions that might prevent them from enjoying unexpected situations of which they could not be aware until the actual time of the trip. Finding the right trade-off between a rigid, fully organized tour and a fully flexible, unorganized vacation is the problem we would like to address here. UbiTour is our solution to this problem. UbiTour allows its users to start with a minimally organized vacation and to incrementally and contextually plan the next steps "on the go". Moreover, UbiTour provides contextual help and recommendations, which the user is still free to adopt or ignore.
This is clearly a scenario where RoamBlog could play a fundamental role. To some extent, the user is always (transparently and unobtrusively) followed by a virtual assistant which can either be contextually queried or take the initiative in reaction to relevant context changes. Moreover, UbiTour tracks users' displacements and collects data about their experiences by updating travel blogs. In this way, users do not have to worry about collecting their travel notes in a journal or uploading their pictures/movies to their blogs.11 As a desired side effect, safety issues are also addressed in this scenario: the interruption of the information flow from the roaming user might signal an emergency situation that could be detected and pro-actively handled by the UbiTour assistant. The UbiTour application defines a geographical KUI space where the user can freely move, select items, receive and send information using personal devices, and interact with local connected devices (e.g. touch screens). As an example of the types of services offered by UbiTour, imagine that a user has spent a whole day driving along the Californian coast, where she stopped several times as she spotted an interesting landscape. At 8 PM she is tired and wants to find overnight accommodation in the closest urban center. She is approaching San Luis Obispo and she informs the system that she needs a place to sleep. The system already knows her accommodation preferences, and when she spots a hotel and passes by it (decreasing her speed or stopping nearby), the system checks in a database whether the hotel suits her preferences and whether a room is available. It then informs her of the search results and starts a confirmation dialogue using a suitable modality (e.g. voice). Let us now look at the details of what happens at the RoamBlog level. First, the user activates the KUI interface by informing the UbiTour application that she is looking for a motel where she intends to spend the night. UbiTour selects a number of relevant landmarks (i.e. motels connected to the UbiTour network) and starts recognizing when the user is approaching or stopping at one of them by listening to the location change events generated by the KUI middleware. Technically, the KUI-space Layer sends events when the user enters or leaves the area associated with a landmark (i.e. a proximity relation has been created). These events are mapped onto higher-level actions of the "Motel selection" task in the overall activity of "finding an overnight accommodation". The next steps would be making the reservation and checking in, possibly using an ordinary interface on a mobile device.
11 Nowadays an important concern of tourists is allowing relatives and friends to remotely follow them during their vacation. This is typically done by periodic calls or by manually updating a travel journal in a public blog.
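As a rough illustration of the "Motel selection" task (hypothetical names and preference model, not the UbiTour implementation), proximity events for relevant landmarks are filtered against stored preferences before a confirmation dialogue is started, while stops with no relevant nearby landmark are simply ignored.

import java.util.*;

// Sketch: once the user declares the goal, the application listens to KUI
// proximity events for relevant landmarks only, filters them against stored
// preferences, and starts a confirmation dialogue.
public class UbiTourMotelTask {

    record Motel(String name, double pricePerNight, boolean roomAvailable) {}

    private final Map<String, Motel> landmarks = new HashMap<>();
    private final double maxPrice; // stands in for the user's preferences

    public UbiTourMotelTask(double maxPrice) { this.maxPrice = maxPrice; }

    public void addLandmark(String areaId, Motel m) { landmarks.put(areaId, m); }

    // Location change event from the KUI middleware: the user slowed down or
    // stopped inside the area associated with a landmark.
    public void onEnterArea(String areaId) {
        Motel m = landmarks.get(areaId);
        if (m == null) return;                 // not part of this task: ignore
        if (m.roomAvailable() && m.pricePerNight() <= maxPrice) {
            System.out.println("Confirm booking at " + m.name() + "? (voice dialogue)");
        }
    }

    public static void main(String[] args) {
        UbiTourMotelTask task = new UbiTourMotelTask(120.0);
        task.addLandmark("slo-area-1", new Motel("Ocean View Motel", 95.0, true));
        task.onEnterArea("slo-area-1");  // proximity event: confirmation dialogue
        task.onEnterArea("gas-station"); // no relevant landmark: filtered out
    }
}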
6
Conclusions
The original contribution of this work to geo-contextualized service provisioning is the seamless and transparent integration of indoor and outdoor motion detection within a context-aware mobile Internet infrastructure. The RoamBlog architecture enables users of mobile Internet applications to interact with a back-end system by means of a new user interface based on motion tracking, the Kinetic User Interface, with no manual switching between the indoor and outdoor modalities. This brings the advantage of reduced user intervention on the carried mobile devices, as well as better awareness of the services being activated by the user's motion. We described the basic features of the RoamBlog architecture, as well as three scenarios in which this type of architecture shows its practical usefulness.
6.1
Future Work
RoamBlog is a young project, and thus full of shortcomings and possible extensions; it does not make sense to fully address them here. However, it is worth noting that RoamBlog stems from ideas related to Activity Theory [6][15] and Multi-Agent Systems [18]. Our goal is to better ground the RoamBlog framework on insightful theories and to develop a clear software engineering methodology for the rational design of ubiquitous computing applications. We are also interested in usability issues, for instance in studying the new types of affordances12 and interfaces that this kind of "invisible computer" might offer. As pointed out by [16], the power of ubiquitous computing lies in the power of the infrastructure. We add that the infrastructure as it is now is not "usable", in the sense that users are still constrained by the old-fashioned WIMP13 user interfaces on mobile devices. In contrast, Information Appliances (i.e. intelligent and networked consumer appliances) and Blogjects (i.e. location-aware, real or virtual, active objects with Internet connectivity) will be the types of devices we will have to cope with.
12 Affordances are operations made available to users by an object which are self-evident from its essential features [12].
13 WIMP stands for Windows, Icons, Menus, and Pointing.
The productive coordination among these entities is not trivial and will definitely lead to the creation of new concepts in distributed computing and human-computer interaction.
Acknowledgements RoamBlog is part of the COPE project on Computing in the Physical Environment: Context-Aware and Intelligent Acting, currently ongoing at the Pervasive & Artificial Intelligence research group14, Department of Computer Science, University of Fribourg, directed by Prof. Béat Hirsbrunner.
14 http://diuf.unifr.ch/pai/research/
Bibliography
[1] G.D. Abowd, C.G. Atkeson, J.I. Hong, S. Long, R. Kooper, M. Pinkerton [1997] Cyberguide: A mobile context-aware tour guide. Wireless Networks 3(5).
[2] D. Ashbrook, K. Lyons and J. Clawson [2006] Capturing Experiences Anytime, Anywhere. IEEE Pervasive Computing 5(2), April-June 2006, pp. 8-11.
[3] M. Baldauf, S. Dustdar [2004] A Survey on Context-Aware Systems. Technical Report TUV-1841-2004-24 (30 November 2004), Technical University of Vienna.
[4] L. Baresi, D. Bianchini, V. De Antonellis, M.G. Fugini, B. Pernici, and P. Plebani [2003] Context-aware Composition of E-Services. In Proceedings of the TES03 (Technologies for E-Services) Workshop, in conjunction with VLDB 2003, Berlin, Germany, September 2003, LNCS 2819, pp. 28-41, Springer-Verlag, 2003.
[5] J. Bleecker [2006] Why Things Matter: A Manifesto for Networked Objects. Cohabiting with Pigeons, Arphids and Aibos in the Internet of Things, 2006. Available at http://research.techkwondo.com/files/WhyThingsMatter.pdf.
[6] S. Bødker [1991] Through the Interface: A Human Activity Approach to User Interface Design. Lawrence Erlbaum Associates, 1991.
[7] C. Cáceres, A. Fernández, S. Ossowski [2006] CASCOM: Context-Aware Health-Care Service Co-ordination in Mobile Computing Environments (URJC). ERCIM News 60, 2006.
[8] G. Chen and D. Kotz [2000] A Survey of Context-Aware Mobile Computing Research. Dartmouth Computer Science Technical Report TR2000-381.
[9] M. Davis et al. [2004] From Context to Content: Leveraging Context to Infer Media Metadata. In Proceedings of the 12th Annual ACM International Conference on Multimedia (MM 2004), ACM Press, 2004, pp. 188-195.
[10] A.K. Dey, D. Salber, G.D. Abowd [2001] A Conceptual Framework and a Toolkit for Supporting the Rapid Prototyping of Context-Aware Applications. Anchor article of a special issue on Context-Aware Computing, Human-Computer Interaction (HCI) Journal 16(2-4), 2001, pp. 97-166.
[11] P. Fuhrer, D. Guinard, and O. Liechti [2006] RFID: From Concepts to Concrete Implementation. In Proceedings of the International Conference on Advances in the Internet, Processing, Systems and Interdisciplinary Research, IPSI-2006, Marbella, February 10-13, 2006.
[12] J.J. Gibson [1979] The Ecological Approach to Visual Perception. Lawrence Erlbaum, Hillsdale, NJ, 1979.
[13] L. Holmquist, A. Schmidt, B. Ullmer [2004] Tangible interfaces in perspective: Guest editors' introduction. Personal and Ubiquitous Computing 8(5), 2004, pp. 291-293.
[14] S. Kouadri Mostéfaoui [2003] Towards a Context-Oriented Services Discovery and Composition Framework. In Proc. AI Moves to IA: Workshop on Artificial Intelligence, Information Access, and Mobile Computing, held in conjunction with the 18th Int'l Joint Conf. on Artificial Intelligence (IJCAI '03), 2003.
[15] B. Nardi [1996] Context and Consciousness: Activity Theory and Human-Computer Interaction. Cambridge, MA: MIT Press, 1996.
[16] D.A. Norman [1999] The Invisible Computer. Cambridge, MA: MIT Press, 1999.
[17] P. Prekop and M. Brumett [2002] Activities, Context and Ubiquitous Computing. Computer Communications, special issue on Ubiquitous Computing, 2002.
[18] A. Ricci, M. Viroli, A. Omicini [2006] Construenda est CArtAgO: Toward an Infrastructure for Artifacts in MAS. In Proceedings of the European Meeting on Cybernetics and Systems Research (EMCSR'06), April 18-21, 2006, University of Vienna.
[19] W. Roush [2005] What is Continuous Computing? In Continuous Computing Blog, http://www.continuousblog.net/2005/05/what_is_continu.html.
[20] B.N. Schilit, A. LaMarca, G. Borriello, W.G. Griswold, D. McDonald, E. Lazowska, A. Balachandran, J. Hong, V. Iverson [2003] Challenge: Ubiquitous Location-Aware Computing and the "Place Lab" Initiative. In Proceedings of WMASH'03, September 19, 2003, San Diego, California, USA.
[21] B.N. Schilit [1995] System architecture for context-aware mobile computing. Ph.D. thesis, Columbia University, 1995. http://citeseer.ist.psu.edu/schilit95system.html.
[22] L. Suchman [1987] Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge, UK: Cambridge University Press, 1987.
[23] M. Weiser [1991] The Computer for the 21st Century. Scientific American 265(3), pp. 94-104, September 1991.
[24] M. Weiser [1993] Hot topic: Ubiquitous computing. IEEE Computer, pp. 71-72, October 1993.
On the Value of Query Logs for Modern Information Retrieval
Domenico Laforenza∗, Claudio Lucchese‡∗, Salvatore Orlando‡∗, Raffaele Perego∗, Diego Puppin∗, Fabrizio Silvestri∗
∗ ISTI-CNR, Consiglio Nazionale delle Ricerche, Pisa, Italy
‡ Dip. di Informatica, Univ. Ca’ Foscari - Venezia, Italy
{d.laforenza,r.perego,d.puppin,f.silvestri}@isti.cnr.it, {clucches,orlando}@dsi.unive.it
Abstract
Query logs collected by a Web Search Engine (WSE) constitute a valuable source of information which can be used in several ways to enhance the efficiency and efficacy of the complex process of searching. This paper surveys the results recently achieved by our group in the design of innovative solutions targeting parallel Information Retrieval (IR) systems. Our solutions exploit the knowledge derived from the patterns of common usage of the system, extracted from query logs. Such knowledge has been used: (1) to devise an effective policy for caching WSE query results; (2) to drive the partitioning of the inverted index among the nodes of a term-partitioned, parallel IR system; (3) to perform document partitioning and effective collection selection in a document-partitioned, parallel IR system. The techniques and algorithms used range from simple statistical analysis to frequent pattern mining and document/query co-clustering. They have the common denominator of exploiting past usage information, and of granting remarkable improvements in efficiency or efficacy. The paper briefly describes the proposals and the framework of their
application, and reports the results of experiments conducted on large query logs of real WSEs.
1
Introduction
WSEs, which have permitted users to face Web information overload, perhaps constitute the most important novelty of the last decade in the information technology field. Even if current WSEs are based on techniques of distributed and parallel IR, their design has created new challenges for researchers working in the field. One of the most important ones deals with the very large and growing amount of information present on the Web. This large amount of information has called for novel IR approaches that improve not only the scalability and efficiency of WSEs, but also their efficacy, i.e. the quality and relevance of the discovered contents. In order to improve Web surfing, it has nowadays become very important to learn about the behavior of communities and individual users. This is useful not only for customizing Web information or giving personalized recommendations for user navigation, but also for other purposes concerning the management and efficiency of a Web system, or the effective design of a Web site. It is worth remarking that, to learn about user preferences and past requests, a common trend is to exploit Web Mining, which corresponds to the use of Data Mining (DM) techniques to (semi-)automatically discover and extract information from Web documents and services, as well as from their usage. In our recent research work, we applied Web Mining to the optimization of WSE design, where we exploited the knowledge about WSE usage by analyzing in various ways the logs of user queries. These logs can be collected in several loci: on clients, WSE servers, Web servers, and proxies. Note that the logs constitute a sort of secondary data derived from the user interactions with the Web and a WSE, while the primary data are the Web contents or the associated WSE indexes. Since we focused on the analysis of logs, we exploited a particular research subject in the area of Web Mining [7, 10], called Web Usage Mining. Caching Query Results. The first use case in which we exploited past knowledge from query logs regards the design of a server-side cache of query results, to be deployed on the front-end of a high-
performance, parallel (and distributed) WSE to enhance its responsiveness and throughput. In more detail, we proposed SDC, a new caching strategy aimed at efficiently exploiting the temporal and spatial locality present in the stream of processed queries. The Web usage mining analysis carried out by SDC on query logs is very simple: we identify the most frequently submitted queries, and store them, along with the associated pages of results, in a static, read-only portion of the cache. The remaining entries of the cache are dynamically managed according to a given replacement policy, and are used for those queries that cannot be satisfied by the static portion. Since SDC is driven by statistical analyses of usage data, it may suffer from performance degradation due to problems concerning the freshness of query logs. From our tests we verified that, when statistics are drawn over a relatively long period of time, the hit ratio in the static set of our SDC-based cache degrades quite slowly. Partitioning the lexicon of a Parallel WSE. We proposed a supervised strategy to subdivide the vocabulary of terms (lexicon) of a document collection. The goal was to enhance the performance of a term-partitioned, large-scale, parallel IR system implemented on a shared-nothing, distributed-memory platform. In order to improve the throughput of a WSE, we thus exploited the usage patterns extracted from past query logs to optimize the placement of terms and the associated posting lists on the parallel servers making up a WSE. In this case, our analysis of past query logs was more complex than that performed for SDC, since we were interested in evaluating the correlations between distinct terms that frequently occur in past queries. It is worth remarking that the aim of our optimization technique was twofold: we aimed not only at balancing the workload among the parallel servers, but also at reducing the average number of servers activated per query, thus lowering the overall communication volume. We thus investigated how the proposed index organization affects the query processing performance, and how the critical parameters driving our technique can be tuned in order to maximize its performance. Query-Driven Document Partitioning and Collection Selection. While in the previous case we studied how to increase the performance of a term-partitioned organization of a parallel WSE, in this case we dealt with the more common document-partitioned organization,
where local indexes are separately built by indexing non-overlapping sub-collections of documents. In more detail, we improved the performance of the WSE by devising a novel strategy to partition the whole document collection onto the various servers. The queries are then answered by performing collection selection, i.e., by selecting a subset of servers (each containing the local index of a distinct sub-collection) that can likely resolve a query with high precision. Our method for query resolution is different from the usual approach followed in document-partitioned WSEs, where each query is processed by all servers, and the lists of partial results retrieved from the various sub-collections are finally merged. The method used to partition the collection is once more based on the analysis of past query logs. In this case, the Web mining technique is based on a novel document representation, called the query-vectors model, according to which a document is represented by a list recording the (past) queries for which the document itself was a match. Since we aimed both at partitioning the document collection and at building an effective function for collection selection, we used a method to co-cluster queries and matching documents. The document clusters are then assigned to the underlying servers of our distributed WSE, while the corresponding query clusters exactly represent queries that return similar results, and are thus used for collection selection. We showed that this document partitioning strategy greatly boosts the performance of standard collection selection algorithms. The rest of the paper is organized as follows. Section 2 discusses the use of query logs for enhancing the caching policy of WSE query results. In Section 3 we show how the knowledge extracted from past query logs can be used to improve the performance of a term-partitioned parallel WSE. Section 4 discusses how we can use the same logs to improve document partitioning and collection selection in a document-partitioned parallel WSE. Finally, Section 5 draws some conclusions.
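Before turning to the individual techniques, a minimal sketch of the query-vectors representation mentioned above may help fix ideas. The Jaccard similarity used here to compare documents is our illustrative choice, not necessarily the measure adopted in the actual co-clustering.

import java.util.*;

// Minimal sketch of the query-vectors document representation: each document
// is described by the identifiers of the past queries for which it was a
// match. An illustration, not the authors' implementation.
public class QueryVectorsModel {

    // docId -> set of query ids that matched the document in the log
    private final Map<String, Set<Integer>> queryVectors = new HashMap<>();

    public void recordMatch(int queryId, List<String> matchingDocs) {
        for (String doc : matchingDocs)
            queryVectors.computeIfAbsent(doc, d -> new HashSet<>()).add(queryId);
    }

    // Documents with similar query-vectors answer similar queries, and are
    // therefore good candidates to be co-clustered onto the same server.
    public double jaccard(String docA, String docB) {
        Set<Integer> a = queryVectors.getOrDefault(docA, Set.of());
        Set<Integer> b = queryVectors.getOrDefault(docB, Set.of());
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        QueryVectorsModel m = new QueryVectorsModel();
        m.recordMatch(1, List.of("d1", "d2"));
        m.recordMatch(2, List.of("d1", "d2", "d3"));
        System.out.println(m.jaccard("d1", "d2")); // 1.0: matched by the same queries
    }
}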
2
Caching Query Results
Caching is routinely used in the Web since it allows the bandwidth consumption to be reduced, and the user-perceived quality of service to be improved [18]. Due to the high locality present in the stream of queries processed, caching the results of the queries submitted by users to a WSE is a very effective technique to increase the throughput [8,
11, 14, 21]. It is virtually certain that some commercial Web search companies have adopted query-result caching from the beginning of their existence. Our work on query-result caching started from an in-depth analysis of the usage of real WSEs. The plots reported in Figure 1 assess the locality present in three query logs: Tiscali, a trace of queries submitted to the Tiscali WSE (www.janas.it) in April 2002; Excite, a publicly available trace of the queries submitted to the Excite WSE (www.excite.com) on September 16th, 1997; and Alta Vista, a query log containing queries submitted to Alta Vista (www.altavista.com) in the Summer of 2001. In particular, Figure 1.(a) plots, using a log-log scale, the number of occurrences within each log of the most popular queries, whose identifiers have been assigned in decreasing order of frequency. We can see that in the three logs about 1000 different queries are repeated more than 100 times. Note that the trends of the three curves follow a Zipf's law. We briefly recall that the occurrences y of a phenomenon are said to follow a Zipf's law if y = K x^(-α), where x is the rank of the phenomenon itself. In our case K varies according to the query log analyzed, while α ≈ 0.66 in all three cases. By assuming that a Zipf's law governs the queries submitted to a WSE, it is clear that caching WSE query results can lead to very high hit ratios. However, the above analysis does not give any hint about the best policy to be adopted for managing the cache. Is a simple LRU cache the best choice to capture the high locality present in the stream of queries? In order to answer this question we conducted some tests on the Alta Vista log aimed at estimating how many times a frequent query q would be evicted from a pure LRU cache because the distance between two successive requests for q (in terms of distinct queries) is greater than the size of the cache itself. The Alta Vista log contains 7,175,648 queries spanning a period of about 4.7 days. We extracted from it the 128,000 most popular queries, and measured the number of distinct queries received in the interval between one submission of each of these popular queries and the next. Figure 1.(b) plots the cumulative number of occurrences of each distance, measured as the number of distinct queries received by Alta Vista in the interval between two successive submissions of each frequent query. The surprising result was that many popular queries are resubmitted after long intervals. These queries would surely cause cache misses in an LRU cache having a size smaller than the distance plotted on the x-axis. In particular, we measured that 698,852 repetitions of the most popular queries occur at distances lower than 128,000, while 164,813 occur at distances greater than 128,000. Thus, the adoption of an LRU cache of 128,000 blocks would surely result in 164,813 misses for these most popular queries.

Figure 1: Analysis of the locality present in the query logs: (a) number of occurrences of the most popular queries (log-log scale); (b) distribution of the distances (number of distinct queries) between successive submissions of the frequent queries.
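The two measurements behind Figure 1 can be sketched as follows. This is an illustration only: the quadratic distance computation is adequate for a small example, not for a trace of seven million queries.

import java.util.*;

// Sketch of the two query-log locality measurements: (a) per-query occurrence
// counts, whose rank/frequency curve follows a Zipf's law y = K x^(-alpha);
// (b) for each repeated query, the number of distinct queries received
// between two successive submissions.
public class QueryLogLocality {

    // (a) occurrence counts, to be sorted by decreasing frequency for the plot
    static Map<String, Integer> frequencies(List<String> stream) {
        Map<String, Integer> freq = new HashMap<>();
        for (String q : stream) freq.merge(q, 1, Integer::sum);
        return freq;
    }

    // (b) inter-submission distances; a pure LRU cache with fewer entries than
    // such a distance misses on the corresponding resubmission.
    static List<Integer> distances(List<String> stream) {
        Map<String, Set<String>> sinceLast = new HashMap<>(); // per-query interval sets
        List<Integer> out = new ArrayList<>();
        for (String q : stream) {
            Set<String> interval = sinceLast.get(q);
            if (interval != null) out.add(interval.size());
            sinceLast.put(q, new HashSet<>()); // open a new interval for q
            for (Map.Entry<String, Set<String>> e : sinceLast.entrySet())
                if (!e.getKey().equals(q)) e.getValue().add(q);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> log = List.of("a", "b", "c", "a", "b", "b", "a");
        System.out.println(frequencies(log)); // {a=3, b=3, c=1}
        System.out.println(distances(log));   // [2, 2, 0, 1]
    }
}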
On the other hand, we saw that there are a lot of queries that are popular only within relatively short time intervals, and that can be effectively handled by an LRU-based dynamic cache. This analysis suggested to us SDC, our hybrid static-and-dynamic cache. SDC is a two-level policy which makes use of two different sets of cache entries. The first level contains the Static Set, consisting of a set of statically locked entries filled with the most frequent queries that appeared in the past; the Static Set is periodically refreshed. The second level contains the Dynamic Set: basically, a set of entries managed by a classical replacement policy (e.g. LRU, SLRU, etc.). The behavior of SDC in the presence of a query q is very simple. First, it looks for q in the Static Set; if q is present, it returns the associated page of results back to the user. If q is not contained in the Static Set, then it is looked for in the Dynamic Set. If q is not present there either, then SDC asks the WSE for the page of results and replaces an entry of the Dynamic Set according to the replacement policy adopted. The implementation of the first level of the cache is very simple. It basically consists of a lookup data structure that allows efficient access to a set of fstatic · N entries, where N is the total number of entries of the whole cache, and fstatic is the fraction of locked entries over the total. fstatic is a parameter of our cache implementation whose admissible values range between 0 (a fully dynamic cache) and 1 (a fully static cache). Each time a query is received, SDC first tries to retrieve the corresponding results from the Static Set. On a cache hit, the requested page of results is promptly returned. On a cache miss, we also look for the query results in the Dynamic Set. The Dynamic Set relies on a replacement policy for choosing which pages of query results should be evicted when a cache miss occurs and the cache is full. The literature on caching proposes several replacement policies which, in order to maximize the hit ratio, try to take the largest advantage of information about the recency and frequency of references. SDC surely simplifies the choice of the replacement policy to adopt: the presence of a static read-only cache, which permanently stores the most frequently referred pages, makes recency the most important parameter to consider. The plots of Figure 2 report the cache hit ratios obtained by SDC on the Alta Vista and Tiscali query logs by varying the ratio fstatic between the sizes of the static and dynamic sets. Each curve corresponds to tests conducted by adopting a different replacement policy for the dynamic portion of the cache.

Figure 2: Hit ratios achieved on the Alta Vista and Tiscali query logs for different replacement policies and varying values of fstatic.
The value of fstatic was varied between 0 (a fully dynamic cache) and 1 (a fully static cache), while the replacement policies exploited were LRU, FBR, SLRU, 2Q, and PDC. The total size of the cache was fixed to 256,000 blocks. We can note that the hit ratios achieved are in some cases impressive, although the curves corresponding to different query logs have different peak values and shapes, thus indicating different amounts and kinds of locality in the query logs analyzed. Second, and more importantly, we see that SDC remarkably outperformed, in all the tests, both purely static (fstatic = 1) and purely dynamic (fstatic = 0) caching policies. Finally, we can see that the different replacement policies for the Dynamic Set behave similarly with respect to the SDC hit ratio. For example, on the Alta Vista log, all the curves but the FBR one resulted almost completely overlapped. This behavior seems to suggest that the most popular queries are satisfied by the Static Set. Thus, the Dynamic Set is only exploited by those queries which we can call Burst Queries, i.e., queries which appear frequently just for a brief period of time. For these queries, the small variations in the hit-ratio figures seem not to justify the adoption of a complex replacement policy rather than a simple LRU one. Another important issue for a caching policy driven by statistical data is that it may suffer from performance degradation due to problems concerning the freshness of the data from which the statistics have been drawn. To estimate the possible degradation of the contents of the Static Set, we analyzed the Alta Vista log to measure how the frequent queries used to fill the Static Set are referred to over time. To this end, we partitioned the log into two distinct sets: the Training set and the Test set. From the Training set we extracted the S most frequent queries, where S is the size of the Static Set. For these experiments, we fixed S to 128,000 elements. By varying the time over which the Static Set is trained, we measured the Static Set hit ratio on the remaining part of the log (i.e. the Test set), and we plotted it at regular time intervals. Figure 3.(b) shows the results of the experiment, where each curve represents the trend of the hit ratio as a function of time for a different training window. When trained with the queries submitted to Alta Vista during one hour only, the hit ratio on the Static Set was initially very high, but degraded rapidly to very low values (below 8%). As we increased
the training period, the curves became higher and flatter, since the hit ratio degraded less markedly as time progressed. For example, when the Static Set was trained with the queries of one day, the hit ratio achieved on the remaining test set of about three days ranged from 19% to 17%, with a slow progressive degradation. Finally, when trained for two days, the performance improved further. In this case, however, the remaining portion of the log used as test set was too small to allow us to draw strong conclusions. The freshness of the Static Set thus does not appear to be a big issue. When statistics are drawn over a relatively long period of time, the degradation rate of the hit ratio in the Static Set is quite slow. Thus, a Static Set refreshed daily should grant good and quite stable performance for a relatively long period.

Figure 3: Static Set hit ratio on the Alta Vista query log as a function of time. The size of the Static Set has been set to 128,000 elements, while the training time was varied between one hour and two days.
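Summing up, the two-level SDC lookup can be sketched as follows, assuming an LRU-managed Dynamic Set built on LinkedHashMap. The Results type and the back-end call are placeholders; this is an illustration of the policy described above, not the authors' implementation, and the Static Set would be rebuilt periodically from the most frequent queries of a fresh training log.

import java.util.*;

// Two-level SDC lookup sketch: a locked Static Set holding the most frequent
// past queries, plus an LRU Dynamic Set for the remaining (burst) queries.
public class SdcCache {

    record Results(String page) {}

    private final Map<String, Results> staticSet = new HashMap<>(); // locked
    private final LinkedHashMap<String, Results> dynamicSet;        // LRU

    // n: total entries; fStatic in [0,1]: fraction locked in the Static Set;
    // topQueries: most frequent past queries, in decreasing frequency order.
    public SdcCache(int n, double fStatic, LinkedHashMap<String, Results> topQueries) {
        int staticEntries = (int) (fStatic * n);
        final int dynamicEntries = n - staticEntries;
        topQueries.entrySet().stream().limit(staticEntries)
                  .forEach(e -> staticSet.put(e.getKey(), e.getValue()));
        dynamicSet = new LinkedHashMap<String, Results>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, Results> e) {
                return size() > dynamicEntries; // classical LRU eviction
            }
        };
    }

    public Results lookup(String query) {
        Results r = staticSet.get(query);          // first level: Static Set
        if (r != null) return r;
        r = dynamicSet.get(query);                 // second level: Dynamic Set
        if (r == null) {
            r = queryBackEnd(query);               // miss: ask the WSE core
            dynamicSet.put(query, r);              // may evict the LRU entry
        }
        return r;
    }

    private Results queryBackEnd(String query) {
        return new Results("results for " + query); // placeholder back-end
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Results> top = new LinkedHashMap<>();
        top.put("weather", new Results("cached weather page"));
        SdcCache cache = new SdcCache(4, 0.5, top);
        System.out.println(cache.lookup("weather")); // Static Set hit
        System.out.println(cache.lookup("news"));    // miss: fetched and cached
    }
}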
A. Soro, G. Armano and G. Paddeu (eds.) Distributed Agent-Based Retrieval Tools c 2006 Polimetrica International Scientific Publisher Monza/Italy
3 Partitioning the lexicon of a Parallel WSE
In order to grant scalability and high throughput, Parallel Information Retrieval Systems (PIRSs) are usually deployed on large clusters of servers running multiple IR core modules, each responsible for searching a partition of the whole inverted index [2]. When each sub-index is relative to a disjoint sub-collection of documents, we have a Local, or Document Partitioned, index organization. Conversely, when the whole index is split so that each partition refers to a subset of the distinct terms contained in all the documents, we have a Global, or Term Partitioned, index organization. In both cases, in front of the IR core servers we have an additional machine hosting a broker, which has the task of scheduling the queries to the various servers and collecting and ranking the results, eventually producing the final list of matching documents.
In a document-partitioned PIRS, all the servers generally contribute to the processing of each query in parallel on their own sub-indexes, so the load is almost evenly distributed among them. In a term-partitioned index organization, for each query the broker has to determine which servers hold the inverted lists relative to each of the q terms of the query. In the worst case, each one of the q inverted lists is held by a distinct server but, on average, several terms may be assigned to the same server, and the total number of servers involved is likely to be q′ ≤ q. The broker therefore has to split a query into q′ subqueries and send them to the corresponding servers (a sketch of this routing follows below). Unfortunately, document ranks computed locally by each server might not be significant, since only a portion of the whole query has been considered. Thus the whole list of partial results has to be sent, in principle, to the broker to choose the r most relevant documents.
Several papers have investigated partitioning schemes for the inverted index [1, 2, 4, 5, 9, 13, 16, 20, 22]. Since in large scale IR systems disk access time may be a significant part of query time, the term-partitioned approach is attractive because it minimizes disk activity [22]. On the other hand, the high computational load on the broker can starve the IR core servers. In addition, load imbalance may become a serious issue. Implementations of term-partitioned systems usually subdivide the vocabulary randomly or according to the lexicographic order of terms. In [9] the authors demonstrate that load imbalance can be mitigated by exploiting knowledge about the frequencies of the terms occurring in the queries. However, the partitioning techniques proposed so far fail
in granting an even distribution of subqueries among all the servers.
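As an illustration of the broker's job in a term-partitioned organization, the following Python sketch (all names are ours, and the term-to-server mapping is an assumed input) groups the terms of a query by the server holding their inverted lists, producing the q′ ≤ q subqueries mentioned above.

    from collections import defaultdict

    def route_query(query_terms, term_to_server):
        """Split a query into one subquery per server holding its terms."""
        subqueries = defaultdict(list)
        for term in query_terms:
            server = term_to_server[term]     # term assignment (random,
            subqueries[server].append(term)   # lexicographic, or statistical)
        return dict(subqueries)               # q' = len(result) servers

    # Example: co-locating correlated terms yields only q' = 2 subqueries.
    assignment = {"parallel": 0, "information": 0, "retrieval": 1}
    print(route_query(["parallel", "information", "retrieval"], assignment))
    # {0: ['parallel', 'information'], 1: ['retrieval']}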
In [16], Moffat et al. introduce a novel pipelined query evaluation methodology, based on a term-partitioned index, in which partially evaluated queries are passed through the set of servers that host the query terms. The experiments are conducted in a very realistic setting, making use of a large collection of data (426GB) and a long stream of actual queries. The new pipelined approach remarkably outperforms traditional term-partitioned implementations, while still maintaining the same desirable I/O properties. The drawback of the new method is once again the poor balancing of the servers' workload, which becomes a more serious problem as the number of index partitions increases. For this reason the authors indicate, as future directions of research, the exploration of load balancing methodologies based on the exploitation of the term usage patterns present in the query stream. Such patterns can drive both the dynamic reassignment of lists while the query stream is being processed, and the selective replication of the most accessed inverted lists.
A recent work by Moffat et al. studies the effect of term distribution and replication in a pipelined term-distributed parallel IR system [15]. They call the distribution strategy they tested fill smallest; basically, it consists in applying a bin-packing strategy to fill up the partitions, weighting each term with its frequency of occurrence within a stream of queries. As the replication strategy, they replicated in different ways the index entries associated with a subset of the most frequent terms.
Our proposal goes exactly in the direction indicated in [16] and [15], and demonstrates the feasibility and efficacy of exploiting a statistically driven partitioning of the vocabulary of terms among the servers, in order to: (1) reduce the average number of IR core servers activated per query, thus globally reducing communication volume; and (2) evenly balance the number of subqueries among the IR core servers. We think that, in order to pursue both these goals, splitting the vocabulary according to the relative frequencies of the terms appearing in the queries is not sufficient. Differently from [9, 15], we believe that correlations among distinct terms also have to be considered in order to globally reduce communication volume and achieve a well balanced workload.
Due to space limitations we cannot present the complete theoretical framework surrounding our work. We can, however, give the intuition behind the optimization problem that we want to face. First of all, we consider two working hypotheses valid for a pipelined, term-distributed parallel IR system like the one described in [15]:
1. the throughput of the system can be considered O(|Φ| / L̂(Φ)), where Φ is the stream of queries submitted (|Φ| is its length), and L̂(Φ) is the total computation time of the slowest server involved in the processing of the queries in Φ;
2. the average query response time can be considered O(Σ_{Q∈Φ} ω(Q) / |Φ|) = O(ω̄(Φ)), where ω(Q) is the number of servers involved in the resolution of the query Q.
Given that these two hypotheses hold, we can now define the metric we are going to use in our problem definition.
Definition 1 [α-weight] Given a query stream Φ, we define the α-weight of Φ, according to some term assignment, as:

    Ω(Φ) = α · ω̄(Φ)/Nω + (1 − α) · L̂(Φ)/NL        (1)

where 0 ≤ α ≤ 1 is a parameter that can be used to tune the α-weight: values of α < 0.5 make the weight more biased toward optimizing the completion time L̂(Φ), and thus the throughput, while values of α > 0.5 bias it toward the average query width ω̄(Φ), and thus the response time. Nω and NL are constants aimed at normalizing the two factors, so that 0 ≤ ω̄(Φ)/Nω ≤ 1 and 0 ≤ L̂(Φ)/NL ≤ 1. Note that, in both cases, 0 (1) is the best (worst) value.
Given the above hypotheses and the α-weight definition, throughput can be maximized by minimizing L̂(Φ), while query response time can be minimized by minimizing ω̄(Φ). Unfortunately, these two measures cannot be optimized independently: as an example, an assignment that puts all the terms appearing in Φ on the same server would obviously achieve the best value of ω̄(Φ), but surely a very bad value of L̂(Φ).
The α-weight definition allows us to consider both aspects of the problem, and to devise a good tradeoff between them. Furthermore, the α parameter gives us the possibility of weighting the relative importance of throughput and response time in optimizing the term assignment function.
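As a concrete illustration, here is a minimal Python sketch of how the α-weight of Equation (1) could be evaluated for a candidate term assignment, under the simplifying assumption (ours, not the paper's cost model) that each subquery costs one unit of load on the server that receives it; L̂(Φ) then reduces to the load of the busiest server and ω̄(Φ) to the average number of servers per query.

    def alpha_weight(queries, term_to_server, alpha, n_omega, n_load):
        """Evaluate Omega(Phi) = alpha * omega_bar/N_omega
                               + (1 - alpha) * L_hat/N_L   (Equation 1)."""
        load = {}                       # per-server load
        total_width = 0                 # sum of omega(Q) over the stream
        for q in queries:
            servers = {term_to_server[t] for t in q}
            total_width += len(servers)         # omega(Q): servers involved
            for s in servers:                   # unit cost per subquery
                load[s] = load.get(s, 0) + 1
        omega_bar = total_width / len(queries)  # average query width
        l_hat = max(load.values())              # load of the slowest server
        return alpha * omega_bar / n_omega + (1 - alpha) * l_hat / n_load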
    Servers   random (baseline)   bin packing (baseline)   term assignment (α = 0.9)
    1                29                    29                          41
    2                39                    39                          38
    3                21                    21                          16
    >3               11                    11                           5
Table 1: Percentages of queries as a function of the number of servers involved in their processing.
We now have all the figures needed to define the Term-Assignment Problem.
The Term-Assignment Problem. Given a value α, 0 ≤ α ≤ 1, a query stream Φ, and p servers of a pipelined term-partitioned PIRS, the Term-Assignment Problem asks for the partitioning which minimizes Ω(Φ).
Term assignment is a very difficult problem which has some points in common with bin-packing, a classical NP-hard problem. A heuristic for the bin-packing problem which uses popularity as the weight for fitting terms into the partitions (as in [9, 15]) will obviously optimize the L̂(Φ) part of the Ω(Φ) metric, but would ignore the other term, ω̄(Φ), representing the response time. To find a good trade-off and allow the optimization of both aspects, we instead used a term clustering technique based on the results of a scalable Frequent Itemset Mining algorithm [17]. The clustering algorithm packs together terms that frequently co-occur in many queries. We do not enter too much into the details of the algorithm; those interested in a more technical description can refer to [12]. We will briefly show here some of the results we obtained by simulating the execution of a term-partitioned parallel IR system using our partitioning heuristic.
In order to validate our term assignment strategy we used the AltaVista query log. It contains 7,175,648 queries composed of 895,792 distinct terms; the average length of the queries in the log is 2.507 terms per query. Each record of a query log refers to a query submitted to the web search engine for requesting a single page of results. The log was preliminarily cleaned and transformed into a transactional form, where each query is simply formed by a query identifier and the list t1, t2, . . . , tq of terms searched for. The first 2/3 of the query
[Figure 4 consists of two plots, (a) AltaVista, disk-dominant setting, and (b) AltaVista, network-dominant setting, each reporting the maximum load, the average load, the average width, and the bin packing average width as a function of α, with server load on the left y-axis and query width on the right y-axis.]
Figure 4: (a) and (b): values of L̂(Φ) and ω̄(Φ) as a function of the tuning parameter α.
log (Φtraining) was used to extract the frequent itemsets exploited by our algorithm, while the last part (Φtest) was used to test the partitioning obtained. Unfortunately, we did not have a real collection of documents, coherent with the query log, available, and we could not test our algorithm on a real web search engine. Therefore, we validated our approach by simulating a broker and assuming constant times for disk access, intersection computation, and communication setup, disregarding the lengths of the posting lists. In all the tests discussed here we considered a partitioning of the index among p = 8 servers. Tests were also conducted with different numbers of partitions; since trends and macroscopic behaviors do not change remarkably, we do not report these results.
As baseline competitors of our approach, we consider a random assignment of terms, and the bin packing strategy called fill smallest in [15]. In the random case each term was randomly assigned to one of the p servers. The bin packing strategy considers the number of occurrences of each term in Φtraining as a weight, and assigns the terms to the p servers by means of a simple greedy heuristic: it takes one term at a time, in decreasing order of frequency, and always assigns it to the most under-loaded server (see the sketch below).
The plots reported in Figure 4 show the effectiveness and flexibility of our optimization function and its α-weighting. Each plot reports, as a function of α, the values measured on Φtest of both L̂ (on the left y-axis of each plot) and ω̄ (on the right y-axis of each plot). Furthermore, in each plot we report the average load L̄, and the baseline value of ω̄ resulting from the bin packing heuristic. The plot in 4.(a) refers to a disk-dominant setting; the plot in 4.(b) is relative to a network-dominant one.
In all cases we can see that our approach almost evenly balances the workload among the servers: in fact, the values of L̄ and L̂ are very close for all values of α up to 0.8. On the other hand, the ω̄ curves decrease for increasing values of α. On the query log tested, the value of ω̄ becomes smaller than 2, which means that most of the queries are answered using only one server. As expected, a setting of α which weights ω̄ too much results in an unbalanced assignment of terms to some servers, while a setting which weights L̂ too much does not improve the average query width. Table 1 shows the percentage of queries occurring in the test set Φtest of the query log as a function of their width.
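For reference, a minimal sketch of the fill smallest baseline as just described (naming is ours): terms are taken in decreasing order of query-log frequency and greedily assigned to the currently least-loaded of the p servers.

    import heapq
    from collections import Counter

    def fill_smallest(training_queries, p):
        """Greedy bin-packing baseline: assign each term, in decreasing
        order of frequency, to the most under-loaded of the p servers."""
        freq = Counter(t for q in training_queries for t in q)
        heap = [(0, s) for s in range(p)]       # (current load, server id)
        heapq.heapify(heap)
        assignment = {}
        for term, f in freq.most_common():      # decreasing frequency
            load, server = heapq.heappop(heap)  # least-loaded server
            assignment[term] = server
            heapq.heappush(heap, (load + f, server))
        return assignment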
Our assignment strategy allows us to remarkably increase the number of queries involving only one server: the number of queries served by a single server was about 50% higher than with the random and bin packing assignments. As a consequence, the number of queries requiring more than one server decreases correspondingly, and the number of queries solved by more than 3 servers is halved.
4 Query-Driven Document Partitioning and Collection Selection
Document partitioning is the strategy usually chosen by the most popular web search engines [3]. In the document-partitioned organization, the broker may choose between two possible strategies for scheduling a query. A naïve, yet very common, way of scheduling queries is to broadcast each of them to all the underlying IR cores. This method has the advantage of enabling perfect load balancing among all the servers; on the other hand, it has the major drawback of exploiting all the servers for each query submitted. The other possible way of scheduling is to choose, for each query, the most authoritative server(s), thus reducing the number of IR cores queried. The relevance of each server to a given query is computed by means of a collection selection function, usually built upon statistics computed over each sub-collection.
In our work [19], we used query-log analysis to drive the assignment of documents, with the goal of putting together the documents that answer a given query. We do this, first, by representing each document as a query-vector, i.e., a (sparse) vector listing the queries to which it answers, weighted with the search score, and, second, by performing co-clustering [6] on the query-document contingency matrix. The resulting document clusters are used to partition the documents among the servers, while the query clusters are used to guide our collection selection strategy.
Typically, partitioning and selection strategies are based on information gathered from the document collection. Partitioning, in particular, is either random, or based on clustering documents, e.g., using k-means [25]. In both cases, documents are partitioned without any knowledge of what queries will be like. We believe that information and statistics about queries may help in driving the partitions toward an optimal choice. Our goal is to cluster the most relevant documents for
each query in the same partition. The cluster hypothesis states that closely associated documents tend to be relevant to the same requests [23]. Clustering algorithms, like the k-means method cited above, exploit this claim by grouping documents on the basis of their content. We instead base our method on co-clustering queries with the documents returned in reply to each of them. The algorithm we adopt is described in [6] and is based on a model exploiting the joint probability of picking a given couple (q, d), where q is a given query and d is a given document. All these probability values are collected into a contingency matrix. Given a contingency matrix, co-clustering is the general problem of simultaneously clustering columns and rows, in order to maximize some clustering metric (e.g., inter-cluster distance).
The way we consider documents for co-clustering can be seen as a new way of modeling documents. So far, two popular ways of modeling documents have been proposed: bag-of-words and vector space. Since we know which documents are given as answers to each query, we can represent a document as a query-vector.
Query-vector model. Let Φ be a query log containing queries q1, q2, . . . , qm. Let di1, di2, . . . , dini be the list of documents returned as results to query qi. Furthermore, let rij be the rank value associated with the pair (qi, dj). A document dj is represented as an m-dimensional vector δj = [χij]^T, where χij ∈ [0, 1] is the rank of document dj returned as an answer to query qi. The entries χij are then normalized so as to sum to 1. According to this query-vector model, documents that are not hit by any query in the query log are represented by null query-vectors. This is a very important feature of our model, because it allows us to remove more than half of the documents from the collection without losing precision; this is described in detail below.
The contingency matrix introduced above can now be formally defined as Υ = [δi], 1 ≤ i ≤ n, where n is the number of distinct documents that are returned as answers to the queries (empty documents, i.e., documents never recalled by any query, are removed from the matrix to speed up the convergence of the algorithm; only recalled documents are considered). Each entry Υij = rij / Σij rij is the rank of document dj for query qi, normalized so that the entries of Υ sum up to one.
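A minimal sketch (our naming) of how the query-vector contingency matrix could be assembled from a query log and the ranked results returned by a reference index; documents never recalled by any query simply never obtain an entry, which reflects the removal of empty documents mentioned above.

    def build_contingency(query_results):
        """query_results: one entry per query q_i, holding (doc_id, r_ij)
        pairs. Returns Upsilon as a sparse dict {(i, doc_id): value}
        whose entries sum to one."""
        entries = {}
        for i, results in enumerate(query_results):
            for doc, r in results:
                entries[(i, doc)] = entries.get((i, doc), 0.0) + r
        total = sum(entries.values())           # global normalization
        return {key: r / total for key, r in entries.items()}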
The contingency matrix just defined can be fed to the co-clustering algorithm to obtain the document clusters identifying the partitions. Furthermore, co-clustering considers both documents and queries. We thus have two different kinds of results: (i) groups made of documents answering similar queries, and (ii) groups of queries with similar results. The first kind of result is used to build the document partitioning strategy, while the second is the key to our collection selection strategy (see below). The result of co-clustering is a matrix P defined as:

    P(qca, dcb) = Σ_{i∈qca} Σ_{j∈dcb} rij

In other words, each entry P(qca, dcb) sums the contributions of rij for the queries in query cluster a and the documents in document cluster b. We call this matrix simply PCAP. The values of PCAP are important because they measure the relevance of a document cluster to a given query cluster. This naturally induces a simple but effective collection selection algorithm.
We used the ideas illustrated above to design a novel document-partitioned PIRS. Our strategy is as follows. First, we train the system with the query log of the training period, by using a reference centralized index to answer all the queries submitted to the system, and we record the top-ranking results for each query. Then, we perform co-clustering on the query-document matrix. The documents are partitioned among several servers according to the results of the clustering. Note that, besides the clusters identified by our strategy, we also need a further cluster containing the documents that are not returned by any query.
For querying our document-partitioned PIRS, we perform collection selection. The servers holding the selected collections are queried, and the results are merged. In order to have comparable document rankings within each core, we distribute the global collection statistics to each server. This way, the ranking functions are consistent, and results can be very easily merged, simply by sorting documents by their rank.
We keep logging the results of each query, even after the end of the training period, in order to further train the system and also to accommodate any topic shift. Topic shift refers to the fact that, over time, the interests of the users of a search engine can change.
For instance, in the case of an unexpected calamity, there can be a sudden increase of queries about the issue. To adjust to this possibility, the co-clustering algorithm can be periodically performed by an off-line server, and documents can be moved from one cluster to another in order to improve precision and reduce server load. Interestingly, there is no need for a central server running a centralized version of the search engine, because the rank returned by each individual server is consistent with the one a central server would return.
Our collection selection strategy is based on the PCAP matrix returned by the co-clustering algorithm. The queries belonging to each query cluster are joined together into query dictionary files: each dictionary file stores the text of every query belonging to a cluster, as a single text file. When a new query q is submitted to our PIRS, we use the TF.IDF metric to find which clusters are the best matches: each dictionary file is considered as a document, which is indexed with the usual TF.IDF technique. This way, each query cluster qci receives a score relative to the query q, denoted rq(qci). Note that, even if we used TF.IDF, we could employ any other ranking metric computed on the basis of text statistics. This ranking is finally used to weight the contribution of the PCAP entry P(i, j) for the document cluster dcj, as follows:

    rq(dcj) = Σi rq(qci) × P(i, j)
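A minimal sketch of this selection step, assuming (our assumption) that the PCAP matrix is available as a list of per-query-cluster rows and that a tfidf_score function over the query dictionary files exists: query clusters are ranked against the incoming query, and their scores weight the PCAP contributions of each document cluster.

    def select_collections(query, query_clusters, pcap, tfidf_score):
        """Rank document clusters for a query:
        r_q(dc_j) = sum_i r_q(qc_i) * P(i, j).

        query_clusters: one dictionary-file text per query cluster;
        pcap[i][j]: relevance of document cluster j to query cluster i;
        tfidf_score(query, text): any text-statistics ranking, e.g. TF.IDF."""
        n_doc_clusters = len(pcap[0])
        scores = [0.0] * n_doc_clusters
        for i, dictionary in enumerate(query_clusters):
            r_qc = tfidf_score(query, dictionary)   # r_q(qc_i)
            for j in range(n_doc_clusters):
                scores[j] += r_qc * pcap[i][j]
        # Query servers in decreasing score order; the server holding the
        # documents never returned by any query is always queried last.
        return sorted(range(n_doc_clusters), key=lambda j: -scores[j])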
The last server, i.e., the one holding the documents that are not returned by any query in the log, is always queried last, because the PCAP matrix chooses only among the clusters actually returned by our co-clustering strategy.
We performed our tests using the WBR99 collection. WBR99 consists of 5,939,061 documents, about 22 GB uncompressed, representing a snapshot of the Brazilian Web (domain .br) as spidered by www.todobr.com.br, and comprises about 2,700,000 distinct terms. We could also use the query log of www.todobr.com.br for the period January through October 2003. Due to the nature of the data, we do not have a list of human-chosen relevant documents for each query: WBR99 includes only 50 evaluated queries. Thus, following the example of previous works [24], we consider the top-ranking pages returned by a central index to be relevant. In
particular, when measuring precision at 5, we consider only the top five documents (on the full index) to be relevant; similarly for precision at 10, and so on. For our experiments, we used Zettair (available under a BSD-style license at http://www.seg.rmit.edu.au/zettair/), a compact and fast text search engine designed and written by the Search Engine Group at RMIT University. We modified it so as to implement our collection selection strategies (CORI and PCAP). Due to limited space, we cannot show here that our partitioning strategy outperforms the random allocation when using CORI; we also have evidence supporting the fact that our partitioning greatly outperforms a partitioning based on k-means. Here, we measure the performance of our collection selection strategy w.r.t. CORI. In this case, we test the two different selection strategies on the same document allocation, as generated by the co-clustering algorithm.
The first experimental results (Table 2) show that, in the fourth week (the first after training), PCAP performs better than CORI: the precision reached with the first cluster is improved by a factor between 11% and 15% (highlighted entries). We also verified that the training is robust to topic shift: using the fifth week (the second after training) as our benchmark, we did not measure great differences in precision.
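For clarity, a small sketch of the pseudo-relevance evaluation adopted above (function names are ours): the top-k documents of the full central index are treated as the relevant set when measuring precision at k.

    def precision_at_k(selected_results, central_results, k):
        """Precision at k, taking the top-k results of the central (full)
        index as the relevant set, following [24]."""
        relevant = set(central_results[:k])
        retrieved = selected_results[:k]
        return sum(1 for d in retrieved if d in relevant) / k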
5 Conclusions
In this paper we presented three different techniques for enhancing the efficiency and/or efficacy of modern Web IR systems. These three case studies have as a common denominator the exploitation of past usage information to derive knowledge useful for guiding different policies and optimizations.
In the first use case, we discussed SDC, our caching policy, which exploits knowledge about the queries submitted to the WSE in the past to improve hit ratios. The analysis of past queries showed that many popular queries are resubmitted after long intervals; these queries would surely cause cache misses in an LRU cache. On the other hand, several queries are popular only within relatively short time intervals, and can be effectively handled by an LRU-based dynamic cache.
    Clusters used:          1      2      4      8     16     17
    CORI  Precision at 5:   1.57   2.27   2.99   3.82   4.89   5.00
          Precision at 10:  3.06   4.46   5.89   7.56   9.77  10.00
          Precision at 20:  6.01   8.78  11.64  15.00  19.52  20.00
    PCAP  Precision at 5:   1.74   2.30   2.95   3.83   4.85   5.00
          Precision at 10:  3.45   4.57   5.84   7.60   9.67  10.00
          Precision at 20:  6.93   9.17  11.68  15.15  19.31  20.00
Table 2: Precision of the CORI and PCAP strategy, when using the first 1, 2, 4, 8, 16 or 17 clusters. Queries from the fourth week.
SDC thus maintains the most popular queries and the associated results in a read-only static section of the cache. Only the queries that cannot be satisfied by the static cache section compete for the use of a dynamic, second-level cache. The benefits of adopting SDC were experimentally shown: in all the tests conducted, our strategy remarkably outperformed both purely static and purely dynamic caching policies.
In the second use case, recurrent PIRS usage patterns were used for assigning the terms of the lexicon to the various partitions of a term-partitioned index, in a way that allows the average number of servers activated per query to be reduced, and the workload among the servers to be evenly balanced. The partitioning algorithm is driven by a Frequent Itemset Mining technique: frequent patterns succinctly capture a global knowledge of PIRS usage, which is exploited in a simple greedy algorithm to assign each term of the lexicon to the partition that minimizes our cost function. Experiments showed that the devised term assignments remarkably improved PIRS throughput and query response time with respect to both random and bin packing term assignments.
Finally, we presented a novel approach to document partitioning and collection selection, based on co-clustering queries and documents. Our new representation of documents as query-vectors allowed us to efficiently perform an effective partitioning. It also induces a very compact representation of the resulting collections, used for implementing
the collection selection function. We showed that our selection strategy outperformed CORI by about 10%, with a smaller representation of the available collections.
The three applications discussed are remarkable examples of the increasing importance of Web Mining techniques for improving Web surfing. We showed that the query logs collected by WSEs are valuable sources of information on the way in which communities and individual users interact with search tools, and that such knowledge can be used to enhance both the efficiency and the efficacy of the Web IR process.
Bibliography
[1] Claudine Santos Badue, Ricardo A. Baeza-Yates, Berthier A. Ribeiro-Neto, and Nivio Ziviani, Distributed query processing using partitioned inverted files, Proceedings of the Eighth International Symposium on String Processing and Information Retrieval (SPIRE 2001), 2001, pp. 10–20.
[2] L. A. Barroso, J. Dean, and U. Hölzle, Web search for a planet: The Google cluster architecture, IEEE Micro 23 (2003), no. 2, 22–28.
[3] S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search engine, Proceedings of the WWW7 Conference / Computer Networks, vol. 1–7, April 1998, pp. 107–117.
[4] Brendon Cahoon, Kathryn S. McKinley, and Zhihong Lu, Evaluating the performance of distributed architectures for information retrieval using a variety of workloads, ACM Trans. Inf. Syst. 18 (2000), no. 1, 1–43.
[5] Owen de Kretser, Alistair Moffat, Tim Shimmin, and Justin Zobel, Methodologies for distributed information retrieval, ICDCS '98: Proceedings of the 18th International Conference on Distributed Computing Systems (Washington, DC, USA), IEEE Computer Society, 1998, pp. 66–73.
[6] I. S. Dhillon, S. Mallela, and D. S. Modha, Information-theoretic co-clustering, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), 2003, pp. 89–98.
[7] Oren Etzioni, The world-wide web: Quagmire or gold mine?, Commun. ACM 39 (1996), no. 11, 65–68.
[8] Tiziano Fagni, Raffaele Perego, Fabrizio Silvestri, and Salvatore Orlando, Boosting the performance of web search engines: Caching and prefetching query results by exploiting historical usage data, ACM Trans. Inf. Syst. 24 (2006), no. 1, 51–78.
[9] Byeong-Soo Jeong and Edward Omiecinski, Inverted file partitioning schemes in multiple disk systems, IEEE Trans. Parallel Distrib. Syst. 6 (1995), no. 2, 142–153.
[10] Raymond Kosala and Hendrik Blockeel, Web mining research: A survey, SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery and Data Mining, ACM 2 (2000).
[11] Ronny Lempel and Shlomo Moran, Predictive caching and prefetching of query results in search engines, WWW '03: Proceedings of the 12th International Conference on World Wide Web (New York, NY, USA), ACM Press, 2003, pp. 19–28.
[12] C. Lucchese, S. Orlando, R. Perego, and F. Silvestri, Statistics driven term partitioning to enhance performance of parallel IR systems, Submitted paper, 2006.
[13] I. A. Macleod, T. P. Martin, B. Nordin, and J. R. Phillips, Strategies for building distributed information retrieval systems, Information Processing & Management 23 (1987), no. 6, 511–528.
[14] Evangelos P. Markatos, On caching search engine results, Proceedings of the 5th International Web Caching and Content Delivery Workshop, 2000.
[15] A. Moffat, W. Webber, and J. Zobel, Load balancing for term-distributed parallel retrieval, Proceedings of the SIGIR 2006 Conference, ACM, 2006.
[16] Alistair Moffat, William Webber, Justin Zobel, and Ricardo Baeza-Yates, A pipelined architecture for distributed text query evaluation, Manuscript submitted for publication.
[17] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri, Adaptive and resource-aware mining of frequent sets, Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM '02), 2002, pp. 338–345.
[18] Stefan Podlipnig and László Böszörményi, A survey of web cache replacement strategies, ACM Comput. Surv. 35 (2003), no. 4, 374–398.
[19] D. Puppin, F. Silvestri, and D. Laforenza, Query-driven document partitioning and collection selection, Proceedings of the 1st INFOSCALE Conference, May–June 2006, pp. 107–117.
[20] Berthier Ribeiro-Neto, Edleno S. Moura, Marden S. Neubert, and Nivio Ziviani, Efficient distributed algorithms to build inverted files, SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New York, NY, USA), ACM Press, 1999, pp. 105–112.
[21] Patricia Correia Saraiva, Edleno Silva de Moura, Nivio Ziviani, Wagner Meira, Rodrigo Fonseca, and Berthier Ribeiro-Neto, Rank-preserving two-level caching for scalable search engines, SIGIR '01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New York, NY, USA), ACM Press, 2001, pp. 51–58.
[22] Anthony Tomasic and Hector Garcia-Molina, Performance of inverted indices in shared-nothing distributed text document information retrieval systems, PDIS '93: Proceedings of the Second International Conference on Parallel and Distributed Information Systems (Los Alamitos, CA, USA), IEEE Computer Society Press, 1993, pp. 8–17.
[23] C. J. van Rijsbergen, Information retrieval, Butterworths, 1979.
[24] Jinxi Xu and W. B. Croft, Effective retrieval with distributed collections, Proceedings of the SIGIR '98 Conference (Melbourne, Australia), August 1998.
[25] Jinxi Xu and W. Bruce Croft, Cluster-based language models for distributed retrieval, Research and Development in Information Retrieval, 1999, pp. 254–261.
RAIS - An Information Sharing System for Peer-to-peer Communities
Marco Mari, Agostino Poggi, Michele Tomaiuolo
Dipartimento di Ingegneria dell'Informazione - Università degli Studi di Parma
Viale delle Scienze, 181A – 43100 – Parma
{mari, poggi, tomamic}@ce.unipr.it
Abstract
This paper presents RAIS, a peer-to-peer multi-agent system for the sharing of information among a community of users connected through the Internet. Each agent platform acts as a “peer” of the system and is based on three agents: a personal assistant, an information finder and a directory facilitator; moreover, another agent, called personal proxy assistant, allows a user to remotely access her/his agent platform. RAIS has been designed and implemented on top of well-known technologies and software tools, with the aim of supporting authorized and authenticated users in information retrieval. RAIS offers a search power similar to that of Web search engines, but avoids the burden of publishing information on the Web and guarantees controlled and dynamic access to information. The use of agent technologies has made straightforward the realization of three of the main features of the system: i) filtering of the information coming from different users on the basis of the previous experience of the local user, ii) pushing of new information that can be of interest for a user, and iii) delegation of access capabilities on the basis of a reputation network, built by the agents of the system over the community of its users.
Keywords: Information sharing, multi-agent systems, peer-to-peer.
1 Introduction
1.1 Peer-to-peer systems. . .
The rising popularity of peer-to-peer systems has brought a constant, rapid development of new protocols and tools for exchanging files, and the number of users of peer-to-peer software is constantly growing. The CacheLogic research on peer-to-peer systems in 2005 [1] estimated that over 60% of the whole Internet traffic is generated by peer-to-peer applications, while other studies estimate an even higher percentage. Despite this development, the usage of such systems is basically unchanged since their first appearance: the user inserts some keywords, and he/she receives a list of files containing one or more of the requested keywords in their names (or in some associated metadata, like ID3 [2] for mp3 files). This kind of behaviour is particularly suitable for multimedia files, and the already mentioned CacheLogic research [1] confirms that 61.44% of peer-to-peer traffic is generated by video files and 11.34% by audio files. The greater part of the remaining 27.22% is composed of CD images and UNIX-oriented file types, almost exclusively found on the BitTorrent network [3]; in fact, BitTorrent is more and more used to share legitimate content, for example game demos or Linux distributions. However, the sharing of documents such as Microsoft Office (DOC, XLS, PPT, . . . ) or Adobe Acrobat (PDF) files represents only a minimal part of the peer-to-peer generated traffic. The main reason for this poor diffusion is probably the difficulty of searching for such files: as a matter of fact, it is nearly impossible to describe the content of a large document (a research report, a paper, a book, . . . ) in its title only. For searching and sharing this kind of file it would be more appropriate to search the whole content of the document, using a content-sharing approach rather than a classical file-sharing approach.
1.2 . . . and Desktop Search Applications
The storage capacity of hard disks is constantly growing, and the newly available space is quickly filled with a large amount of data in different and heterogeneous formats. The ordering of such data is a time-wasting and often boring task: the result is that more and more files containing important information are lost or forgotten on hard drives. The classical file searching tools (e.g., the Windows search function) are not effective: for each request they analyse the whole drive, and they support only a few file formats. In recent months, the providers of the main Web search engines (Google, Microsoft with MSN, and Yahoo!) have released desktop search tools that make a search on a local drive as easy and fast as a Web search. These tools run in the background (when the CPU load is low) and index the content of a wide range of file formats (e.g., Office, PDF, e-mail, HTML, . . . ) in a way similar to a Web crawler. Moreover, Google has released an SDK [9] for its desktop search software [10]; the SDK empowers developers to write plug-ins thanks to a set of APIs using COM and HTML/XML.
This paper presents a system, called RAIS, that tries to couple the capability of sharing documents in a peer-to-peer network with the search power of a desktop search application. The next section introduces the main features and the behaviour of the RAIS system. Section three describes how the system has been designed and implemented by using some well-known technologies and software tools. Section four presents our future research directions and, finally, section five gives some concluding remarks.
2 RAIS
RAIS (Remote Assistant for Information Sharing) is a peer-to-peer, multi-agent system composed of different agent platforms connected through the Internet. Each agent platform acts as a “peer” of the system and is based on three agents: a personal assistant, an information finder and a directory facilitator; moreover, another agent, called personal proxy assistant, allows a user to remotely access her/his agent platform. Figure 1 shows the RAIS multi-agent system architecture.
A personal assistant (PA) is an agent that manages the interaction between the RAIS system and the user. This agent receives the user's queries, forwards them to the available information finders and presents
the results to the user. Moreover, a PA allows the user to subscribe to be notified about new documents and information on topics in which she/he is interested. Finally, a PA maintains a profile of its user's preferences: in fact, the user can rate the quality of the information coming from another user for each search keyword (the utility of this profile will become clear after the presentation of the system behaviour).
Figure 1: The RAIS multi-agent system.
An information finder (IF) is an agent that searches for information in the repository of the computer where it lives, and provides this information both to its own user and to the other users of the RAIS system. An IF receives users' queries, finds the appropriate results and filters them on the basis of its user's policies (e.g., results from non-public folders are not sent to other users). Moreover, an IF monitors the changes in the local repository and pushes new information to a PA when such information matches the interests subscribed by that PA.
A personal proxy assistant (PPA) is an agent that represents a point of access to the system for users that are not working on their own personal computer. A PPA is intended to run on a pluggable device (e.g., a USB key) on which the PPA agent is stored together with the RAIS binary and configuration files. Therefore, when the user starts the RAIS system from the pluggable device, her/his PPA connects to the user's PA and provides the user with all the functionalities of her/his PA. For security reasons, only a PA can create the corresponding PPA and can generate the authentication key that is shared with the PPA to secure their communication. Therefore, for a successful connection, the PPA has to send the authentication key, and then the user must provide her/his username and password.
Finally, the directory facilitator (DF) is responsible for registering the agent platform in the RAIS network. The DF is also responsible for informing the agents of its platform about the addresses of the agents that live in the other platforms available on the RAIS network (e.g., a PA can ask for the addresses of the active IF agents).
2.1 Searching and Pushing Information
To understand the system behaviour, we present two practical scenarios. In the first, a user asks her/his PA to search for some information; in the second, the user subscribes her/his interest in a topic. In both cases the system provides the user with a set of related information.

In the first scenario, the system activity can be divided into four steps: i) search, ii) results filtering, iii) results sending and presentation, and iv) retrieval.

Search: the user requests a search from her/his PA, indicating a set of keywords and the maximum number of results. The PA asks the DF for the addresses of the available IF agents and sends the keywords to those agents. An information finder applies the search to its repository only if the querying user has access to at least part of the information stored there.

Results filtering: each IF filters the search results on the basis of the querying user's access permissions.

Results sending and presentation: each IF sends the filtered list of results to the querying PA. The PA orders the results as it receives them, omits duplicates, and presents them to its user.
Retrieval: after examining the results list, the user can ask her/his PA to retrieve the information corresponding to an element of the list. The PA forwards the request to the appropriate IF, waits for its answer and presents the information to the user.

In the second scenario, the system activity can be divided into five steps: i) subscription, ii) monitoring and results filtering, iii) results sending and user notification, iv) results presentation, and v) retrieval.

Subscription: the user requests a subscription from her/his PA, indicating a set of keywords describing the topic of interest. The PA asks the DF for the addresses of the available IF agents and sends the keywords to those agents. Each IF registers the subscription if the querying user has access to at least part of the information stored in its repository.

Monitoring and results filtering: each IF periodically checks whether there is new information satisfying its subscriptions. If so, the IF filters the results on the basis of the querying user's access permissions.

Results sending and user notification: each IF sends the filtered list of results to the querying PA. The PA orders the results as it receives them, omits duplicates, and stores them in its memory. Moreover, it notifies its user of the new available information by sending her/him an email.

Results presentation: the first time the user logs into the RAIS system, the PA presents her/him with the new results.

Retrieval: as in the previous search scenario, the user can retrieve some of the information indicated in the list of results.

As introduced above, a PA receives from the user a constraint on the number of results to provide (Nr) and uses it to limit the number of results asked of each IF agent. The number of results that each IF agent may send is neither Nr nor Nr divided by the number of IF agents (Nr/Nif), but a number between Nr/Nif and Nr for which the PA is reasonably sure to provide at least Nr results to its user without the risk of receiving a burden of unnecessary data. Moreover, each IF, before sending the list of results, creates a digest(1) of each result and sends the digests together with the list. The PA then uses the digests to omit duplicate results coming from different IF agents.

(1) A digest is a compact representation, in the form of a single string of digits, that has the property of being different for different data [15]. A digest is usually used to compare remote files without the need to move them.
After receiving the results and filtering out duplicates, the PA has the task of selecting the Nr results to send to its user (if more than Nr were received) and of ordering them. Of course, each IF orders its results before sending them to the PA, but the PA has no information on how to order results coming from different IF agents. Therefore, the PA uses one of two simpler solutions, according to its user's request: i) the results are divided fairly among the different sources of information, or ii) the results are divided among the different sources on the basis of the user's preferences. User preferences are represented by triples of the form <source, keyword, rate>, where source indicates an IF, keyword a term used for searching information, and rate a number representing the quality of the information (related to the keyword) coming from that IF. Each time a user gets a result, she/he can rate its quality and, as a consequence, the PA updates the preferences in the user profile it maintains.

The information stored in the different repositories of a RAIS network is not accessible to all users of the system in the same way. In fact, it is important to prevent access to private documents and personal files, but also to files reserved to a restricted group of users (e.g., the participants of a project). The RAIS system takes care of users' privacy by allowing access to information on the basis of the identity, roles and attributes of the querying user, defined in a local knowledge base of trusted users. In this case, it is the user who defines who can access her/his information and in which way, but the user can also allow access to unknown users by enabling a certificate-based delegation built on a reputation network of the users registered in the RAIS community. For instance, if user Ui enables delegation and grants user Uj access to his repository with capabilities C0, and Uj grants user Uk access to his own repository with the same capabilities C0, then Uk can access Ui's repository with the same capabilities as Uj. Trust delegation can be useful when the system is used by open and distributed communities, e.g. to share documents among the members of an Open Source project.
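The delegation chain just described amounts to reachability in a graph of grants. The following Java sketch is our own illustration of that check, not the RAIS implementation, which realizes delegation with certificates, as discussed later:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch: Uk may access Ui's repository with capabilities
    // C0 if a chain of C0 grants leads from Ui to Uk.
    public class DelegationGraph {
        // grants.get(u) = users to whom u granted capabilities C0
        private final Map<String, Set<String>> grants =
            new HashMap<String, Set<String>>();

        public void grant(String from, String to) {
            Set<String> granted = grants.get(from);
            if (granted == null) {
                granted = new HashSet<String>();
                grants.put(from, granted);
            }
            granted.add(to);
        }

        // Breadth-first search for a delegation chain owner -> ... -> user.
        public boolean mayAccess(String owner, String user) {
            Set<String> visited = new HashSet<String>();
            LinkedList<String> frontier = new LinkedList<String>();
            frontier.add(owner);
            while (!frontier.isEmpty()) {
                String u = frontier.removeFirst();
                if (u.equals(user)) return true;
                if (!visited.add(u)) continue;
                Set<String> next = grants.get(u);
                if (next != null) frontier.addAll(next);
            }
            return false;
        }
    }

With grant("Ui", "Uj") and grant("Uj", "Uk"), mayAccess("Ui", "Uk") returns true, matching the example above.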
2.2 Security
The information stored in the different repositories of a RAIS network is not accessible to all users of the system in the same way, because it is important to prevent access to private documents and personal files.
The RAIS system takes care of users' privacy by allowing access to information on the basis of the identity, roles and attributes of the querying user. Of course, different levels of privacy can be assigned to the information stored in the same repository.

Various models exist to deal with the authorization problem [17]. The best known is the Discretionary Access Control (DAC) model. It is the traditional model, based on Access Control Lists: each user is associated with a list of granted access rights and, on the basis of this list of permissions, is allowed or denied access to a particular resource. A resource administrator is responsible for editing the Access Control Lists.

Another popular model is Mandatory Access Control (MAC), used to implement Multilevel Secure (MLS) systems. In these systems, each resource is labeled according to a security classification. Correspondingly, each principal is assigned a clearance, which is associated with a classification list containing all the types of resources the principal is allowed to access, depending on their classification. Multilevel security is particularly popular in the military field and in inherently hierarchical organizations.

Another interesting model is Role Based Access Control (RBAC). This model is centered around a set of roles: each role can be granted a set of permissions, and each user can be assigned one or more roles. A many-to-many relationship binds principals to the roles they are assigned, and another many-to-many relationship binds permissions to the roles they are granted, thus creating a level of indirection between a principal and his access rights. This also leads to a better separation of duties (between the assignment of principals to roles and the definition of role permissions), makes it possible to implement privilege inheritance schemes between superior and subordinate roles, and permits temporary delegation of some of the assigned roles to other principals.

Following the RBAC model, each resource manager in our system (i.e. each node in the peer-to-peer network) has to deal with three main concepts: principals (authenticable entities which act as users of resources and services), permissions (rights to access resources or use services) and roles. The fundamental principle here is that each node is in charge of defining its own roles, and of assigning principals to them.
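As an illustration of this level of indirection (our own minimal sketch, not the RAIS implementation), the two many-to-many relations can be kept in simple maps:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Minimal RBAC sketch: principals are assigned roles, roles are
    // granted permissions; roles mediate between the two.
    public class RbacRegistry {
        private final Map<String, Set<String>> rolesOfPrincipal =
            new HashMap<String, Set<String>>();
        private final Map<String, Set<String>> permissionsOfRole =
            new HashMap<String, Set<String>>();

        public void assignRole(String principal, String role) {
            put(rolesOfPrincipal, principal, role);
        }

        public void grantPermission(String role, String permission) {
            put(permissionsOfRole, role, permission);
        }

        // A principal is allowed an action if any of his roles grants it.
        public boolean isAllowed(String principal, String permission) {
            Set<String> roles = rolesOfPrincipal.get(principal);
            if (roles == null) return false;
            for (String role : roles) {
                Set<String> perms = permissionsOfRole.get(role);
                if (perms != null && perms.contains(permission)) return true;
            }
            return false;
        }

        private static void put(Map<String, Set<String>> map,
                                String key, String value) {
            Set<String> set = map.get(key);
            if (set == null) { set = new HashSet<String>(); map.put(key, set); }
            set.add(value);
        }
    }

Changing a role's permissions or a principal's roles never requires touching the other relation, which is exactly the separation of duties noted above.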
In RAIS, authentication and authorization are performed on the basis of the local knowledge base of trusted users, though they can be delegated to external entities through an explicit, certificate-based delegation [4]. In this sense, the system fully adheres to the principles of trust management. The definition of roles and attributes is also made in a local namespace, and the whole system is, in this regard, completely distributed. Local names are distinguished by prefixing them with the principal defining them, i.e. a hash of the public key associated with the local runtime. Links among different local namespaces, again, can be explicitly defined by issuing appropriate certificates. In this sense, local names are the distributed counterpart of roles in Role Based Access Control frameworks [14]. Like roles, local names can be used as a level of indirection between principals and permissions: both a local name and a role represent at the same time a set of principals and a set of permissions granted to those principals. But while roles are usually defined in a centralized fashion by a system administrator, local names are fully decentralized. This way, they scale better to internet-wide, peer-to-peer applications, without weakening in any way the principles of trust management. In RAIS, the user can not only grant permission to access her/his own files, but can also grant permission to upload a new version of one or more existing files; in this case the PA informs its user about the updated files the first time she/he logs in. This functionality can be useful for the members of a workgroup involved in common projects or activities.
2.3 Mobile User Support
People travelling for work often need to access their own computer from a remote system. In this situation, a solution could be to install a VNC server on the desktop computer and to find a system with a VNC client while travelling. This solution has the advantage that the user gains complete control of her/his remote PC, but it also has two main drawbacks: it is not easy to find computers with VNC clients available, and VNC connects to only one computer, not to the whole set of files and information of a workgroup. For users who do not require complete control over a remote computer, but need to search and access a distributed set of documents, we have included in our system a remote search feature. The user can ask her/his PA to create a PPA on a pluggable device, e.g., a USB key or a removable hard disk. The PA copies onto the device the RAIS run-time, the PPA and the authentication key shared by the PPA and the PA itself.
When the user inserts the pluggable device into another computer, she/he can immediately launch the PPA and connect to the corresponding PA. The way of using the RAIS system is then analogous to working on one's own computer, except for the interactions between the PPA and the PA, which are, however, transparent to the user. At initialization, the PPA sends the authentication key to the PA; if the key matches the PA's, the user can provide her/his username and password and enter the system (a step that the user must also perform when using the RAIS system from her/his own computer). After these two steps, the PPA acts as a simple proxy of the remote PA.
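The two-step connection can be summarized as follows; this is our own sketch with hypothetical interfaces, not the actual RAIS code:

    // Sketch of the PPA-to-PA connection: key check first, then the
    // usual username/password login.
    public class PpaConnection {
        interface PersonalAssistant {
            boolean verifyAuthenticationKey(byte[] key);
            boolean login(String username, String password);
        }

        // Returns true only if both the device key and the credentials match.
        public boolean connect(PersonalAssistant pa, byte[] deviceKey,
                               String username, String password) {
            if (!pa.verifyAuthenticationKey(deviceKey)) {
                return false;   // key on the pluggable device does not match
            }
            return pa.login(username, password);
        }
    }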
3 RAIS Development Components
The RAIS system has been designed and implemented taking advantage of agent, peer-to-peer, information retrieval and security management technologies and, in particular, of three main software components: JADE [12], JXTA [8][13] and Google Desktop Search [10].

RAIS agent platforms have been realized by using JADE. JADE (Java Agent Development Framework) [12] is probably the best-known agent development environment enabling the integration of agents with knowledge and Internet-oriented technologies. JADE makes it possible to build agent systems for the management of networked information resources in compliance with the FIPA specifications [5]. It provides a middleware for the development and execution of agent-based applications which can work and interoperate seamlessly in both wired and wireless environments. Moreover, JADE supports the development of multi-agent systems through a predefined, programmable and extensible agent model and a set of management and testing tools. Currently, JADE is considered the reference implementation of the FIPA specifications and is one of the most used and promising agent development frameworks: it is available under an LGPL open source license, it has a large user group involving more than two thousand active members, it has been used to realize real systems in different application sectors, and its future development is guided by a governing board involving some important industrial companies.
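As an illustration of how a PA can locate the available IF agents through the directory facilitator, the following sketch (ours) uses the standard JADE DFService API; the service type name "information-finder" is our own assumption:

    import jade.core.Agent;
    import jade.domain.DFService;
    import jade.domain.FIPAException;
    import jade.domain.FIPAAgentManagement.DFAgentDescription;
    import jade.domain.FIPAAgentManagement.ServiceDescription;

    // A PA queries the Directory Facilitator for all agents registered
    // with a given service type.
    public class FinderLookup {
        public static DFAgentDescription[] findInformationFinders(Agent pa)
                throws FIPAException {
            DFAgentDescription template = new DFAgentDescription();
            ServiceDescription sd = new ServiceDescription();
            sd.setType("information-finder");
            template.addServices(sd);
            // DFService.search asks the platform's DF for matching descriptions
            return DFService.search(pa, template);
        }
    }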
The JADE development environment does not by itself support real peer-to-peer systems: it only provides the possibility of federating different agent platforms through a hierarchical organization of the platform directory facilitators, based on a priori knowledge of the agent platforms' addresses. Therefore, we extended the JADE directory facilitator to realize real peer-to-peer networks of agent platforms thanks to the JXTA technology [13] and to two preliminary FIPA specifications, for the Agent Discovery Service [6] and for the JXTA Discovery Middleware [7].

JXTA technology [8][13] is a set of open, general-purpose protocols that allow any connected device on the network (from cell phones to laptops and servers) to communicate and collaborate in a peer-to-peer fashion. The project was originally started by Sun Microsystems, but its development was kept open from the very beginning. JXTA comprises six protocols allowing discovery, organization, monitoring and communication among peers. These protocols are all implemented on top of an underlying messaging layer which binds the JXTA protocols to different network transports.

FIPA has acknowledged the growing importance of the JXTA protocols and has released some specifications for the interoperability of FIPA platforms connected to peer-to-peer networks. In particular, [7] describes a Generic Discovery Service (GDS) for discovering agents and services deployed on FIPA platforms working together in a peer-to-peer network. RAIS integrates a JXTA-based Agent Discovery Service (ADS), developed in accordance with the relevant FIPA specifications to implement a GDS. This way, each RAIS platform connects to the Agent Peer Group, as well as to other system-specific peer groups. The Generic Discovery Protocol is then used to advertise and discover df-agent-descriptions, wrapped in Generic Discovery Advertisements, in order to implement a DF service which, in the background, spans a whole peer group.

Different techniques and software tools can be used for searching information in a local repository. If the information is stored in the form of files, the Google Desktop Search system [9] can be considered a suitable solution, because Google provides an SDK for developing plug-ins based on its desktop search system. The API that comes with the SDK uses COM objects, so it is not directly available for Java development, but a bridge between the API and Java is provided by the Open Source project GDAPI [11]. Google Desktop Search indexes the content of the files of a local drive in a way similar to the Google Web crawler, providing a search engine that is fast, effective and supports a wide range of file formats.
As introduced before, authentication and authorization are performed locally, based on local knowledge and local trust relationships. Local authorization decisions, however, can also be extended to external entities through an explicit, certificate-based delegation. In fact, the theory of RAIS delegation certificates is founded on the SPKI/SDSI specifications [4], though the certificate encoding is different. As in SPKI, principals are identified by their public keys, or by a cryptographic hash of their public keys. Instead of s-expressions, RAIS uses signed XML documents, in the form of SAML assertions [16], to convey identity, role and property assignments. As in SPKI, delegation is made possible when the delegating principal issues a certificate whose subject is a name defined by another, trusted, principal. The latter can subsequently issue other certificates to assign other principals (public keys) to its local name. In this sense, local names act as distributed roles [14]. Finally, the extraction of a digest for each search result is required to avoid presenting duplicate results to the user. This feature is provided by a Java implementation of the MD5 hash function [15].
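As an illustration, such a digest can be computed with the MD5 implementation available in the standard java.security package; the class and method names below are our own, not the RAIS code:

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Hash a result's content so that a PA can discard duplicates
    // received from different IF agents.
    public class ResultDigest {
        public static String md5(byte[] content)
                throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(content);
            StringBuffer hex = new StringBuffer();
            for (int i = 0; i < hash.length; i++) {
                String h = Integer.toHexString(hash[i] & 0xff);
                if (h.length() == 1) hex.append('0');
                hex.append(h);
            }
            return hex.toString();   // equal strings mean duplicate results
        }
    }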
4 Future Work
As a first step in our future research, we are planning to support other desktop search tools. Google Desktop Search is an excellent choice, but not all users want to install it, and our project has the limitation of being tied to both Google Desktop Search and GDAPI. To overcome this limitation, we are considering developing a simple desktop search tool directly integrated in our application. This choice would have two main advantages:

• to remove the project's dependency on third-party tools and APIs;

• to speed up the application: a dedicated desktop search would index only shared content (not the whole hard disk) and would be tailored to our project requirements.

A good starting point for developing a desktop search application is Lucene [18], the widely used textual search engine released under an Open Source license by the Apache Software Foundation. Moreover, a Lucene subproject, Nutch [19], provides useful add-ons to index not only textual content but also the most widespread file formats (DOC, PDF, ...).
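As a rough illustration of what such a dedicated tool could look like, the following sketch (ours) indexes and searches a shared document, assuming a Lucene 2.x-era API; field names and paths are our own assumptions:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // Minimal shared-content index: only shared files are indexed.
    public class SharedIndex {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            // Index only the shared folder, not the whole hard disk.
            IndexWriter writer = new IndexWriter("shared-index", analyzer, true);
            Document doc = new Document();
            doc.add(new Field("path", "/shared/report.txt",
                              Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("content", "notes on the RAIS architecture",
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();

            IndexSearcher searcher = new IndexSearcher("shared-index");
            Query query = new QueryParser("content", analyzer).parse("architecture");
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("path"));
            }
            searcher.close();
        }
    }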
We are also studying the problem of search results presentation. At the moment, RAIS presents the results in the order they arrive. For this kind of application it could be useful to present results ordered by relevance, but this is a complex task for mainly two reasons:

• desktop search applications only provide an ordered list of results, not the score given to each document, probably because scores would reveal too much of the indexing algorithm;

• even with a score available, it would refer to the system on which the document is located, not to the whole network.

Our intent is to equip our dedicated desktop search application with an algorithm for sorting a list of documents distributed over a network of peers. The starting point we are studying is once again Lucene, which uses a slightly modified version of the well-known TF-IDF [20] algorithm; a deeper discussion of these problems is beyond the scope of this paper.

Finally, we are applying the concepts that guided the design of RAIS to a large-scale project: while RAIS implements security and delegation features that make the system particularly suitable for small and medium communities, we want to follow the same approach with support for the widely used Gnutella peer-to-peer network. The goal is to enhance a peer-to-peer client with the capabilities of a desktop search application. Thanks to this enhancement, the results of a user's search no longer depend only on the titles of shared files, but also on their content.
5 Conclusion
In this paper, we presented a peer-to-peer multi-agent system, called RAIS (Remote Assistant for Information Sharing), supporting the sharing of information among a community of users connected through the internet. RAIS is implemented on top of well-known technologies and software tools for realizing: i) the agent platforms, i.e., JADE; ii) the peer-to-peer infrastructure, i.e., JXTA; iii) the searching of information in the local repository, i.e., Google Desktop Search; and iv) the authentication and authorization infrastructure, i.e., the SPKI/SDSI specifications and SAML assertions.
RAIS can therefore be considered something more than a research prototype coupling the features of Web search engines and of peer-to-peer systems for information sharing. First, RAIS improves the security rules that control user access to information. In addition, it offers search power similar to that of Web search engines, but avoids the burden of publishing information on the Web and guarantees controlled and dynamic access to the information. Moreover, an agent-based implementation of the system simplifies the realization of three main features: i) the filtering of information coming from different users on the basis of the previous experience of the local user, ii) the pushing of new information of possible interest to a user, and iii) the delegation of authorization on the basis of a reputation network built by the agents of the system over the community of its users.
Figure 2: RAIS search graphical user interface.
A first prototype of the RAIS system has already been completed and tested. The prototype includes all basic features, and a graphical user interface simplifies the interaction between the user and the system (see Figure 2). Practical tests of the first prototype were performed by installing the system in different labs and offices of our department and asking some students and colleagues to use it for sharing and exchanging information. Moreover, we tested the system by setting up some computers of a lab with different access policies and by distributing information over their repositories, so as to have different copies of the same information on different computers. The tests successfully covered all system features: basic search, user-defined policies, results filtering, duplicate results management and remote search through a personal proxy assistant. The test results were fully satisfactory and, in particular, the involved users were interested in continuing to use the system to support their work activities. This successful experimentation encouraged us to develop the system further; we are currently studying the best way to introduce new types of information to be managed (e.g., information stored in databases) and to include new techniques for searching such information (e.g., semantic Web techniques).
Bibliography

[1] CacheLogic [2005]. Peer-to-Peer in 2005. Available from: http://www.cachelogic.com/research/p2p2005.php
[2] ID3 metadata specifications. Available from: http://www.id3.org
[3] BitTorrent home page. Available from: http://www.bittorrent.com
[4] Ellison, C., Frantz, B., Lampson, B., Rivest, R., Thomas, B., Ylonen, T. [1999]. SPKI Certificate Theory. RFC 2693, 1999.
[5] FIPA Specifications. Available from: http://www.fipa.org
[6] FIPA Agent Discovery Service Specification [2003]. Available from: http://www.fipa.org/specs/fipa00095/PC00095.pdf
[7] FIPA JXTA Discovery Middleware Specification [2003]. Available from: http://www.fipa.org/specs/fipa00096/PC00096A.pdf
[8] Gong, L. [2001]. JXTA: A network programming environment. IEEE Internet Computing, 5:88-95, 2001.
[9] Google Desktop SDK. Available at: http://desktop.google.com/developer.html
[10] Google Desktop Search. Available at: http://desktop.google.com
[11] GDAPI, Google Desktop Search Java API. Available at: http://gdapi.sourceforge.net
[12] JADE software development framework. Available from: http://jade.tilab.com
[13] JXTA technology. Available from: http://www.jxta.org
[14] Li, N., Mitchell, J.M. [2003]. RT: A Role-based Trust-management Framework. In Proc. of the Third DARPA Information Survivability Conference and Exposition (DISCEX III), pp. 201-212, 2003. Washington, D.C.
[15] Rivest, R.L. [1992]. The MD5 Message Digest Algorithm. Internet RFC 1321, 1992.
[16] SAML - Security Assertion Markup Language. Available from: http://xml.coverpages.org/saml.html
[17] Sandhu, R., Samarati, P. [1994]. Access controls, principles and practice. IEEE Communications, 32(9), pp. 40-48, 1994.
[18] Lucene project home page. Available at: http://lucene.apache.org
[19] Nutch project home page. Available at: http://lucene.apache.org/nutch
[20] TF-IDF algorithm definition on Wikipedia. Available at: http://en.wikipedia.org/wiki/Tf-idf
Extracting Dependency Relations for Opinion Mining
Giuseppe Attardi, Maria Simi
Dipartimento di Informatica, Università di Pisa
[email protected]

Abstract

Intent mining is a special kind of document analysis whose goal is to assess the attitude of the document's author with respect to a given subject. Opinion mining is a kind of intent mining where the attitude is a positive or negative opinion. Techniques based on extracting dependency relations have proven more effective for intent mining than traditional bag-of-words approaches. We propose an approach to opinion mining which uses frequent dependency sub-trees as features for classifying documents and extracting opinions. We developed an efficient multi-language dependency parser, usable on large-scale collections, to analyze documents and extract dependency relations. An opinion retrieval system has been built and is being tested on the TREC 2006 Blog Opinion task.
1 Introduction
When searching the Web for a solution to a technical problem, it is often frustrating to be referred to pages where other people ask about the same problem but offer no solution. This happens, for instance, when searching for a solution or an explanation for an error message displayed by an application. One has to dig through a number of result pages where people ask the same question before finding one that reports
an actual solution to the problem. This is an example where it would be useful if results were marked or grouped according to what they intend to express, such as: problem (description, solution), agreement (assent, dissent), preference (likes, dislikes), statement (claim, denial). We may call this intent classification, as a generalization of sentiment classification, which focuses on opinions such as preference or agreement. The ability to identify and group documents by intent may lead to new tools for knowledge discovery, for instance for generating a research survey that collects relevant opinions on a subject, for determining prevalent judgments about products or technologies, for analyzing reviews, for gathering motivations and arguments from court decision-making or lawmaking debates, or for analyzing linkages in medical abstracts to discover drug interactions. We propose an approach to opinion mining which exploits techniques for extracting dependency relations from textual documents. The techniques of statistical and semantic document analysis that we are exploring can be used for intent classification in applications such as ranking search engine results, desktop search tools, and opinion mining to support decision making at the enterprise level. The techniques for extracting relations from dependency parse trees can also be applied to automatically generate semantic annotations for use in Semantic Web applications.
2 Approach
While traditional text classification tries to assign predefined categories to a document, such as spam/no-spam for e-mail, sentiment or intent identification is a different and challenging task whose goal is the assessment of the writer's attitude toward a subject. Examples include the categorization of customer e-mails and reviews by type of claim, modality or subjectivity. Learning algorithms for text classification typically represent text as a bag-of-words, where the word order and syntactic relations appearing in the original text are ignored. Despite such a naive representation, text classification systems have been quite successful [8].
3 Dependency Relations
For sentiment classification, techniques based on extracting dependency relations have proven more effective than traditional bag-of-words approaches [9]. Dependency relations allow distinguishing statements of opposite polarity, e.g. "I liked the movie" vs. "I didn't like it". Dependency relations are a means of representing the linguistic structure of sentences, as an alternative to the phrase structure trees of classical linguistics. A dependency relation is an asymmetric relation between a head word and a dependent word, with an associated type expressing the grammatical function of the dependency (Subject, Object, Predicate, Determiner, Modifier). In a dependency tree, one word is the head of the sentence, and every other word is either a dependent of that word or a dependent of some other word which connects to the head word through a sequence of dependencies. Here is an example of the dependency relations produced by a dependency parser:
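The figure with the parser output is not reproduced in this edition. As an illustration of the notation (our own example, not the original figure), the sentence "I liked the movie" would yield typed head-dependent relations such as:

    SUB(liked, I)      -- "I" is the subject of "liked"
    OBJ(liked, movie)  -- "movie" is the object of "liked"
    DET(movie, the)    -- "the" is the determiner of "movie"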
A dependency parser can be built without using a formal grammar of the language, which is quite hard to define for a full natural language, unlike other approaches to constituent parsing such as Probabilistic Context-Free Grammars. A dependency parser can in fact be trained on a corpus annotated with just word dependencies, which is easier to produce, even by annotators without deep linguistic knowledge. Another advantage of dealing with dependencies is that they decompose sentences into pairs, i.e. smaller and more modular units than trees, which sometimes have several children. Smaller units occur more frequently, so statistical learning algorithms can rely on a larger number of samples, and hence on more statistical evidence. Recent advances in dependency parsing show that it is feasible to achieve good accuracy with a parser capable of analyzing over 200 sentences per second [1].
4 A Multilanguage Dependency Parser
Statistical parsers are trained on examples of sentences annotated with the corresponding parse tree. The parser extracts from each example a set of features and learns how to associate those features to the correct tree, i.e. it learns a mapping F : S → T from the set of sentences S to the set of trees T (ignoring ambiguity for simplicity). A linear parsing model consists of:

1. a generator GEN : S → 2^T, which generates a set of candidate trees for a sentence;

2. a feature extractor Φ : T → R^n, which gives a vector of the weights of each feature in a tree; a feature corresponds to one dimension in a typically large feature space;

3. a vector of feature weights W ∈ R^n.

Given GEN, Φ and W, a global linear parsing model F can be defined as:

F(s) = argmax_{t ∈ GEN(s)} Φ(t) · W

and training the parser consists in learning the weights W. When given a new sentence to parse, the parser extracts its features and tries to predict the most likely parse tree for the sentence.

We have followed a different approach, inspired by the one proposed by Yamada and Matsumoto [11]. Instead of learning directly which tree to assign to a sentence, the parser learns the actions to use for building the tree. Parsing can thus be cast into a classification problem: at each step the parser applies a classifier to the features representing its current state, and the resulting class is the name of the action to perform on the tree. Any classification algorithm can be used for this purpose, which allows experimenting with alternative methods: we tested SVM, Maximum Entropy and the perceptron algorithm in various configurations. The parser constructs dependency trees by scanning input sentences in left-to-right word order and performing Shift/Reduce parsing actions. The parsing algorithm is fully deterministic and works as follows:
    Input Sentence: (w1, p1), (w2, p2), ..., (wn, pn)
    S = <>
    I = <(w1, p1), (w2, p2), ..., (wn, pn)>
    T = <>
    while I != <> do begin
        cxt = getContext(S, I, T);
        a = estimateAction(model, cxt);
        performAction(a, S, I, T);
    end

I is the sequence of remaining input tokens, where each token is a pair (wi, pi) of a word and its associated lexical information, typically a POS tag and morphological features. S is the stack of analyzed tokens; T is a stack of tokens whose processing has been delayed. During execution, links are added to the tokens to represent the dependencies among nodes, and hence a token may become the head of a sub-tree built by the parser. To estimate the appropriate parsing action, two functions are used: getContext and estimateAction. The function getContext extracts a set of contextual features cxt surrounding the current word, i.e. a few elements from I and S. The function estimateAction estimates an appropriate parsing action a ∈ {Shift, Right, Left, Right2, Left2, Extract, Insert} based on the model learned from training.

We can describe the actions performed by the parser as follows. The state of the parser is represented by a quadruple <S, I, T, A>, where S is the stack, I is the list of (remaining) input tokens, T is a stack of temporary tokens and A is the arc relation for the dependency graph, which consists of a set of labeled arcs (wi, r, wj), with wi, wj ∈ W (the set of tokens) and r ∈ R (the set of dependency types). Given an input string w, the parser is initialized to <(), w, (), ()> and terminates when it reaches a configuration <S, (), (), A>. The three basic parsing rules are:

    Shift:  <S, n|I, T, A>    =>  <n|S, I, T, A>
    Right:  <s|S, n|I, T, A>  =>  <S, n|I, T, A ∪ {(s, r, n)}>
    Left:   <s|S, n|I, T, A>  =>  <S, s|I, T, A ∪ {(n, r, s)}>
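The loop and the three basic rules can be rendered compactly in code. The following Java sketch is our own illustration, not the authors' implementation: it handles only Shift, Right and Left, ignores dependency labels, and assumes a trained action classifier behind the ActionModel interface.

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    public class ShiftReduceSketch {
        static class Arc {
            final String head, dep;
            Arc(String head, String dep) { this.head = head; this.dep = dep; }
        }

        interface ActionModel {   // stands in for getContext + estimateAction
            String estimate(LinkedList<String> stack, LinkedList<String> input);
        }

        static List<Arc> parse(LinkedList<String> input, ActionModel model) {
            LinkedList<String> stack = new LinkedList<String>();
            List<Arc> arcs = new ArrayList<Arc>();
            while (!input.isEmpty()) {
                String a = model.estimate(stack, input);
                if (a.equals("Shift")) {
                    stack.addFirst(input.removeFirst());
                } else if (a.equals("Right")) {
                    // top becomes a child of next; next stays in the input
                    arcs.add(new Arc(input.getFirst(), stack.removeFirst()));
                } else { // Left
                    // next becomes a child of top; top is pushed back as next
                    String top = stack.removeFirst();
                    arcs.add(new Arc(top, input.removeFirst()));
                    input.addFirst(top);
                }
            }
            return arcs;
        }
    }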
We illustrate the behavior of the parser by means of an example. The basic actions (Shift, Right and Left) are applied to two neighboring tokens, top and next, displayed as yellow (light gray) and gray boxes in the figures: next is the next input token, while top is the last of the previously processed tokens accumulated on the stack S. A Right action creates a dependency relation between two neighboring words, where the top token becomes a child of the next token. Figure 1 is an example of the action Right: after applying this action, "I" becomes a child of "saw" ("I" modifies "saw").

Figure 1: An example of a Right action
Right moves top to the previous word, if one exists, while next is unchanged. This allows further Right actions to be applied to previously skipped words. Since there are none in this case, a Shift is needed. Shift does not create any dependency between the target nodes, but just advances the point of focus to the right (Figure 2).

Figure 2: An example of a Shift action

A further Shift action is now performed, advancing the input to the right and moving next to top (Figure 3).

Figure 3: An example of a Shift action
The next step is again a Right action. Notice that the target nodes are "saw" and "girl" after executing the action, keeping the right-frontier focus point unchanged (Figure 4). This is an important difference from the Yamada-Matsumoto algorithm, since it avoids a second pass over the input, ensuring that the algorithm has linear complexity.

Figure 4: An example of a Right action

The Left action constructs a dependency relation between two neighboring words where the right target node becomes a child of the left one, the opposite of the Right action. Left pops top from S and pushes it back into the input as next. Figure 5 shows an example of a Left action.

Figure 5: An example of a Left action
Note that when a Left or Right action turns a node into a dependent child, that child should be a complete subtree to which no further dependents will be added. To guarantee this, the parser must be able to see the surrounding context of the target nodes. The rest of the parsing proceeds with the sequence Shift, Shift, Shift, Right, Left, Left, Left, leading to the complete dependency tree (Figure 6).
Figure 6: The complete dependency tree
The annotated sentence for training can be written as follows:
    I SHIFT saw RIGHT a SHIFT girl SHIFT with SHIFT the SHIFT glasses RIGHT LEFT LEFT LEFT . LEFT

Notice that a different annotation would represent an alternative reading of the same sentence:

    I SHIFT saw RIGHT a SHIFT girl RIGHT LEFT SHIFT with SHIFT the SHIFT glasses RIGHT LEFT LEFT . LEFT
To handle non-projective trees we added four special actions to the parser (Right2, Left2, Extract, Insert). For example this fragment in Dutch is dealt by performing an Extract at the point after reading moeten, which takes gemaakt out of the stack and stores it on the temporary stack. Doing immediately an Insert, puts gemaakt back after worden leading to the following configuration, which can be handled by normal Shift /Reduce actions:
The parser has been trained on the 13 languages used in the CoNLL-X Shared Task [7], achieving precision scores from 70% to 88%, depending on the language. The parser also achieves an outstanding speed of over 200 sentences per second.
5 Intent Classification
The Intent Classifier is a typical machine learning classifier, consisting of a training phase and a prediction algorithm. The training phase
creates a statistical model from a collection of annotated documents. Classification involves the following steps:
1. Feature extraction: convert a document to a feature vector.

2. Multi-class classification: classify the feature vector as positive, negative or neutral sentiment polarity.

The choice of features is critical to the effectiveness of the classifier. Since intents and attitudes can be expressed in very different ways depending on the domain [1], as a preliminary step the training documents are mined to extract an intent vocabulary and frequent sub-patterns, corresponding to dependency relations that include terms from the vocabulary. A classifier is then trained on an annotated corpus using such frequent sub-patterns as features.

We use the dependency tree of a sentence to extract frequent sub-patterns as features. Such patterns are obtained by removing nodes from the tree, while preserving the constraint that no negative adverb can be discarded. This avoids extracting the same feature "great movie" from the two sentences "this is a great movie" and "not a great movie". Frequent sub-patterns are those which occur in the training corpus more often than a predetermined threshold. They are extracted through a recursive algorithm as in [1], which starts from single-node sub-trees and then expands previous sub-trees by adding single sub-nodes in depth-first order, so as to avoid generating duplicates. For classification, we use our own C++ implementation of Maximum Entropy, which is very fast in both learning and classification, and was preferred to SVMs because of both its efficiency and its ability to handle multi-class classification.
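As an illustration of the feature extraction step (our own sketch; the sub-pattern mining and the Maximum Entropy classifier themselves are omitted), a document can be mapped to a binary vector with one dimension per frequent sub-pattern:

    import java.util.List;
    import java.util.Set;

    // One binary dimension per frequent dependency sub-pattern mined
    // from the training corpus.
    public class FeatureExtractor {
        private final List<String> frequentPatterns;

        public FeatureExtractor(List<String> frequentPatterns) {
            this.frequentPatterns = frequentPatterns;
        }

        // documentPatterns: the sub-patterns extracted from one document.
        public double[] extract(Set<String> documentPatterns) {
            double[] vector = new double[frequentPatterns.size()];
            for (int i = 0; i < frequentPatterns.size(); i++) {
                vector[i] = documentPatterns.contains(frequentPatterns.get(i))
                            ? 1.0 : 0.0;
            }
            return vector;
        }
    }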
6 Opinion Mining
The opinion retrieval task at TREC 2006 involves locating blog posts that express an opinion about a given target. The target can be a "traditional" named entity – a name of a person, location, or organization – but also a concept (such as a type of technology), a product name, or an event. The TREC 2006 Blog task provides a collection of blogs for comparing and evaluating opinion mining systems. This year, however, is the
first edition of the task, so no training data are available to participants. The TREC Blog06 collection is just a collection of crawled feeds and blog pages. The answers submitted by all participants to a set of 50 topics will be pooled and judged by human experts from NIST. After the TREC 2006 conference, TREC will make available the list of all these relevance judgments; we plan to use it for training and testing our Intent Classifier.
6.1 Indexing the collection
The TREC Blog06 collection consists of over 3 million blogs collected from over 100,000 feeds crawled during a period of 3 months. The feeds are pages in RSS/Atom format. Each RSS feed represents a single channel, with metadata for title, URL, description, generator, language and a list of items. Each item contains elements such as title, URL for the content, URL for the comments, description, date, creator and category. Atom feeds use slightly different naming but contain similar metadata and items.

One major issue was how to recover the content of each blog, since the RSS 2.0 standard does not provide for the inclusion of the content in the feed itself. The 'Content' extension module allows including content within an item, but it is rarely used in the collection. Some feeds include the whole content in the description element, even though this field is meant to provide a short synopsis of the content. So in general the content must be taken from the referenced blog page. Unfortunately, blog pages are cluttered with all sorts of extra information besides the blog post and the readers' comments: pages often include annotated lists of previous posts, lists of similar related pages, navigation bars, side bars, advertising, etc. If we indexed such a page as a normal HTML page, all the text in these parts would end up in the index, leading to results with poor relevance.

To identify the proper post content within a blog page, we used three strategies. The first strategy is to use the content element from the feed, when available. To do this, we created an index for the feeds: when indexing a blog permalink, we check whether the feed it came from contains a content element and, if so, use that element as the content for indexing. The second strategy is to deal specially with blogs generated by programs which follow well-defined markup rules that allow the post's content to be identified.
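Before turning to the details of the second strategy, here is a minimal sketch of the first. The content module namespace is the standard RSS extension; the function and its fallback order are our illustration, not IXE's actual interface.

    import xml.etree.ElementTree as ET

    # Namespace of the RSS 2.0 'Content' extension module
    CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

    def item_content(item):
        # Return the post content carried by an RSS item element, if any.
        encoded = item.find("{%s}encoded" % CONTENT_NS)
        if encoded is not None and encoded.text:
            return encoded.text        # strategy 1: content is in the feed
        desc = item.find("description")
        if desc is not None and desc.text:
            return desc.text           # some feeds put the whole post here
        return None                    # otherwise fetch the blog page itself

    item = ET.fromstring(
        "<item><description>short synopsis only</description></item>")
    print(item_content(item))          # -> short synopsis only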
Of the total of 551,763 blogs, 282,982 were produced with Blogger, 101,355 with WordPress, 99,100 with LiveJournal, 9,267 with MovableType, 3,562 with Technorati, 1,869 with UserLand and 626 with FeedCreator. Each generator creates pages with a specific markup style. For instance, WordPress encloses each post within a div element of class post and the comments within an ol element with id commentlist, while Blogger uses div elements of class post and post-body for posts and a div of class comments for comments. (Posts in Spanish-language blogs instead use a differently named enclosing element.) For these most widely used content generators, we created a list of elements to be included. We exploited the features of the customizable HTML reader in IXE [6], which allows providing a list of elements, element classes or element ids to skip or to include during indexing. For instance, using these parameters in the IXE configuration file:

  IncludeElement Blogger div.post div.comments
  IncludeElement WordPress* div.post ol#commentlist

we direct IXE to limit indexing to div elements with class name post or comments for pages generated by Blogger, and to div elements with class name post or ol elements with id commentlist for pages generated by any version of WordPress. The third strategy handles the remaining cases, by excluding elements which are considered not part of the post. For example:

  ExcludeElement div.*link* div.side* div.*bar
  ExcludeElement div#header div#nav*

excludes from indexing any div whose class name contains 'link', starts with 'side' or ends with 'bar', as well as any div whose id is 'header' or starts with 'nav'. Fortunately, many content generators do indeed use markup of this kind, so that with a list of about 50 elements to exclude we avoid most of the irrelevant parts.
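The following sketch re-expresses the include-filter idea with Python's html.parser: only text inside div elements whose class matches an include pattern is kept. It is a standalone illustration; the real IXE reader is configured declaratively and handles many more cases.

    from html.parser import HTMLParser
    import fnmatch

    class PostExtractor(HTMLParser):
        # Collect text only inside div elements whose class name matches
        # one of the include patterns (wildcards allowed, as in IXE).
        def __init__(self, include=("post", "comments")):
            super().__init__()
            self.include = include
            self.depth = 0      # div-nesting level inside an included div
            self.text = []

        def handle_starttag(self, tag, attrs):
            if tag != "div":
                return
            classes = (dict(attrs).get("class") or "").split()
            if self.depth:
                self.depth += 1
            elif any(fnmatch.fnmatch(c, p)
                     for c in classes for p in self.include):
                self.depth = 1

        def handle_endtag(self, tag):
            if tag == "div" and self.depth:
                self.depth -= 1

        def handle_data(self, data):
            if self.depth:
                self.text.append(data)

    p = PostExtractor()
    p.feed('<div class="sidebar">ads</div><div class="post">great movie!</div>')
    print("".join(p.text))    # -> great movie!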
Another problem is the presence of spam blogs, also called splogs, i.e. fake blog pages which contain advertising or other irrelevant content used just to promote affiliated sites, often disreputable ones. We detected in the collection, for instance, a large number of splogs from the domain blogspot.com, which hosts a free blog posting service by Google. To avoid splogs, we used a black list of URLs from http://www.splogspot.com/. Any blog from that list is assigned a page rank of 0 during indexing, so that it will not normally appear in search results.

Finally, pages are written in several languages, but only English blogs are considered relevant according to the TREC blog track guidelines. RSS includes a metadata field for language, but it is not used consistently and hence is quite inaccurate. One could apply a language detector, but on blog posts, which are often quite short, this method too is not sufficiently accurate.

To help find opinionated blogs, we enriched the index with tags for words, i.e. the index contains not only words but also an overlay of tags for each word. One such tag is the OPINIONATED tag, which is associated with subjective words considered to carry an opinion bias. We tagged as opinionated a subset of the words in SentiWordNet [9]. SentiWordNet was created from WordNet, starting from two seed sets of positive and negative terms, expanded by means of synonyms, antonyms and other semantic relations. Subjective terms were then represented as feature vectors, consisting of the terms in their descriptions and glosses, and used to train a statistical classifier. All words in WordNet were classified, producing SentiWordNet, a list of 115,341 words marked with positive and negative orientation scores ranging from 0 to 1. We extracted from SentiWordNet a subset of 8,427 opinionated words, by selecting those whose orientation strength is above a threshold of 0.4. The overlay allows performing proximity searches of the type:

  content matches proximity 6 [OPINIONATED:* ’George Bush’]

which returns all documents where any (i.e. *) opinionated word occurs within 6 terms of the phrase George Bush. We plan to refine the approach by exploiting an English parser, in order to detect whether the opinionated term indeed refers to Bush, rather than to another entity in the same sentence.
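A sketch of the selection step, assuming entries of the form (word, positive score, negative score); real SentiWordNet files group scores by synset, so an aggregation step would precede this.

    def opinionated_words(entries, threshold=0.4):
        # Keep words whose positive or negative orientation strength
        # exceeds the threshold; these receive the OPINIONATED tag.
        return {w for w, pos, neg in entries if max(pos, neg) > threshold}

    sample = [("great", 0.75, 0.0), ("movie", 0.0, 0.0), ("awful", 0.0, 0.875)]
    print(opinionated_words(sample))   # -> {'great', 'awful'}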
6.2 Results
We performed a few experiments using the TREC 2006 Blog topics, numbered 851 to 900. These topics range from controversial or beloved political figures (e.g. Abramoff, Bush, Ann Coulter), to performers (Jon
Stewart), to movies or TV shows (March of the Penguins, Arrested Development, Life on Mars, Oprah Winfrey), to products (MacBook Pro, Blackberry, Shimano), to political subjects (nuclear power, jihad). Each topic consists of a title, plus a description and a narrative. For example, topic 851 is as follows:
"March of the Penguins" <desc>Provide opinion of the film documentary "March of the Penguins". Relevant documents should include opinions concerning the film documentary "March of the Penguins". Articles or comments about penguins outside the context of this film documentary are not relevant. We performed a baseline run with queries made just from title words joined in AND. A second run used the same words but added a proximity operator with distance 6 to an opinionated word. The third run used an AND combination of title words plus an OR of description words. For the fourth run we used queries made from title words within proximity 6 from opinionated words plus an OR of description words. We performed an informal judgment of the results of these runs and obtained the following estimates for average precision@5, i.e. the precision of the first five results averaged on all topics:
  run                               relevant results   precision@5
  title                                          573         0.100
  title + proximity                              483         0.275
  title + description                            537         0.250
  title + proximity + description                450         0.637
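As an illustration of how the four runs can be composed, the sketch below builds query strings around the proximity syntax shown earlier. The AND/OR combinators are assumed to follow ordinary boolean query syntax; the actual IXE query grammar may differ.

    def title_query(title_words):
        # run 1: title words joined in AND
        return " AND ".join(title_words)

    def proximity_query(title_words, distance=6):
        # runs 2 and 4: any opinionated word near the title phrase
        phrase = " ".join(title_words)
        return "content matches proximity %d [OPINIONATED:* '%s']" % (
            distance, phrase)

    def with_description(query, desc_words):
        # runs 3 and 4: add an OR of description words
        return "(%s) AND (%s)" % (query, " OR ".join(desc_words))

    title = ["March", "of", "the", "Penguins"]
    desc = ["film", "documentary"]
    print(title_query(title))                              # run 1
    print(proximity_query(title))                          # run 2
    print(with_description(title_query(title), desc))      # run 3
    print(with_description(proximity_query(title), desc))  # run 4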
Results are quite preliminary, but they seem to indicate that the opinionated-word analysis provides a significant improvement over both title and title + description queries.
7 Intent Retrieval
The Intent Classifier will be used as part of an intent retrieval engine, which uses a specialized Passage Retrieval system to retrieve candidate sentences about a target. The Passage Retrieval system [5] supports keyword searches based on a traditional inverted word-document index, as well as searches for opinionated words as described above, but it returns the passages where the keywords occur rather than whole documents. The Passage Retrieval index also contains annotations about Named Entities, and a query can specify that a term must denote a named entity. This allows distinguishing, for instance, the term 'Apple' used to mean an organization rather than a fruit.

Each returned candidate sentence is given a rank computed by combining a classical IR similarity metric (PL2, a probabilistic ranking model comparable to the well-known BM25) with a metric based on the proximity of the target. Since at the moment the system does not perform anaphora resolution, the target may not occur in a candidate sentence, being referred to only through a pronoun or implicitly. We handle anaphora in a crude way, by scoring the distance between the candidate sentence and the nearest sentence where the target occurs.

The opinion mining task is performed by retrieving from the Passage Retrieval system the sentences that contain the given target, tagged as an entity, in proximity to an opinionated word. Each retrieved sentence is parsed, and a set of sub-patterns is extracted as features and used to classify the sentence. The result consists of a list of sentences with associated notes about the presence of opinions about the given target and their polarity.
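A sketch of the ranking combination under assumed weights: the IR similarity score is blended with a proximity score that decays with the distance, in sentences, from the nearest occurrence of the target. The linear blend and its weight are our assumptions, not the system's actual formula.

    def rank(similarity, sentence_distance, alpha=0.7):
        # similarity: IR score of the passage (e.g. from PL2);
        # sentence_distance: 0 if the target occurs in the candidate
        # sentence itself, otherwise the distance to the nearest
        # sentence where it occurs (crude anaphora handling).
        proximity = 1.0 / (1 + sentence_distance)
        return alpha * similarity + (1 - alpha) * proximity

    print(rank(0.8, 0))   # target occurs in the sentence itself
    print(rank(0.8, 2))   # target referred to two sentences away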
8 Conclusions
We have presented the design of an opinion retrieval engine which extracts candidate sentences containing an opinion, using a specialized search index that maintains per-word annotations denoting whether a word is opinionated or part of a Named Entity. Each candidate sentence is then passed to an Intent Classifier to assess whether it indeed contains an opinion on the requested subject. We propose to build an Intent Classifier which uses relations extracted from the dependency parse tree of a sentence to compute features for classification. We sketched the technology of dependency parsing
and a multi-language parser whose performance is suitable for large-scale deployment. We presented the opinion search engine which we built for the TREC 2006 Blog Task and which can be used as the search component of the opinion retrieval system. Preliminary results from our experiments are promising, showing improvements in precision when exploiting annotations on opinionated words.

Possible enhancements to the system include incorporating an anaphora resolution mechanism, so that each sentence will be indexed also with the entities referred to by anaphora. It would also be worthwhile to use multiple classifiers, one for each intent, so that the retrieval system could mark each result with a list of the expressed intents, or alternatively retrieve documents expressing a given intent. Finally, it is worth considering moving intent classification to indexing time and recording the outcome as metadata associated with each sentence. The benefit would be a reduction in search time, while the difficulty would be identifying all possible targets present or referred to in each sentence.
Acknowledgements We thank Andrea Esuli and Fabrizio Sebastiani for making available to us the list of opinionated words they produced from WordNet.
Bibliography

[1] K. Abe, S. Kawasoe, T. Asai, H. Arimura, S. Arikawa [2002] Optimized substructure discovery for semi-structured data. In Proc. of 6th PKDD, pp. 1-14.
[2] A. Aue, M. Gamon [2005] Customizing Sentiment Classifiers to New Domains: a Case Study. In RANLP-05, the International Conference on Recent Advances in Natural Language Processing.
[3] G. Attardi [2006] Experiments with a Multilanguage non-projective dependency parser. In Proc. of the Tenth CoNLL.
[4] G. Attardi, S. Di Marco, D. Salvi [1998] Categorisation by context. Journal of Universal Computer Science, 4(9), 719-736.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari [2001] PIQASso: PIsa Question Answering System. In Proceedings of the Text Retrieval Conference (TREC-10), 599-607, NIST, Gaithersburg (MD), November 13-16.
[6] G. Attardi, A. Cisternino [2001] Template Metaprogramming an Object Interface to Relational Tables. In Reflection 2001, LNCS 2192, 266-267, Springer-Verlag, Berlin.
[7] S. Buchholz et al. [2006] CoNLL-X Shared Task on Multilingual Dependency Parsing. In Proc. of the Tenth CoNLL.
[8] S. Dumais, J. Platt, D. Heckerman, M. Sahami [1998] Inductive learning algorithms and representations for text categorization. In Proceedings of CIKM, 148-155.
[9] A. Esuli, F. Sebastiani [2005] Determining the semantic orientation of terms through gloss classification. In CIKM 2005, 617-624.
[10] S. Matsumoto, H. Takamura, M. Okumura [2005] Sentiment Classification Using Word Sub-sequences and Dependency Sub-trees. In T.B. Ho, D. Cheung, H. Li (eds), Proceedings of PAKDD'05, the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, LNCS, vol. 3518.
[11] H. Yamada, Y. Matsumoto [2003] Statistical Dependency Analysis with Support Vector Machines. In Proc. of IWPT.