Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering Editorial Board Ozgur Akan Middle East Technical University, Ankara, Turkey Paolo Bellavista University of Bologna, Italy Jiannong Cao Hong Kong Polytechnic University, Hong Kong Falko Dressler University of Erlangen, Germany Domenico Ferrari Università Cattolica Piacenza, Italy Mario Gerla UCLA, USA Hisashi Kobayashi Princeton University, USA Sergio Palazzo University of Catania, Italy Sartaj Sahni University of Florida, USA Xuemin (Sherman) Shen University of Waterloo, Canada Mircea Stan University of Virginia, USA Jia Xiaohua City University of Hong Kong, Hong Kong Albert Zomaya University of Sydney, Australia Geoffrey Coulson Lancaster University, UK
69
Martin Szomszor Patty Kostkova (Eds.)
Electronic Healthcare Third International Conference, eHealth 2010 Casablanca, Morocco, December 13-15, 2010 Revised Selected Papers
Volume Editors Martin Szomszor Patty Kostkova City eHealth Research Centre (CeRc) School of Community and Health Sciences City University, London, EC1V 0HB, UK E-mail: {martin.szomszor.1; patty}@city.ac.uk
ISSN 1867-8211 ISBN 978-3-642-23634-1 DOI 10.1007/978-3-642-23635-8
e-ISSN 1867-822X e-ISBN 978-3-642-23635-8
Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011935341 CR Subject Classification (1998): K.4, I.2, J.3-4, H.4, C.2, H.5
© ICST Institute for Computer Science, Social Informatics and Telecommunications Engineering 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
It is our great pleasure to introduce the special issue of LNSV compiled from scientific presentations at the Third International ICST Conference on Electronic Healthcare for the 21st Century (eHealth 2010) that took place in Casablanca, Morocco, during December 13–15, 2010. Building on the very successful First and Second eHealth conferences, held in London and Istanbul, respectively, the aim of eHealth 2010 was to bring together experts from academia, industry and global healthcare institutions, such as the WHO and ECDC, to stimulate cutting-edge research discussions, share experience with real-world healthcare service providers and policy makers, as well as provide numerous business and networking opportunities. It is reassuring to see that in 3 years, this event has established itself as the key annual conference in the domain of eHealth, with a fast-growing professional and scientific community. For the first time, eHealth was co-located with two well-established workshops: the 5th International Workshop on Personalization for eHealth (Pers4eHealth2010) and the 6th Workshop on Agents Applied in Healthcare (A2HC2010). We accepted 30 full technical presentations by speakers from all over the world, having received over 70 submissions in total. These, along with 12 papers from the 2 collocated workshops, appear in these proceedings and cover a wide range of topics including: Web intelligence; privacy, trust and security; ontologies and knowledge management; eLearning and education; Web 2.0 and online communities of practice; and performance monitoring and evaluation frameworks for healthcare. We also had the privilege of hosting two prestigious keynote speakers. Najeeb Al Shorbaji from the World Health Organization delivered a talk on global eHealth challenges and Mathew Swindells, the Health Chair of the British Computer Society, presented his experiences designing and delivering large-scale ICT healthcare solutions. 
Both talks stimulated some interesting discussions, with many participants making real-time comments via Twitter using the hash-tag #ehealth2010. In addition to our two keynote speakers, we invited representatives from a number of commercial organizations, hospitals, and public health bodies, including Frank Wartena (Philips Research Europe), Frederic Lievens (Med-e-Tel, Luxembourg), Jeremy Nettle (Oracle Corporation), Neill Jones (First Databank UK), Maurice Mars (University of KwaZulu-Natal, South Africa), Alberto E. Tozzi (Bambino Gesu Hospital, Italy), Erik van der Goot (European Commission—Joint Research Centre, Italy), and Salwa Rafee (Global Healthcare & Life Sciences Management Team at IBM, France).
To encourage discussion and sharing of ideas and opinions between participants, two panels were held at eHealth 2010. Laszlo Balkanyi (ECDC, Stockholm, Sweden) organized the ePublicHealth Data Interoperability panel, which explored the possibility of an agenda that would improve the visibility and usability of information provided by EU public health bodies through interoperability of services and the adoption of shared standards. Femida Gwadry-Sridhar (Lawson Health Research Institute, Canada) chaired the Evidence-Based eHealth Applications panel, where the utility of novel Web technologies (such as social media) was debated, in particular how best to provide users with access to the most accurate and up-to-date information possible. Furthermore, eHealth 2010 held a joint Poster and Demo session (chaired by Ed de Quincey and Gayo Diallo), an informal session on the first evening of the conference that allowed participants to demonstrate and share new ideas. Ten posters and eight demonstrations were accepted and provided a good catalyst for discussion among the participants. We would like to thank everyone who contributed to making eHealth 2010 such a success: the authors of all submitted papers, the speakers, the invited and keynote presenters, the Programme Committee members, reviewers and Session Chairs and, above all, the Local Chair, Hassan Ghazal, together with his student volunteers and local organizers. Finally, we thank ICST and Create-Net for sponsoring the event, and Springer for publishing this LNSV book.
Martin Szomszor
Patty Kostkova
Organization Third International ICST Conference on Electronic Healthcare Steering Committee Chair Imrich Chlamtac
President Create-Net Research Consortium
Steering Committee Scientific Co-chairs Patty Kostkova Muttukrishnan Rajarajan
City eHealth Research Centre, City University London, UK Mobile Networks Research Centre, City University London, UK
General Co-chairs Martin Szomszor Harini Kulatunga
City eHealth Research Centre, City University London, UK Logica Healthcare Consulting, UK
Clinical Chair Femida Gwadry-Sridhar
Lawson Health Research Institute, Canada
EU Public Health Chair Laszlo Balkanyi
ECDC, Stockholm, Sweden
Poster Chair Ed de Quincey
University of Greenwich, UK
Demo Chair Gayo Diallo
LESIM/ISPED, University of Bordeaux 2, France
Local Chair Hassan Ghazal
University Mohammed First, Morocco
Conference Co-ordinator Gergely Nagy
ICST
Technical Programme Committee
Anne Adams (The Open University, Milton Keynes, UK)
Dimitra Alexopoulou (Technische Universität Dresden, Germany)
Elske Ammenwerth (UMIT, Austria)
Bill Andreopoulos (Technische Universität Dresden, Germany)
Ricardo Baeza-Yates (Yahoo! Research Labs, Barcelona, Spain)
László Balkányi (ECDC, Stockholm, Sweden)
Ayşe Bener (Bogazici University, Turkey)
Olivier Bodenreider (National Institutes of Health, U.S. Department of Health and Human Services, USA)
Albert Burger (Heriot-Watt University, UK)
Juan Chia (Worldwide Clinical Trials, London, UK)
Vittoria Colizza (Institute for Scientific Interchange Foundation, Italy)
Olivier Corby (Institut National De Recherche en Informatique et en Automatique, France)
Ulises Cortes (Technical University of Catalonia, Spain)
Ed De Quincey (University of Greenwich, UK)
Gayo Diallo (Laboratory of Applied Computer Science LISI-ENSMA (FUTUROSCOPE Poitiers), France)
Charles Doarn (University of Cincinnati, USA)
Jonathan Elford (City University London, UK)
Floriana Grasso (University of Liverpool, UK)
Femida Gwadry-Sridhar (University of Western Ontario, Canada)
David Hansen (eHealth Research Centre, Australia)
Jesse Hoey (University of Dundee, UK)
Alexander Hörbst (University for Health Sciences, Medical Informatics and Technology, Austria)
Gawesh Jawaheer (City eHealth Research Centre, City University London, UK)
Malina Jordanova (Solar-Terrestrial Influences Institute, Bulgaria)
Simon Jupp (University of Manchester, UK)
Harald Korb (Vitaphone GmbH, Germany)
Patty Kostkova (City eHealth Research Centre, City University London, UK)
Harini Kulatunga (Logica Healthcare Consulting, UK)
Panayiotis Kyriacou (City University London, UK)
Shirle Large (NHS Direct, Hampshire, UK)
Lisa Lazareck (City eHealth Research Centre, City University London, UK)
Panos Liatsis (City University London, UK)
Frederic Lievens (Med-e-Tel, Luxembourg)
Cecil Lynch (UC Davis School of Medicine, USA)
Julie Maitland (National Research Council, Canada)
Corinne Marsolier (Internet Business Solutions Group, Cisco Europe)
Maria G. Martini (Kingston University, UK)
Kenneth McLeod (Heriot-Watt University, UK)
Henning Müller (University of Applied Sciences Western Switzerland)
Chris Nugent (University of Ulster, UK)
Venet Osmani (Ubiquitous Interaction Group (UBiNT), CREATE-NET, Italy)
Daniela Paolotti (Institute for Scientific Interchange Foundation, Italy)
George Polyzos (Athens University of Economics and Business, Greece)
Rob Procter (University of Edinburgh, UK)
Muttukrishnan Rajarajan (Mobile Networks Research Centre, City University London, UK)
Dietrich Rebholz-Schuhmann (European Bioinformatics Institute, Wellcome Trust Genome Campus, UK)
Blaine Reeder (University of Washington, USA)
David Riaño (Universitat Rovira i Virgili, Spain)
Mike Santer (University of Southampton, UK)
Heiko Schuldt (University of Basel, Switzerland)
Tacha Serif (Yeditepe University, Turkey)
Martin Szomszor (City eHealth Research Centre, City University London, UK)
Adel Taweel (King's College London, UK)
Alexey Tsymbal (Siemens AG, Germany)
Asli Uyar Özkaya (Bogazici University, Turkey)
Erik van der Goot (European Commission's Joint Research Centre, Italy)
Jan Vejvalka (Charles University, Czech Republic)
Dasun Weerasinghe (City eHealth Research Centre, City University London, UK)
Peter Weller (City University London, UK)
Yeliz Yesilada (The University of Manchester, UK)
Jennifer Zelmer (International Health Terminology Standards Development Organisation, Denmark)
Jana Zvarova (EuroMISE, The Academy of Sciences of the Czech Republic)
5th International Workshop on Personalization for eHealth
Organizing Committee
Floriana Grasso (University of Liverpool, UK)
Cécile Paris (CSIRO, Sydney, Australia)
Technical Programme Committee
Giuseppe Carenini (University of British Columbia, Canada)
Ulises Cortes (Technical University of Catalonia UPC, Spain)
Nadja de Carolis (University of Bari, Italy)
Reva Freedman (Northern Illinois University, USA)
Nancy Green (University of North Carolina Greensboro, USA)
Patty Kostkova (City eHealth Research Centre, City University London, UK)
Robin Cohen (University of Waterloo, Canada)
Tze Yun Leong (National University of Singapore)
Peter Lucas (University of Nijmegen, The Netherlands)
Wendy Moncur (University of Aberdeen, UK)
Antonio Moreno (Universitat Rovira i Virgili, Spain)
Sara Rubinelli (University of Lucerne and Swiss Paraplegic Research, Switzerland)
6th Workshop on Agents Applied in Healthcare
Organizing Committee
Antonio Moreno (Universitat Rovira i Virgili, Spain)
Ulises Cortes (Technical University of Catalonia, Spain)
Roberta Annicchiarico (IRCCS - Istituto di Ricovero e Cura a Carattere Scientifico, Italy)
Magi Lluch-Ariet (MicroArt)
David Isern (Universitat Rovira i Virgili, Spain)
Technical Programme Committee
Martin Beer (University of Sheffield Hallam, UK)
David Isern (Universitat Rovira i Virgili, Spain)
Patty Kostkova (City eHealth Research Centre, City University London, UK)
Lenka Lhotska (Czech Technical University, Czech Republic)
Julian Padget (University of Bath, UK)
Aida Valls (Universitat Rovira i Virgili, Spain)
Laszlo Zsolt Varga (MTA SZTAKI, Hungary)
Javier Vazquez-Salceda (Technical University of Catalonia, Spain)
Table of Contents
Session 1: Epidemic Intelligence
An Authoring Framework for Security Policies: A Use-Case within the Healthcare Domain
    Thomas Trojer, Basel Katt, Florian Wozak, and Thomas Schabetsberger
Detecting Public Health Indicators from the Web for Epidemic Intelligence
    Avaré Stewart, Marco Fisichella, and Kerstin Denecke
#Swineflu: Twitter Predicts Swine Flu Outbreak in 2009
    Martin Szomszor, Patty Kostkova, and Ed de Quincey

Session 2: Data Representation and Knowledge Management
Identifying Breast Cancer Concepts in SNOMED-CT Using Large Text Corpus
    Zharko Aleksovski and Merlijn Sevenster
Towards Knowledge Oriented Personal Health Systems
    Juha Puustjärvi and Leena Puustjärvi

Session 3: Training and ICT Infrastructure: International Perspective
Evaluation of Popularity of Multi-lingual Educational Web Games – Do All Children Speak English?
    Dasun Weerasinghe, Lisa Lazareck, Patty Kostkova, and David Farrell

Session 4: Online Communities of Practice
Integrating Consumer-Oriented Vocabularies with Selected Professional Ones from the UMLS Using Semantic Web Technologies
    Elena Cardillo, Genaro Hernandez, and Olivier Bodenreider
Evaluation of a Semantic Web Application for Collaborative Knowledge Building in the Dementia Domain
    Helena Lindgren and Peter Winnberg
Impacts of a Web-Based System on a Distributed Clinical Community of Practice
    Marie Gustafsson Friberger

Session 5: Clinical Decision Support Systems
Modeling Healthcare Processes in BPEL: A Colon Cancer Screening Case Study
    Roberta Cucino and Claudio Eccher
Improvements in Data Quality for Decision Support in Intensive Care
    Filipe Portela, Marta Vilas-Boas, and Manuel Filipe Santos
Predicting Sepsis: A Comparison of Analytical Approaches
    Femida Gwadry-Sridhar, Ali Hamou, Benoit Lewden, Claudio Martin, and Michael Bauer

Session 6: Knowledge Dissemination and Training
Promoting e-Health Resources: Lessons Learned
    Ed de Quincey, Patty Kostkova, and Gawesh Jawaheer
Between Innovation and Daily Practice in the Development of AAL Systems: Learning from the Experience with Today’s Systems
    Shirley Beul, Lars Klack, Kai Kasugai, Christian Moellering, Carsten Roecker, Wiktoria Wilkowska, and Martina Ziefle
The FEM Wiki Project: A Conversion of a Training Resource for Field Epidemiologists into a Collaborative Web 2.0 Portal
    Patty Kostkova and Martin Szomszor

Session 7: ICT Healthcare Architectures
Logica’s eCareLogic: A Service Oriented Architecture for Connected Health
    Harini Kulatunga
Collaborative Encoding of Asbru Clinical Protocols
    Marco Rospocher, Claudio Eccher, Chiara Ghidini, Rakebul Hasan, Andreas Seyfang, Antonella Ferro, and Silvia Miksch
Session 8: Social Structures
eHealth Living Lab Micro Innovation Strategy: A Case Study of Prototypes through Co-creation
    Josep Ma. Monguet, Marco Ferruzca, Joaquín Fernández, and Eduardo Huerta
Personal Health Records among Institutions, Medical Records, and Patient Wisdom: A Socio-technical Approach
    Barbara Purin and Enrico Maria Piras

Session 9: User Support and Socio-economic Issues in Healthcare
Economic Viability of eCare Solutions
    Jan Van Ooteghem, Ann Ackaert, Sofie Verbrugge, Didier Colle, Mario Pickavet, and Piet Demeester
Gender-Specific Kansei Engineering: Using AttrakDiff2
    Bianka Trevisan, Anne Willach, Eva-Maria Jakobs, and Robert Schmitt
Accounting for User Diversity in the Acceptance of Medical Assistive Technologies
    Sylvia Kowalewski, Wiktoria Wilkowska, and Martina Ziefle

Session 10: Delivery and Monitoring Platforms
Experimental Evaluation of IEEE 802.15.4/ZigBee for Multi-patient ECG Monitoring
    Helena Fernández-López, José H. Correia, Ricardo Simões, and José A. Afonso
Wearable Sensor Networks for Measuring Face-to-Face Contact Patterns in Healthcare Settings
    Alain Barrat, Ciro Cattuto, Vittoria Colizza, Lorenzo Isella, Caterina Rizzo, Alberto E. Tozzi, and Wouter Van den Broeck

Session 11: Security, Trust and Privacy
A Note on the Security in the Card Management System of the German E-Health Card
    Marcel Winandy
Towards a Framework for Privacy Preserving Medical Data Mining Based on Standard Medical Classifications
    Aurélien Faravelon and Christine Verdier
Session 12: Modelling and Evaluation of Healthcare ICT System
On the Usage of SAML Delegate Assertions in an Healthcare Scenario with Federated Communities
    Massimiliano Masi and Roland Maurer
An Event-Based, Role-Based Authorization Model for Healthcare Workflow Systems
    Vassiliki Koufi, Flora Malamateniou, Eleni Mytilinaiou, and George Vassilacopoulos
Hospital Engineering
    Thomas Lux

Workshop: 5th International Workshop on Personalisation for eHealth
A Model for Interaction Design of Personalised Knowledge Systems in the Health Domain
    Helena Lindgren and Peter Winnberg
Conceptual Design of a Personalised Tool for Remote Preanaesthesia Evaluation: A User-Centred Approach
    Polyxeni Vassilakopoulou, Vassilis Tsagkas, and Nicolas Marmaras
Privacy in Commercial Medical Storage Systems
    Mehmet Tahir Sandıkkaya, Bart De Decker, and Vincent Naessens
A Portal to Promote Healthy Living within Families
    Nathalie Colineau and Cécile Paris
e-MomCare: A Personalised Home-Monitoring System for Pregnancy Disorders
    Marina Velikova, Peter J.F. Lucas, and Marc Spaanderman

Workshop: 6th Workshop on Agents Applied in Healthcare
The MOSAIC System – A Clinical Data Exchange System with Multilateral Agreement Support
    Magí Lluch-Ariet and Josep Pegueroles-Valles
Agent-Based Careflow for Patient-Centred Palliative Care
    Ji Ruan, Wendy MacCaull, and Heather Jewers
To Share or Not to Share SHARE-it: Lessons Learnt
    Roberta Annicchiarico and Ulises Cortés
iTutorials for the Aid of Cognitively Impaired Elderly Population
    Carolina Rubio, Roberta Annicchiarico, Cristian Barrué, Ulises Cortés, Miquel Sánchez-Marré, and Carlo Caltagirone
An Agent Framework for the Analysis of Streaming Physiological Data
    Ruairi D. O’Reilly, Philip D. Healy, John P. Morrison, and Geraldine B. Boylan
ALIVE Meets SHARE-it: An Agent-Oriented Solution to Model Organisational and Normative Requirements in Assistive Technologies
    Ignasi Gómez-Sebastià, Dario Garcia-Gasulla, Cristian Barruè, Javier Vázquez-Salceda, and Ulises Cortés
Support-Based Distributed Optimisation: An Approach to Radiotherapy Scheduling
    Graham Billiau, Chee Fon Chang, Aditya Ghose, and Andrew Miller

Author Index
An Authoring Framework for Security Policies: A Use-Case within the Healthcare Domain
Thomas Trojer¹, Basel Katt¹, Florian Wozak², and Thomas Schabetsberger²
¹ Research Group Quality Engineering, University of Innsbruck, Austria, {thomas.trojer,basel.katt}@uibk.ac.at
² ITH-icoserve GmbH, Innsbruck, Austria, {florian.wozak,thomas.schabetsberger}@ith-icoserve.com
Abstract. Traditionally, the definition and maintenance of security and access control policies has been the exclusive task of system administrators or security officers. In modern distributed and heterogeneous systems, there is a need to allow different stakeholders to create and edit their security and access control preferences. To solve this problem, two main challenges need to be met. First, authoring tools with different user interfaces should be designed and adapted to match the domain background and degree of expertise of each stakeholder. For example, policy authoring tools for a patient or a doctor should be user-friendly and free of technical details, while those for security administrators can be more sophisticated and contain more details. Second, conflicts that can arise among security policies defined by different stakeholders must be handled by these authoring tools at runtime. Furthermore, warnings and assisting messages must be provided to help users define correct policies and avoid potential security risks. Towards meeting these challenges, we propose an authoring framework for security policies. This framework enables building authoring tools that take into consideration the views of different stakeholders.
Keywords: Security policy, EHR, Policy authoring, Usability, Model-driven engineering.
1 Introduction
Recent research and applications in the field of patient health data management show a clear trend towards maintaining data electronically. Medical institutions, such as hospitals or private practitioners, extend (or establish) their IT infrastructures to cope with proposed architectures and governmental regulations regarding the storage and distribution of electronic health records (EHRs). Distributing patients’ medical data, or providing shared access to it, is commonly understood to lower healthcare costs and increase the efficiency of individual medical
This work was partially supported by the Austrian Federal Ministry of Economy as part of the Laura-Bassi – Living Models for Open Systems – project FFG 822740/QE LaB, see http://lab.q-e.at/
M. Szomszor and P. Kostkova (Eds.): E-Health 2010, LNICST 69, pp. 1–9, 2011. © Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2011
treatments [1]. An example of an EHR implementation is the sense infrastructure developed by ITH-icoserve. ITH-icoserve technology for health-care GmbH, a subsidiary company of a regional hospital holding company, Tiroler Landeskrankenanstalten GmbH (TILAK), and Siemens AG, has developed systems supporting cooperative care within Austria. The electronic health records in this scenario are aligned to the specifications of an Austrian governmental company, ELGA GmbH (Elektronische lebensbegleitende Gesundheitsakte, "electronic lifelong health record"), which proposes the interlinked but distributed storage of electronically acquired health data of patients. In our previous work [6,7,8] we dealt with security mechanisms on the infrastructure level. In this work, on a more conceptual level, we identify sources for patient and clinician authentication as a prerequisite to allowing or denying requests and to auditing (im)proper behavior. Access control policies are put in place to cope with governmental regulations on who is allowed to access what kind of health record under which circumstances. An example of such a policy in Austria is that no physician is allowed to access a record she/he created about a patient if the treatment was given more than 28 days ago. This strict rule is relaxed under certain conditions, e.g., if the patient intends to make her/his medical data available for a longer period, or if emergency access is required. Our project goal is to design a framework which provides different stakeholders with security authoring tools that offer a high degree of usability. Furthermore, these authoring tools are designed to analyse security measures and to generate enforceable security artifacts, which are later deployed within the security infrastructure of the system.
Analysis thereby aims at ensuring consistency by resolving conflicts among active policies defined by different stakeholders, and at enhancing the quality and correctness of defined policies by warning users about potential security risks when new policies are created. Access to medical data is important to many stakeholders within the healthcare domain, but it substantially increases the risk to privacy and security. The type of security we focus on is patient-controlled dynamic access control: patient-controlled and dynamic because patients express their personal privacy preferences about their identifying medical records at any time and of their own free will. Patient-controlled declaration of access control regulations is a useful tool to support a patient’s desire for self-determination about the usage of her/his private data. It is also a challenging task that raises several open questions. We have to deal with potential data risks, since non-IT experts define privacy and security measures. This in turn leads to a discussion of the extent to which patients shall be allowed to determine privacy aspects regarding their healthcare data. Furthermore, authoring tools have to offer functionality which guides and helps patients who are inexperienced and face difficulties when creating their privacy rules. Thus, the usability of the authoring platform has to be considered. This can be partially done by the development team, but mainly usability has to be evaluated in empirical studies including patients, physicians, security experts and IT experts. Finally, general information and clarification on the purpose of patient-controlled
authoring of access control policies has to be communicated to the general public. This is especially needed to gain the required public acceptance of such authoring tools and trust in the method itself as a chance for everybody to make use of the personal right of informational self-determination. The framework is generic, but we will evaluate its feasibility within a use-case scenario taken from the healthcare domain. Additional knowledge is therefore provided by our collaborating industrial partner, ITH-icoserve GmbH.
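The 28-day rule and its two exceptions described in this introduction can be sketched as a small access check. This is only an illustration under stated assumptions, not the implementation used by the authors or by ITH-icoserve: the names `AccessRequest`, `is_access_permitted` and `patient_extended_window` are hypothetical, a patient preference is assumed to only lengthen the default window, and access on exactly the 28th day is assumed to be permitted.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Default window taken from the rule described in the text.
DEFAULT_ACCESS_WINDOW = timedelta(days=28)


@dataclass
class AccessRequest:
    """Hypothetical request of a physician for a record she/he created."""
    treatment_date: date
    request_date: date
    is_emergency: bool = False
    # Optional patient preference that lengthens the default window (assumption).
    patient_extended_window: Optional[timedelta] = None


def is_access_permitted(request: AccessRequest) -> bool:
    """Return True if the physician may access the record."""
    # Exception 1: emergency access overrides the time limit.
    if request.is_emergency:
        return True
    # Exception 2: the patient may make the data available for a longer period.
    window = DEFAULT_ACCESS_WINDOW
    if request.patient_extended_window is not None:
        window = max(window, request.patient_extended_window)
    # Default rule: deny if the treatment lies further back than the window.
    return request.request_date - request.treatment_date <= window
```

An authoring tool in the sense of this paper would let the patient set the extended window through a non-technical interface, while the check itself would be generated as an enforceable artifact rather than written by hand.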
2 Related Work
Different security measures have been proposed to protect electronic healthcare environments. Given our focus, we are especially interested in authorization to secure access to electronic healthcare records of patients. The healthcare domain has specific requirements for access control. Patients, for example, should be allowed to declare access control regarding their identifying health data [1]. In [11], requirements and an initial model for patient-controlled access control using Role-based Access Control (RBAC) [12] are presented. We build upon that work but extend the access control model to cope with a variety of other requirements in our context. Additionally, the work in [4] discusses access control for medical records maintained by electronic information systems. The authors proposed, similar to our work, several (abstract) models which define concepts of security related to the healthcare domain. Furthermore, delegation of access rights plays an important role, e.g., to allow medical institutions and research laboratories to conduct studies in medical sciences. Of course, such transitive access rights have to be purpose- and obligation-bound, and they have to be accompanied by protection of the corresponding data through pre-enforced data privacy measures. For example, [13] describes a method to anonymize and share data from different sources to conduct mining and analysis. This might be just one of many methods which can be employed together with delegation mechanisms. Regarding usability issues in security policy authoring tools, a body of work has already been published. In [2] the authors propose the automatic generation of security-aware event-driven graphical user interfaces by mapping RBAC entities to events triggered in order to access application resources. Similar to this approach, we employ an extended policy model and deal with the authoring of corresponding policy artifacts, as proposed in [5].
Our approach, on the other hand, covers more than just the policy authoring tools themselves, as we design a framework which incorporates arbitrary types of security policies and enterprise domains. The final outcome is a set of security policy authoring tools specifically tailored to the domain they are designed for. Finally, usability plays an important role in the acceptance and success of security policy authoring tools. For example, in [10] the authors focus on usability in the context of authorization methods. They defined several usability challenges with which policy authoring tools have to cope. Furthermore, a user study was conducted to
evaluate their implementation of a policy authoring tool, namely the SPARCLE policy workbench.
3 Modeling Approach
In order to achieve a usable method for patients to modify their access control preferences regarding personal medical data, we identify a variety of issues that must be taken into consideration. First and foremost, security requirements have to be identified. These are then reflected by appropriate security models (in our context, e.g., an access control policy model). An access control policy model provides a domain- and platform-independent representation for expressing authorization concepts. Such a model only covers the generic concepts of how statements of a concrete policy are defined and therefore represents a policy specification. We can use such a specification to evaluate the conformance of any concrete policy we maintain. A policy model contains security concepts independent of the domain, but these can be mapped to domain entities with certain properties. In the case of RBAC we would, e.g., infer properties like "isRole", "isResource" or "isAccessMode" for domain entities like "pharmacist", "prescription" or "read", respectively.
Fig. 1. Domain and security model according to a security requirements analysis
Fig. 1 shows how we combine the policy model with the actual domain. Note that we use the term security model to emphasize that all types of security-related measures are potentially modeled there. In this paper we focus only on access control policies. In order to use the access control policy model within a specific domain, we first have to identify all concepts which occur in the real domain environment. The result is called the domain model and is also depicted in Fig. 1. The mapping of security model elements to domain model elements introduces a method we call "security typing". Security typing simply relates security model elements to domain model entities as properties of those entities.
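Security typing can be illustrated with a small sketch. The Python below is only an illustration under our own assumptions: the names `SecurityProperty`, `DomainEntity`, `security_type` and `entities_with` are hypothetical and not part of the framework itself. It tags the three example domain entities with the RBAC properties named in the text and then selects entities by property.

```python
from dataclasses import dataclass, field
from enum import Enum


class SecurityProperty(Enum):
    """Properties inferred from the RBAC policy model (names taken from the text)."""
    IS_ROLE = "isRole"
    IS_RESOURCE = "isResource"
    IS_ACCESS_MODE = "isAccessMode"


@dataclass
class DomainEntity:
    """A concept of the domain model, e.g. 'pharmacist' (hypothetical class)."""
    name: str
    security_types: set = field(default_factory=set)


def security_type(entity, prop):
    """Attach a security model element to a domain entity as one of its properties."""
    entity.security_types.add(prop)
    return entity


def entities_with(prop, entities):
    """Select the names of the domain entities that carry a given security property."""
    return [e.name for e in entities if prop in e.security_types]


pharmacist = security_type(DomainEntity("pharmacist"), SecurityProperty.IS_ROLE)
prescription = security_type(DomainEntity("prescription"), SecurityProperty.IS_RESOURCE)
read = security_type(DomainEntity("read"), SecurityProperty.IS_ACCESS_MODE)

roles = entities_with(SecurityProperty.IS_ROLE, [pharmacist, prescription, read])
```

Keeping the security properties as a set on each entity lets one domain entity carry several security types, which the text does not rule out.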
4 Entity Models
In this paper we deal with two entity models, the domain model and the security model (domain access control policy model). In this section we elaborate on
An Authoring Framework for Security Policies
our viewpoint on the importance of model-driven engineering for designing and developing access control authoring tools in the healthcare scenario. We do this by describing the two entity models briefly introduced in Sec. 3.
4.1 Domain Model
The domain model is the core resource for which certain authoring aspects are made available. As mentioned earlier, the domain model describes all concepts which occur in the real environment, whereas domain model instances describe the actual entities which conform to the domain model. The healthcare domain, e.g., can be described by a set of healthcare-related roles, medical documents (EHRs) of different types, healthcare institutions and care activities. Furthermore, relationships between the various constituents of this domain can be identified. The general domain model, which describes the real domain environment (e.g., healthcare), can be connected with other models that address specific aspects of that domain (e.g., security); we call such a model an aspect model. By mapping both models, we can develop or generate executable applications dealing with certain aspects of the domain.
4.2 Access Control Aspect Model
Fig. 1 depicts the modeling process performed by domain experts, which proceeds step by step through (i) the analysis of the domain model in order to gather knowledge about certain aspects and requirements; (ii) the selection of security properties which are appropriate to the targeted security requirements; and finally (iii) the mapping of the domain model and the aspect model in order to highlight how arbitrary domain model elements have to be represented during an authoring process. The access control policy model is the aspect model we focus on in this work. It covers role/subject entities for which access to resource entities shall be (dis)allowed subject to a set of conditions. Conditions are also expressible within the model and consist of context attributes which have to be verifiable. During our research we identified types of potential access control rules. The policy model has to be expressive enough to cover all of the following concepts:
– Permissions based on hierarchical roles allowed or denied to access resources
– Permissions based on the identity of individuals
– (Restricted) delegation of permissions
– Permissions targeting resources, either by their type or identifiers
– Permissions bound to conditions, possibly of the following types or any combination of them: temporal, location-based, (medical) session-specific, purpose-based and obligation-bound.
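The concepts listed above can be gathered in one record type. The following is a minimal sketch under our own assumptions, not the policy model of the framework: the class and field names (`Permission`, `Condition`, `delegated_by`, and so on) are hypothetical, and the validation rule that a permission names exactly one kind of subject and one kind of resource reference is our reading of the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Condition:
    """One verifiable constraint attached to a permission."""
    kind: str   # "temporal", "location", "session", "purpose" or "obligation"
    value: str  # context attribute value to verify, e.g. "treatment"


@dataclass(frozen=True)
class Permission:
    """A single access control rule (hypothetical structure)."""
    effect: str                          # "allow" or "deny"
    access_mode: str                     # e.g. "read"
    role: Optional[str] = None           # role-based subject ...
    individual: Optional[str] = None     # ... or identity-based subject
    resource_type: Optional[str] = None  # target all EHRs of one type ...
    resource_id: Optional[str] = None    # ... or one EHR by its identifier
    delegated_by: Optional[str] = None   # set when the permission is delegated
    conditions: tuple = ()               # any combination of Condition objects

    def __post_init__(self):
        # Assumption: exactly one subject kind and one resource reference per rule.
        if (self.role is None) == (self.individual is None):
            raise ValueError("give either a role or an individual")
        if (self.resource_type is None) == (self.resource_id is None):
            raise ValueError("give either a resource type or a resource id")


rule = Permission(
    effect="allow",
    access_mode="read",
    role="pharmacist",
    resource_type="prescription",
    conditions=(Condition("purpose", "treatment"),),
)
```

Frozen dataclasses are used so that a rule cannot be altered after it has been checked; an authoring tool would create a new rule instead.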
Hierarchical roles [12] are used to express institutional roles and how they relate to the domain. E.g., clinicians are part of a healthcare institution (parent role) and consist of surgeons, internists, anesthetists and others (child roles).
A role used within a permission assignment implies that all child roles are assigned that permission as well, since the parent role generically covers a group of roles with certain attributes in common.
Permissions for subjects or individuals are assigned to individual persons rather than roles. Such assignments are especially practical, e.g., to declare a specific doctor as someone's family practitioner. Such a practitioner may receive extended permissions compared to those granted through the general domain role assigned to her/him.
Delegation of permissions is useful if, for a specifically stated purpose, an actor needs access to resources she/he normally does not have. One such case arises if the identified individual cannot maintain her/his medical records and the access rules for them, e.g., due to a certain disability. Delegates may include, e.g., family members or nursing staff.
Resources, which are the target of permissions, may be referenced in two different ways. Each single EHR is labelled with a type, so permissions can target all EHRs of a common type. Such types include, e.g., prescriptions, medical treatment reports or discharge letters. Further, specific EHRs can be the target of a permission by explicitly referring to their unique document identifiers.
Various types of conditions increase the expressiveness of permissions. We identify five different types of conditions:
– Temporal conditions express time and date constraints on access requests.
– Location-based conditions permit access only at certain locations. Such a condition can, e.g., express "only at the employing healthcare institution" in order to prevent physicians from accessing medical records from home. Another example of location-based conditioned access is to establish the "four-eye principle", in which a physician is only allowed to access a patient's medical data if the corresponding patient is attending a medical session.
– (Medical) session-specific conditions declare EHRs to be (un)available to the practitioner performing a medical treatment session. Such a condition may internally be represented by temporal and location-based conditions covering the session's appointed date and place, respectively.
– Purpose-based conditions are set in order to verify whether the purpose of usage of EHRs stated by the patient matches the intended usage by the practitioner or researcher. Different kinds of purposes may be related within a hierarchy [12]. A chosen purpose of usage for an EHR represents the "maximal" allowed purpose (i.e., regarding the severity of, e.g., risked privacy) for which the data is intended to be accessed, processed or distributed.
– Obligation-bound access declares certain actions to be fulfilled before access is granted. E.g., patient notification, if demanded by the patient, or a request for permission by a physician can be established that way.
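Two of the rules described above lend themselves to a short sketch: a permission granted to a parent role also holds for its child roles, and a requested purpose is acceptable only if it does not exceed the "maximal" purpose chosen by the patient. The code below is a minimal illustration under stated assumptions. The role names follow the text, but the purpose labels and their ordering (`treatment` below `research` below `distribution`) are invented for the example, as are the function names.

```python
# Child -> parent relations. Role entries follow the example in the text;
# the purpose entries and their ordering are illustrative assumptions.
ROLE_PARENT = {
    "surgeon": "clinician",
    "internist": "clinician",
    "anesthetist": "clinician",
}
PURPOSE_PARENT = {"treatment": "research", "research": "distribution"}


def covers(parent_of, ancestor, node):
    """Return True if `node` equals `ancestor` or lies below it in the hierarchy."""
    while node is not None:
        if node == ancestor:
            return True
        node = parent_of.get(node)
    return False


def is_permitted(rule_role, max_purpose, request_role, request_purpose):
    """Check one rule against one request (roles and purposes only)."""
    # A permission for a parent role implies the permission for all child roles.
    role_ok = covers(ROLE_PARENT, rule_role, request_role)
    # The requested purpose must not exceed the maximal purpose set by the patient.
    purpose_ok = covers(PURPOSE_PARENT, max_purpose, request_purpose)
    return role_ok and purpose_ok


surgeon_may_treat = is_permitted("clinician", "research", "surgeon", "treatment")
surgeon_may_research = is_permitted("clinician", "treatment", "surgeon", "research")
```

Both hierarchies are walked with the same helper, which mirrors the text's remark that purposes, like roles, may be related within a hierarchy. Temporal, location-based, session-specific and obligation-bound conditions would need additional context attributes and are left out here.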
5 Policy Authoring Environment
Continuing from the secured domain model, where we relate security entities to domain entities, we declare editing functionality through user interface (UI) input controls. We call this method "interface typing"; it is based on domain entities which have previously been associated with security properties
Fig. 2. Security domain with user interface annotations. Generation of authoring tools according to transformation models.
through security typing. Fig. 2 shows this by assigning UI input controls to the domain access control policy model. Such input controls are defined within the user interface model and realized by a concrete UI development kit (see UI components as an instance of the user interface model). Typically, textual input fields/areas, item and option selections as well as command buttons are provided. In addition, a general style and textual output elements are provided to increase the structural quality, readability and usability of the UI. With the secured domain interface model we have now gathered security-relevant domain entities together with additional information on how input data can be acquired. Still, in order to complete the transformation from the extended domain model to an authoring tool for security properties of the domain, a transformation model [3] is put in place. The following tasks are carried out using this transformation model: (i) UI layout and UI component grouping, (ii) interface realization to auxiliary system components, and (iii) stakeholder-centric views. UI component layout is performed for both input controls and output components according to the policy constructs and their semantics. Further, a usable UI has to be generated by taking into account certain layout rules and coded best practices. E.g., a statement like "subject is allowed to perform action on target EHR, if condition is satisfied and private health data is at most used for the purpose of purpose." can be inferred from our access control policy model. UI components, i.e., input controls for domain entity placeholders (emphasized by italic font in the access control statement) as well as text labels that complete a natural language sentence, have to be logically grouped.
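The grouping of input controls and text labels around such a sentence can be sketched as follows. This is an illustration only: which words of the statement are placeholders is our guess (the italics are lost in this text), the `$`-prefixed slot names and the function `ui_parts` are hypothetical, and no real UI toolkit is involved.

```python
import re
from string import Template

# Sentence pattern adapted from the text; $-prefixed words mark the assumed
# domain entity placeholders that would become input controls.
STATEMENT = Template(
    "$subject is allowed to perform $action on $target, "
    "if $condition is satisfied and private health data is at most "
    "used for the purpose of $purpose."
)


def ui_parts(sentence):
    """Split a sentence pattern into ordered ("input" | "label", text) parts."""
    parts = []
    for chunk in re.split(r"(\$\w+)", sentence):
        if not chunk:
            continue  # re.split yields an empty string before a leading placeholder
        kind = "input" if chunk.startswith("$") else "label"
        parts.append((kind, chunk.lstrip("$")))
    return parts


parts = ui_parts(STATEMENT.template)
input_slots = [text for kind, text in parts if kind == "input"]

# Filling every slot yields the natural language statement shown to the user.
rendered = STATEMENT.substitute(
    subject="Family practitioner",
    action="read",
    target="discharge letter",
    condition="an ongoing medical session",
    purpose="treatment",
)
```

Keeping the parts in sentence order means a generator could lay out labels and input controls left to right and still read as one sentence.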
The transformation model, which is defined in close relation to the selected security model, is responsible for realizing interfaces for communication with auxiliary system components. E.g., queries to a policy repository and the provisioning of enforceable rules to a security engine have to be implemented. Further, a source for retrieving domain model instances has to be defined there. The transformation process is performed with regard to the stakeholder, which leads to customized authoring environments with full or limited functionality available to certain types of users. Additionally, each stakeholder has only
access to a restricted set of the overall domain entity instances usable for permission creation. According to this subset, specific queries are generated to fetch all accessible domain entity instances. Finally, it must be easy to integrate the generated authoring applications into existing healthcare portal applications, which are already in place in many healthcare institutions.
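The stakeholder-specific queries mentioned above can be pictured with a very small sketch. Everything in it is assumed for illustration: the in-memory list stands in for whatever source of domain model instances is configured, and restricting a stakeholder to the records she or he owns is only one possible restriction rule.

```python
# Hypothetical in-memory stand-in for a source of domain model instances.
INSTANCES = [
    {"id": "ehr-1", "type": "prescription", "owner": "patient-a"},
    {"id": "ehr-2", "type": "discharge letter", "owner": "patient-a"},
    {"id": "ehr-3", "type": "prescription", "owner": "patient-b"},
]


def build_query(stakeholder_id):
    """Generate a predicate selecting the instances a stakeholder may use."""
    # Assumption: a stakeholder may only author permissions for owned records.
    return lambda instance: instance["owner"] == stakeholder_id


def accessible_instances(stakeholder_id, instances=INSTANCES):
    """Fetch the identifiers of all instances accessible to the stakeholder."""
    query = build_query(stakeholder_id)
    return [instance["id"] for instance in instances if query(instance)]


patient_a_view = accessible_instances("patient-a")
```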
6 Conclusion
This paper presents an overview of a conceptual framework for authoring certain domain aspects. The domain aspects addressed are defined by aspect models. A security model, e.g., is the aspect model of our healthcare domain model, while a user interface model is the aspect model of the secured domain model. Transformation models are used to lay out the generated authoring applications according to usability best practices. Our future research will elaborate each step performed within this framework. A realistic domain model is currently being discussed with our partner from the healthcare industry. Once the domain model is established, we will decide on a security model which fits the access control requirements of the given domain. Our focus will be entirely on access control in EHR-maintaining infrastructures, where access control is managed by the patients identified by such health records. Further, we will conduct real usability studies with different domain stakeholders in order to establish best practices in designing patient-controlled authoring tools for access control of EHRs.
References
1. IBM Austria: Feasibility study for implementing the electronic health record (ELGA) in the Austrian health system. IBM (November 2006)
2. Basin, D., Clavel, M., Egea, M., Schläpfer, M.: Automatic generation of Smart, Security-aware GUI Models. In: Massacci, F., Wallach, D., Zannone, N. (eds.) ESSoS 2010. LNCS, vol. 5965, pp. 201–217. Springer, Heidelberg (2010)
3. Bézivin, J., Büttner, F., Gogolla, M., Jouault, F., Kurtev, I., Lindow, A.: Model transformations? Transformation models! In: Wang, J., Whittle, J., Harel, D., Reggio, G. (eds.) MoDELS 2006. LNCS, vol. 4199, pp. 440–453. Springer, Heidelberg (2006)
4. Blobel, B.: Authorisation and access control for electronic health record systems. International Journal of Medical Informatics 73(3) (2004)
5. Karat, C., Karat, J., Brodie, C., Feng, J.: Evaluating interfaces for privacy policy rule authoring. In: SIGCHI 2006. ACM, New York (2006)
6. Katt, B., Breu, R., Hafner, M., Schabetsberger, T., Mair, R., Wozak, F.: Privacy and Access Control for IHE-Based Systems. In: Weerasinghe, D. (ed.) eHealth 2008. LNICST, vol. 1, pp. 145–153. Springer, Heidelberg (2009)
7. Katt, B., Trojer, T., Breu, R., Schabetsberger, T., Wozak, F.: Meeting EHR Security Requirements: SeaaS approach. In: EFMI STC 2010, Reykjavik, Iceland (June 2010)
8. Katt, B., Trojer, T., Breu, R., Schabetsberger, T., Wozak, F.: Meeting EHR Security Requirements: Authentication as a Security Service. In: Perspegktive Workshop, GMDS 2010, Mannheim, Germany (September 2010)
9. Lodderstedt, T., Basin, D., Doser, J.: SecureUML: A UML-based modeling language for model-driven security. In: Jézéquel, J.-M., Hussmann, H., Cook, S. (eds.) UML 2002. LNCS, vol. 2460, p. 426. Springer, Heidelberg (2002)
10. Reeder, R.W., Karat, C., Karat, J., Brodie, C.: Usability challenges in security and privacy policy-authoring interfaces. In: Baranauskas, C., Abascal, J., Barbosa, S.D.J. (eds.) INTERACT 2007. LNCS, vol. 4663, pp. 141–155. Springer, Heidelberg (2007)
11. Røstad, L.: An initial model and a discussion of access control in patient controlled health records. In: ARES 2008. IEEE Computer Society, Washington, DC, USA (2008)
12. Sandhu, R., Coyne, E., Feinstein, H., Youman, C.: Role-based access control models. Computer 29 (1996)
13. Trojer, T., Fung, B., Hung, P.: Service-oriented architecture for privacy-preserving data mashup. In: IEEE International Conference on Web Services (2009)
Detecting Public Health Indicators from the Web for Epidemic Intelligence
Avaré Stewart, Marco Fisichella, and Kerstin Denecke
L3S Research Center, Appelstr. 9A, Hannover, Germany
{stewart,fisichella,denecke}@L3S.de
Abstract. Recent pandemics, such as Swine Flu, have caused concern for public health officials. Given the ever-increasing pace at which infectious diseases can spread globally, officials must be prepared to react sooner and with greater epidemic intelligence gathering capabilities. However, state-of-the-art systems for Epidemic Intelligence have not kept pace with the growing need for more robust public health event detection. In this paper, we propose an approach that shifts the paradigm for how public health events are detected. Instead of manually enumerating linguistic patterns to detect public health events in human language text (pattern matching), we propose the use of statistical approaches, which learn the patterns of public health events in an automatic or unsupervised way.
Keywords: Epidemic Intelligence, Surveillance and Analysis.
1 Introduction
Many factors in today's changing society, such as demographic change, globalization, terrorism, and the resilient nature of viruses, contribute to the continuous emergence of infectious diseases. Emerging infectious diseases are those considered to be either completely new, resistant, or reoccurring. Only early detection of disease activity, followed by a rapid response, can mitigate the impact of epidemic threats [1]. As a result, the multi-disciplinary area of Event-Based Epidemic Intelligence (EI) has emerged as a body of work devoted to the early identification of potential health threats from unstructured text present on the Web. State-of-the-art systems in EI are Automatic Event-Based systems [1]. The algorithms used in these systems typically detect disease-related activity by relying upon predefined templates, such as keywords or patterns, within the text. However, a major drawback is that the only indicators of public health events such a system is capable of detecting are those that are explicitly under surveillance. This limitation poses a problem for an early detection system, for example, if a disease is emerging and can only be characterized by symptoms and has no known name. The first step toward overcoming this limitation is to view Epidemic Intelligence in a new light.
M. Szomszor and P. Kostkova (Eds.): E-Health 2010, LNICST 69, pp. 10–17, 2011. © Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2011
1.1 Proposed Solution
In this work, we address this challenge by seeking to learn patterns in an automatic and unsupervised way, which can then be used as indicators for the presence of a public health event. Instead of using keywords, we use an entity-centric unsupervised learner to automatically detect salient patterns within a document collection. These patterns are intended to be an indication that disease-related activity is occurring; thus, we refer to this task as Public Health Indicator Detection. More specifically, we address the following questions:
1. How can we characterize a Public Health Indicator?
2. How do we measure the quality of Public Health Indicators?
3. When is one set of indicators better than another?
2 Discovering Public Health Indicators
Figure 1 gives an overview of Public Health Indicator Detection, which is outlined in Algorithm 1. Each stage of our algorithm is discussed in detail below.
Fig. 1. Overview: Generating Public Health Indicators
2.1 Document Analysis
Given a finite set of text articles, $A$, we process the raw text of each article to build a vocabulary, $V_T$, of relevant terms. The relevance of a term is determined by a set of allowable types, $T$, which we refer to as a HealthEventTemplate. The types considered are: Location, Medical Condition and Victim. Using the type system and vocabulary, each article is transformed into a vector format. Each entry in the vector corresponds to the frequency with which an entity of the given type appears in the article. Finally, the document surrogates for the set of articles, $D_A := [|A|][|T|][|V_T|]$, are created from the frequency vectors.
2.2 Statistical Pattern Recognition
A key stage in our approach is the use of statistical pattern recognition to discover events. We define an event to be a pattern of entities, which co-occur with
Algorithm 1. Public Health Indicator Generation
Input:
  HealthEventTemplate: T := {Location, MedicalCondition, Victim}
  Collection of articles: A := {a_1, ..., a_n}, where each a_i := {e_1, ..., e_m} and each e_i is an entity, given by a type in T
  K: desired number of candidate indicators
  QualityMeasurementPairs: F := {(f_1, α_1), ..., (f_n, α_n)}
  Pattern Recognition Engine: Φ(D_A, K) |= I^K
Output: I^P, set of Public Health Indicators
begin
  // Feature Extraction:
  Hashtable: V_T := {key = t, value_t = {e_1, ..., e_m}} = ∅
  for each a ∈ A do
    for each t ∈ T do
      W_{a,t} = {e_i | e_i ∈ a ∧ type(e_i) = t}
      V_T.put(t, V_T.get(t) ∪ W_{a,t})
  D_A := [|A|][|T|][|V_T|], construct document surrogates
  // Pattern Recognition:
  Φ(D_A, K) |= I^K
  // Indicator Assessment:
  for each I ∈ I^K do
    for each (f_i, α_i) ∈ F do
      if Quality_f(I) ≥ α then
        I^P = I^P ∪ {I}
end
such saliency that an unlabeled, real-world event can be inferred from the content of the articles that contain mentions of these entities. Since the documents that describe the same event contain similar sets of term co-occurrences, the documents themselves cluster. We propose that these patterns can be found in a statistical manner for public health, without the need for defining linguistic templates to extract the co-occurrence of entities from the text. The statistical patterns we find (i.e., clusterings of documents) are considered to be a "hint" that a potential public health event has occurred, or is currently occurring. As denoted in Figure 1, pattern recognition may be accomplished using either a supervised or an unsupervised approach [2]. In general, the process of recognizing these patterns is taken to be a mapping of each document surrogate to one or more of the $K$ different clusters in the IndicatorCandidate set, $I^K$. We now present definitions for an IndicatorCandidate, an Indicator, and a PublicHealthIndicator.
Definition 1. Let an IndicatorCandidate set, $I^K := \langle C, D, \Phi \rangle$, be a set derived from a pattern recognition engine, $\Phi$, where $C = \{c_1, \dots, c_K\}$ is a set of $K$ clusters; $D$ is a set of document surrogates; $\Phi : D \xrightarrow{w} C$ is a mapping of the document surrogates to one or more of the clusters; and $w$ is a weight representing the confidence associated with the assignment of a surrogate to a cluster.
In general, we say an Indicator, $I$, is a subset of the IndicatorCandidate set such that $|I| = 1$. Based on Definition 1, potentially many Indicators are produced in the IndicatorCandidate set. PublicHealthIndicators are the subset of IndicatorCandidates that remain after filtering according to some criteria for their goodness or quality.
2.3 Indicator Extraction
When is an Indicator good? This question is particularly important when the statistical patterns are recognized in an unsupervised manner since, in general, many clusters may be produced, even if there are no natural patterns in the data. Since not all indicators have the same quality, we define Indicator Extraction as the two-stage process of: 1) defining a quality measure to apply to IndicatorCandidates (Indicator Assessment) and 2) selecting a subset of the IndicatorCandidates as PublicHealthIndicators (Indicator Pruning).
Quantitative Assessment. We assess the quality of Indicators based on two criteria: quantitative and qualitative. Quantitatively, the quality of an Indicator can be determined given a set of QualityMeasurementPairs $(f_i, \alpha_i) \in F$, where each $f_i$ is a measurement and $\alpha_i$ is a threshold value for interpreting when the measurement is of good quality. Based on the application of such a measure to one or more Indicators, we prune the IndicatorCandidates to generate a PublicHealthIndicator according to the following:
$$\mathrm{PublicHealthIndicator} = \mathrm{Quality}_f(I) \geq \alpha \qquad (1)$$
A number of measures can be used to determine the quality. For example, precision and recall can be used to assess the quality of the generated indicators, the Response set (Res), with respect to an alternative clustering of the articles, the Reference set (Ref) [3], as follows:
$$\mathrm{Recall}(\mathrm{Ref}, \mathrm{Res}) = \frac{\sum_{c_i \in C_{\mathrm{Ref}}} \left( |c_i| - \mathrm{overlap}(c_i, \mathrm{Res}) \right)}{\sum_{c_i \in C_{\mathrm{Ref}}} \left( |c_i| - 1 \right)} \qquad (2)$$
Qualitative Assessment. Recall from Definition 1 that an optional weight, $w$, may be used to associate a document with a cluster. In the qualitative assessment of Indicators, the statistics describing the distributions of these weights and their overall magnitude are taken into account. We express this in terms of 1) the sum of the weights in a given interval and 2) entropy. Entropy is the measure of the uncertainty or the amount of disorder associated with a distribution. Specifically, a high entropy value means that the articles associated to the given indicator cluster have diverse probabilities and, if we consider that the samples are sorted
in a descending order, this describes a distribution which decreases rapidly. On the other hand, a low value denotes similar probabilities of articles associated to the given indicator. Entropy is defined as follows:
$$H = -\sum_{i=1}^{N} w_i \log_2 w_i \qquad (3)$$
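The two qualitative statistics can be computed directly from the list of weights of one indicator. The sketch below is a minimal illustration under our own assumptions: the function names are hypothetical, the "given interval" is taken to be a quartile of the weights sorted in descending order (as used later in the experiments), and weights equal to zero are skipped because their logarithm is undefined.

```python
import math


def entropy(weights):
    """Equation 3: H = -sum(w_i * log2(w_i)), skipping zero weights."""
    return -sum(w * math.log2(w) for w in weights if w > 0)


def quartile_sums(weights):
    """Sort the weights in descending order and sum them within each quartile."""
    ordered = sorted(weights, reverse=True)
    n = len(ordered)
    # Index boundaries of the four quartiles of the sorted list.
    bounds = [round(q * n / 4) for q in range(5)]
    return [sum(ordered[bounds[q]:bounds[q + 1]]) for q in range(4)]


# Made-up weights for one indicator, chosen so the arithmetic is exact.
example_weights = [0.125, 0.5, 0.25, 0.125]
example_entropy = entropy(example_weights)
example_sums = quartile_sums(example_weights)
```

For these four weights the entropy is 0.5·1 + 0.25·2 + 2·(0.125·3) = 1.75, and each quartile holds exactly one weight.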
3 Experiments
The goal of our experiments is twofold. First, we quantitatively compare the indicators extracted with an unsupervised approach to state-of-the-art, template-based extractions. Second, we qualitatively assess how the magnitude of the weights associating a document with a cluster influences the quality of an indicator.
3.1 Experimental Setting
Data Sets. To build our document collection, we downloaded the web pages for each URL listed in the source column of the PULS fact base [4] for the period from January 1 to December 31, 2009. Of the 2,587 documents collected, we used the 1,280 documents for which the PULS date column could be automatically computed as a timestamp. For the same time period, we also collected the records present in the PULS fact base to use as a benchmark. We used both the OpenCalais and UMLS MetaMap entity extraction tools. Since MetaMap produces multiple named entity candidates, a deterministic choice for selecting the correct annotation automatically is error-prone. OpenCalais, on the other hand, does not recognize entities as victims, but has a high precision for detecting the other entities given by the HealthEventTemplate.
3.2 Template Matching Benchmark
The benchmark system aggregates facts into the same group, or equivalence class, if they share the same disease and county within a temporal window of 15 days. Based on these criteria, the records of the PULS fact base that we collected yielded a total of 524 clusters and 3,722 documents. From the 524, we used those clusters that contained at least 10 documents; this amounted to 70 clusters.
3.3 Statistical Pattern Recognition
Numerous techniques exist for detecting events in an unsupervised way. In this work, we base our unsupervised event detection algorithm on the Retrospective Event Detection algorithm [5]. This model for event detection provides a framework for handling multiple entity types. We extend this method to handle those defined by the HealthEventTemplate.
3.4 Results
Part I: Quantitative Assessment of Indicators. The goal of this experiment was to compare the quality of the indicators discovered with our approach to those extracted using a template-based method. Using PULS as the Reference and our indicators as the Response, we computed precision and recall following Equation 2. Figure 2 shows the precision and recall for various cluster sizes.
[Figure 2: two panels, Precision and Recall (y-axis, 0 to 1), each plotted against cluster size (x-axis, 0 to 75).]
Fig. 2. Extrinsic Indicator Evaluation wrt. Template-based Detection
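For readers who want to reproduce this kind of comparison, Equation 2 can be sketched as below. This is a hedged illustration, not the evaluation code behind Figure 2. It assumes that overlap(c_i, Res) counts the number of parts into which a reference cluster is split by the response clusters, with each document missing from the response counted as its own part; that the response is a hard clustering (no document in two clusters); and that precision is obtained by swapping the roles of the two clusterings. None of these three points is stated in the text.

```python
def overlap(ref_cluster, response):
    """Number of parts a reference cluster is split into by the response clusters."""
    hit = sum(1 for res_cluster in response if ref_cluster & res_cluster)
    covered = set().union(*response)
    # Documents absent from every response cluster count as singleton parts.
    missing = len(ref_cluster - covered)
    return hit + missing


def recall(reference, response):
    """Equation 2, with clusters given as sets of document identifiers."""
    numerator = sum(len(c) - overlap(c, response) for c in reference)
    denominator = sum(len(c) - 1 for c in reference)
    return numerator / denominator if denominator else 0.0


def precision(reference, response):
    """Assumed definition: recall with Reference and Response swapped."""
    return recall(response, reference)


# Made-up document identifiers, for illustration only.
reference = [{1, 2, 3, 4}, {5, 6}]
response = [{1, 2}, {3, 4}, {5, 6}]
example_recall = recall(reference, response)
example_precision = precision(reference, response)
```

In this toy case the response splits the first reference cluster in two, so recall is (2 + 1) / (3 + 1) = 0.75, while every response cluster lies inside a single reference cluster, so precision is 1.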
The first observation we make is that the overall precision and recall seem fairly low for most of the clusters, with the majority of the values falling in the range of .02 to .04. This can be explained by the fact that the benchmark set used only two entity types (disease and location) to cluster indicators, whereas we used three entity types, as defined by the HealthEventTemplate. This suggests that further experiments are needed that select the same entity types as in the benchmark set. We also notice several spikes in the graph for both precision and recall, reaching a maximum value of 1 for different values of K. Upon closer inspection, we notice that when the precision or recall reached these values, the victim entity type made a much smaller contribution than the other two types in our HealthEventTemplate. We also notice that there are several values of K for which an alignment above .8 occurs. We believe this is already a promising result, showing that the statistical approach does, at least, align with a template-based approach.
Part II: Qualitative Assessment of Indicators. In Figure 3a, the weights are expressed by the probability of a document, given an event, according to the Retrospective Event Detection algorithm mentioned in Section 3.3. In Figure 3 we show the results for each quartile, given a randomly selected event. As can be seen, the larger-magnitude weights are mainly contained in the first quartile, while the values for the other quartiles, related to different values of K, are almost zero. For a small value of K, we notice bigger values for the sum
[Figure 3: two panels plotted against the number of clusters K (x-axis, 10 to 500), one series per quartile (1st to 4th): (a) sum of probabilities (log scale) and (b) entropy.]
Fig. 3. Qualitative values over the number of clusters K
of probabilities. This is due to the fact that for small values of K, each cluster is represented by a larger grouping of documents; hence more probabilities are summed. On the surface, this would suggest that a smaller value of K is better. However, this value does not reveal any information about the order (or disorder) among the probabilities of documents associated with the event. To examine this, we compute the entropy (Figure 3b). Such a measure indicates the disorder for each quartile, zero entropy being the best value. As can be observed from this figure, a small value of K can have a high entropy value, while for K = 500 the entropy is almost zero.
3.5 Discussion
These preliminary results suggest that an unsupervised approach to detecting public health indicators can, at least, align with indicators that have been detected with a template-based approach. Also, based on an extrinsic quantitative evaluation, we would prune indicators that have a precision and recall below 80%. In the unsupervised approach, we produce many more indicators than in the template approach, given the number of entities defined in the HealthEventTemplate. Further evaluation is needed for the non-overlapping indicators we detected. Finally, we note that numerous systems exist to detect public health events [1,6]. None of these existing Event-Based EI systems uses an unsupervised event detection approach. As such, they do not allow public health events to be identified in the absence of predefined matching keywords or linguistic rules.
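The pruning rule mentioned above (Equation 1 applied with thresholds of 80%) can be sketched as a filter over candidate indicators. The snippet is an illustration under stated assumptions: the candidate records and their scores are invented for the example and are not experimental results, the function name `prune` is hypothetical, and requiring every measure to reach its threshold is our reading; Algorithm 1 as printed could also be read as accepting a candidate when any single measure passes.

```python
def prune(candidates, quality_pairs):
    """Keep candidates whose every quality measure f reaches its threshold alpha."""
    return [
        candidate
        for candidate in candidates
        if all(f(candidate) >= alpha for f, alpha in quality_pairs)
    ]


# Invented scores, for illustration only.
candidates = [
    {"id": "c1", "precision": 0.90, "recall": 0.85},
    {"id": "c2", "precision": 0.90, "recall": 0.40},
    {"id": "c3", "precision": 0.03, "recall": 0.02},
]

# QualityMeasurementPairs (f_i, alpha_i): a measurement and its threshold.
quality_pairs = [
    (lambda c: c["precision"], 0.8),
    (lambda c: c["recall"], 0.8),
]

kept = [c["id"] for c in prune(candidates, quality_pairs)]
```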
4 Conclusions and Future Work
We introduced our approach to the discovery of public health indicators and presented formalizations for characterizing public health indicators. We realized the approach by discovering indicators in an unsupervised manner and further presented a framework to evaluate their quality. We have shown that a statistical approach to detecting public health indicators can produce indicators that
are similar to those of a template matching algorithm. The impact of this work is that epidemic investigators can now rely upon alternative sources and techniques to corroborate information about public health events. This is important, since a diversity of information sources and detection techniques can offer an additional means of mitigating the impact of potential threats. In future work, a more detailed evaluation of the proposed algorithm will be undertaken. This includes additional measures, such as the B-Cube measure, for computing precision and recall. Also, it should be noted that many factors influence the quality of public health indicators, for example, the existing prevalence levels of a disease, or even the personal preferences of the information seeker, such as their geographical location or occupation. Assessing the quality of an indicator based on such factors requires a more robust qualitative evaluation with input from domain experts. We plan this as future work.
Acknowledgement. This work was funded, in part, by the European Commission Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 247829.
References
1. Hartley, D., Nelson, N., Walters, R., Arthur, R., Yangarber, R., Madoff, L., Linge, J., Mawudeku, A., Collier, N., Brownstein, J., Thinus, G., Lightfoot, N.: The landscape of international event-based biosurveillance. Emerging Health Threats (2009)
2. Jain, A.K., Duin, R.P.W., Mao, J.: Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1), 4–37 (2000)
3. Bagga, A., Baldwin, B.: Algorithms for scoring coreference chains. In: Language Resources and Evaluation Workshop on Linguistics Coreference, pp. 563–566 (1998)
4. Yangarber, R., von Etter, P., Steinberger, R.: Content collection and analysis in the domain of epidemiology. In: International Workshop on Describing Medical Web Resources (2008)
5. Li, Z., Wang, B., Li, M., Ma, W.-Y.: A probabilistic model for retrospective news event detection. In: SIGIR 2005, pp. 106–113 (2005)
6. Linge, J.P., Steinberger, R., Weber, T.P., Yangarber, R., van der Goot, E., Khudhairy, D.H.A., Stilianakis, N.I.: Internet surveillance systems for early alerting of health threats. Eurosurveillance 14(13) (2009)
#Swineflu: Twitter Predicts Swine Flu Outbreak in 2009
Martin Szomszor1, Patty Kostkova1, and Ed de Quincey2
1 City eHealth Research Centre, City University, London, UK
2 School of Computing and Mathematics, University of Greenwich, UK
[email protected], [email protected]
Abstract. Early warning systems for the identification and tracking of infectious disease outbreaks have become an important tool in the field of epidemiology. While government-led initiatives to increase the sharing of surveillance data have improved early detection and control, along with advanced web monitoring and analytics services, the recent swine flu outbreak of 2009 demonstrated the important role of social media and the wealth of data it exposes. In this paper, we present an investigation into Twitter, using around 3 million tweets gathered between May and December 2009, as a possible source of surveillance data, and assess its feasibility as an early warning system. By performing simple filtering and normalization, we demonstrate that Twitter can serve as a self-reporting tool and hence provide indications of increased infection spreading. Our initial findings indicate that Twitter can detect such events up to one week before conventional GP-reported surveillance data.
Keywords: Epidemic Intelligence, Twitter, H1N1, Pandemic Flu.
1 Introduction
Social media, such as blogging, social networking, Wikis, etc., has attracted much interest recently as a possible source of data for epidemic intelligence (EI). The real-time nature of micro-blogging and status updating presents unique opportunities to gather information on large numbers of individuals and to enhance early warning outbreak detection systems. During the 2009 swine flu outbreak, Twitter (a popular micro-blogging website) received a substantially increased amount of traffic related to swine flu, with many individuals reporting that they had contracted the virus. While traditional EI systems, such as GPHIN and Medisys, are well established and used routinely by the European Centre for Disease Prevention and Control (ECDC) and the World Health Organisation (WHO), new sources of data are constantly under review. Recent work [1] by companies such as Google has demonstrated that online search queries for keywords relating to flu and its symptoms can serve as a proxy for the number of individuals who are sick. However, such search data remains proprietary and is therefore not available for research or the construction of non-commercial applications. In contrast, Twitter data is publicly available and offers a highly accessible view into people's online and offline real-time activity. In this paper, we present our analysis of Twitter data from May until December 2009, and demonstrate its potential as a data source for early warning systems.
M. Szomszor and P. Kostkova (Eds.): E-Health 2010, LNICST 69, pp. 18–26, 2011. © Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2011
2 Background
2.1 Epidemic Intelligence
Epidemic Intelligence (EI) is the automated early identification of health threats and disease outbreaks, together with their verification, risk assessment and investigation, in order to inform health authorities about the measures required to protect citizens [2, 3, 4]. This is of particular concern in situations of mass gatherings (e.g. sports events such as World Cups and Olympics, festivals, etc.) and humanitarian emergencies [5]. European, national and regional level surveillance systems produce routine reports and can provide indications of potential risks and abnormal events. However, more dynamic data collection is needed to identify threats early enough to assess the risk and launch an appropriate response. This process has been strengthened by the International Health Regulations (IHR) [6], coordinated by WHO and signed by all UN member states, which require states to report incidents of infectious diseases to facilitate outbreak prevention at the source, rather than at borders. ECDC [7] in Europe has proposed an improved epidemic intelligence framework bringing together indicator-based surveillance and event-based surveillance. This new legal framework is further changing the reporting culture, which in the past relied on health authorities in the countries and was therefore often subject to under-reporting due to fear of economic sanctions imposed by the EC and other states. However, in a world of large-scale blogging, social networks and Web 2.0, an outbreak is often discovered through EI tools sooner than the health authorities of the country concerned might learn of it through traditional reporting channels.
2.2 The Role of the Internet and Social Media in EI
Epidemic intelligence has been relying on automated news media searching systems for over a decade. Tools such as the Global Public Health Intelligence Network (GPHIN), developed by Health Canada and in use by WHO, and Medisys, developed by the JRC, gather news from global media to identify disease outbreak threats using multilingual natural language processing and appropriately weighted sets of keywords, categories and taxonomies [8, 9]. In addition, the email-based system ProMED-mail has been an informal source of information on upcoming emergencies [10]. However, with the ever-increasing user activity on the Internet, Web 2.0 and social networks, a valuable real-time source of data to assist this process has become available. Google's Flu Trends research, which evaluated online search queries for keywords relating to flu, estimated upcoming flu epidemics sooner than CDC surveillance data [11]. A similar study, on a smaller scale, was conducted by the NeLI/NRIC portal, identifying user information needs during the swine flu pandemic in 2009 [12]. However, using user searches to assist EI systems requires a global search portal receiving billions of queries a day in order to analyse a sufficient volume of data. A further drawback for public health is that the information is stored in web logs on commercial servers, which cannot be accessed and made available to EI systems. The increase in Web 2.0 and user-generated content via social networking tools such as Facebook and Twitter, however, provides EI systems with a highly accessible source of real-time online activity. Facebook's privacy settings allow users to restrict their profile content and activity; however, Twitter [13], a micro-blogging
service that allows people to post and read other users' 140-character messages, called "tweets", is available in the public domain and is therefore freely searchable and analyzable using a provided API [14]. The information posted on Twitter, currently used by over 15 million unique users per month [15], describes real-time activity due to the social nature of the service, unlike search queries collected by search engines. Therefore, utilizing this increasingly popular, freely available data source has potential for EI and other rapid information intelligence systems. In addition to outbreak detection, which is the aim of our study, Twitter has been shown to be able to track events such as earthquakes and typhoons [16], and both Facebook and Twitter are becoming increasingly popular for raising awareness and funds for global relief [17].
3 Twitter Based Surveillance
Twitter is a micro-blogging service that allows people to send and receive messages, otherwise known as tweets. Tweets are limited to 140 characters and are displayed on the user's profile page. Individuals have their own personalised feed, displaying the recent tweets made by anyone they follow. Users are free to follow any other users and use this facility to build networks that support social, business and academic activities. Current usage estimates place the number of tweets made per day at 65 million.
3.1 Data Collection
We searched for the term 'flu' and collected over 3 million tweets in the period from May 7th until December 22nd 2009, and we continue to collect them at 1-minute intervals. We found just under 3 million tweets containing the keyword "flu", including individuals reporting flu symptoms or self-diagnosing; sharing links to news articles, websites, and blogs; and generally commenting on the topic. The most popular words in these tweets and their frequencies are shown in Table 1.
Table 1. Top 20 most frequently occurring words
Freq        Word      Freq      Word
2,993,022   flu       92,999    #swineflu
1,621,782   swine     88,801    cases
264,903     rt        82,130    #h1n1
223,876     h1n1      71,323    today
195,163     vaccine   69,071    shots
156,658     shot      66,167    hope
109,995     health    64,271    feel
107,675     sick      63,732    school
97,889      news      61,004    :(
3.2 Classification of Tweets
To investigate the use of Twitter as a mechanism for self-reporting of flu, we first classify the tweets using the following classes (a tweet may be placed in more than one class):
1. Tweets containing a link. A popular activity in Twitter is to post a link to a website. Many use this mechanism to link their followers to online news articles, blogs, videos, images, etc. Because of the 140-character limit of tweets and the typically long length of URLs, URL shortening services (such as bit.ly and tinyurl.com) are often used.
2. Retweets. Another popular Twitter behaviour is to retweet a message. In essence, users who see an interesting tweet will pass it on to their followers by reposting the original message and quoting the original author. Retweets themselves often contain links. We search for "rt @" to find retweets.
3. Self-reporting flu. We check the text of each tweet and search for phrases that indicate the user has the flu. These include the phrases "have flu", "have the flu", "have swine flu" and "have the swine flu" in present and past tenses.
Figure 1 contains a time-series plot for the total number of tweets recorded during the period 11-05-2009 until 20-12-2009. A 7-day moving window average is applied to smooth the data. The plot shows the total number of tweets containing the keyword flu (labelled "All Flu Tweets") for each day, the total number of tweets containing a link ("Contains Link"), the total number of tweets reporting flu ("Self Reporting via Twitter"), and the total number of retweets ("Retweets"). Due to technical problems, a section of data is missing for the period 30/08/2009 to 14/09/2009. The time series indicates significant increases in activity around week 30 (20/07/2009), and again around week 40 (28/09/2009). Posting of links constitutes the largest percentage of tweets, around 67%; self-reporting tweets account for around 5%, and retweets for approximately 2%.
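The three classes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the link pattern, the exact regular expression for the self-reporting phrases, and the trailing form of the 7-day window are our own choices, and the function names `classify` and `moving_average` are hypothetical.

```python
import re
from collections import Counter

# Assumed patterns: the text names only the "rt @" marker and four phrases.
LINK_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
RETWEET_RE = re.compile(r"\brt @", re.IGNORECASE)
# "have/has/had/having (the) (swine) flu" covers present and past tenses.
SELF_REPORT_RE = re.compile(
    r"\b(?:have|has|had|having)\s+(?:the\s+)?(?:swine\s+)?flu\b",
    re.IGNORECASE,
)


def classify(text):
    """Return the set of classes a tweet belongs to (may be empty or several)."""
    classes = set()
    if LINK_RE.search(text):
        classes.add("link")
    if RETWEET_RE.search(text):
        classes.add("retweet")
    if SELF_REPORT_RE.search(text):
        classes.add("self_report")
    return classes


def moving_average(values, window=7):
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up example tweets, for illustration only.
tweets = [
    "RT @someone: swine flu cases rising http://bit.ly/abc",
    "I think I have the flu :(",
    "Got my flu shot today",
]
counts = Counter(label for t in tweets for label in classify(t))
```

Because a tweet can fall into several classes, the per-class counts do not need to sum to the number of tweets.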
3.3 Distribution of Links and Retweets
Since the posting of links makes up a significant proportion of flu-related tweets, we decided to perform further analysis of these cases to identify any global trends. An increase in the posting of links could indicate an increased reaction to news and other online media. Figure 2 plots the percentage of tweets for each day that contain a link (left axis), and the percentage of tweets that are retweets (right axis). The plot shows that the posting of links remains relatively constant over time (around 67%). The percentage of retweets displays an overall increase from approximately 0.75% in week 25 to around 3% in week 52. It is not clear from the data we have gathered whether this increase in retweeting is a trend specific to flu-related tweets or a trend across the whole of Twitter. The latter seems more likely, since individuals have become more aware of the retweeting practice in Twitter since the beginning of 2009.
Fig. 1. A time series showing all tweets containing the keyword flu, those containing links, those reporting flu, and retweets. A 7-day moving window average has been applied to smooth the data.
Fig. 2. A plot showing the proportion of tweets each day that contain a link and those that are retweets
4 Experiment
4.1 Correlation with UK National Surveillance Data
To test the accuracy of Twitter as a mechanism for self-reporting flu, and hence its potential to provide early warning detection, we collected official surveillance data from the UK Health Protection Agency (HPA) [18]. The HPA provide weekly reports on the RCGP influenza-like illness (ILI) consultation rate for England and Wales, Scotland, and Northern Ireland. For comparison, we calculate the percentage of tweets that are self-reporting flu for each day in our investigation period. This normalization process means that global trends in Twitter activity (e.g. spam, increased retweeting, and increased posting of links) are not factored in. Instead, the data here shows the number of individuals self-diagnosing as a percentage of all flu-related Twitter activity. The plot shown in Figure 3 contains the HPA RCGP ILI consultation rate for England and Wales (square points, right axis), and the percentage of Twitter activity reporting flu (crossed points, left axis). First impressions reveal a strong correlation between the two data sources: a sharp peak in activity on Twitter (around week 28, 6/07/2009) corresponds to the rapid increase in the number of consultations.
Fig. 3. A plot showing the RCGP ILI rate for England vs the number of self-reported cases on Twitter
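The normalization described in Section 4.1 amounts to dividing, for each day, the number of self-reporting tweets by the number of all flu tweets. A minimal sketch follows, assuming each tweet has already been reduced to a (day, is_self_report) pair; the function name and the sample records are hypothetical and are not taken from the collected data.

```python
from collections import defaultdict
from datetime import date


def daily_self_report_percentage(records):
    """records: iterable of (day, is_self_report) pairs, one per flu tweet.

    Returns {day: percentage of that day's flu tweets that self-report flu}.
    """
    totals = defaultdict(int)
    self_reports = defaultdict(int)
    for day, is_self_report in records:
        totals[day] += 1
        if is_self_report:
            self_reports[day] += 1
    return {day: 100.0 * self_reports[day] / totals[day] for day in totals}


# Made-up records, for illustration only.
records = [
    (date(2009, 7, 6), True),
    (date(2009, 7, 6), False),
    (date(2009, 7, 6), False),
    (date(2009, 7, 6), False),
    (date(2009, 7, 7), True),
]
percentages = daily_self_report_percentage(records)
```

Working with a ratio rather than a raw count is what keeps overall growth in Twitter traffic from showing up as an apparent rise in self-reported cases.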
4.2 Normalized Cross-Correlation: Twitter predicts
To provide some indication of the correlation between Twitter and the official UK surveillance data, we calculate the normalized cross-correlation ratio between various signals from Twitter and the official HPA surveillance data. Since the HPA data is gathered on a weekly basis, we perform the comparison using a weekly aggregation of Twitter data. Equation 1 gives the normalized cross-correlation function we use, where x(t) is the total number of tweets during week t, and y(t − i) is the number of reported cases according to the HPA during week (t − i). We calculate r across all flu tweets, those that are self-reporting, those that contain links, and those that are retweets, for values of i between -4 and 4.
Eq. 1. Normalised Cross-Correlation
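The body of Equation 1 is not reproduced in this text, so the following is only a sketch of one common form of normalised cross-correlation (mean-centred and divided by the product of the signal norms). The authors' exact normalisation may differ, and the function name and toy signals are hypothetical. It follows the convention stated above: x(t) is compared with y(t − i), so a negative i pairs a Twitter week with a later HPA week.

```python
import math


def normalized_cross_correlation(x, y, lag):
    """Correlate x(t) with y(t - lag) over the weeks where both are defined."""
    if len(x) != len(y):
        raise ValueError("signals must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    norm = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    if norm == 0:
        return 0.0  # a constant signal carries no correlation information
    total = 0.0
    for t in range(n):
        s = t - lag
        if 0 <= s < n:
            total += dx[t] * dy[s]
    return total / norm


# Toy weekly signals: y is x delayed by one week, so x "predicts" y.
x = [1, 2, 8, 3, 1, 1]
y = [1, 1, 2, 8, 3, 1]
r = {i: normalized_cross_correlation(x, y, i) for i in range(-4, 5)}
best_lag = max(r, key=r.get)
```

In this toy case the largest value of r occurs at i = -1, which is the situation the text describes as the first signal predicting the second.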
Figure 4 displays the various values of r for weekly offsets between i = −4 and i = 4. The cross-correlation ratio (or sliding dot product) is a measure of how similar two signals are as a function of a time lag applied to one of them. This means that values of r for i=0 represent
Fig. 4. The cross-correlation plot between Twitter and the HPA Surveillance Data
how much the two signals are correlated; when i=-1, r represents how much the first signal predicts the second signal. The higher the value of r, the stronger the correlation. Figure 4 shows that the self-reporting tweets have a strong correlation with the HPA data, whereas the signals for all flu tweets, those containing links, and retweets do not. This would indicate that our filtering and normalization process has been successful, allowing us to discriminate messages that indicate someone has the flu from the general noise on Twitter. Although the strongest correlation occurs at i=0 (when r=0.1140), indicating a co-occurrence of tweets and surveillance data, there is still a strong correlation at i=-1 (when r=0.0750), indicating that the HPA surveillance data could be predicted by Twitter up to 1 week in advance. This demonstrates the potential of Twitter for early warning and outbreak detection.
5 Conclusions and Future Work
In this paper, we have presented our analysis of Twitter data relating to the Pandemic Flu outbreak of 2009. We have shown that although Twitter contains quite a lot of noise in the form of spam, posting of links, and retweets, a simple filtering method can be used to extract those tweets that indicate that a user has the flu. In the future, more advanced computational linguistics will be applied to identify individuals who are reporting flu-like symptoms, as well as those directly reporting having the flu. By comparing the data gathered from Twitter to the official national surveillance data from the HPA, we have shown that Twitter could be used as an early warning detection system: our initial findings indicate that HPA data could be predicted up to one week in advance. Clearly, the use of Twitter in an EI system would provide an even faster response, since official data usually takes some time to collate and process. A further piece of information that is vital to EI systems is that of location. Location awareness is becoming more popular in Twitter and is likely to become a core part of the API in the future. This extra piece of information would provide even more motivation to exploit Twitter in EI systems.
References
1. http://www.google.org/flutrends/
2. Kaiser, R., Coulombier, D., Maldari, M., Morgan, D., Paquet, C.: What is epidemic intelligence, and how it is being improved in Europe? Eurosurveillance 11(2), 60202 (2006)
3. Kaiser, R., Coulombier, D.: Different approaches to gathering epidemic intelligence in Europe. Euro Surveillance 11(17), 2948 (2006)
4. Paquet, C., Coulombier, D., Kaiser, R., Ciotti, M.: Epidemic intelligence: a new framework for strengthening disease surveillance in Europe. Euro Surveillance 11(12), 665 (2006)
5. Coulombier, D., Pinto, A., Valenciano, M.: Epidemiological surveillance during humanitarian emergencies. Médecine tropicale: revue du Corps de santé colonial 62(4), 391–395 (2002)
6. http://www.who.int/ihr/en/
7. http://www.ecdc.europa.eu/en/Pages/home.aspx
8. WHO, http://www.who.int/csr/alertresponse/epidemicintelligence/en/index.html
9. Linge, J.P., Steinberger, R., Weber, T.P., Yangarber, R., van der Goot, E., Al Khudhairy, D.H., Stilianakis, N.I.: Internet surveillance systems for early alerting of health threats. Euro Surveill. 14(13), 1916 (2009)
10. Madoff, L.C.: ProMED-mail: An Early Warning System for Emerging Diseases. Clinical Infectious Diseases 39(2), 227 (2004)
11. Google Flu Trends, http://www.google.org/flutrends/
12. de Quincey, E., Kostkova, P., Wiseman, S.: An investigation into the potential of Web 2.0 websites to tracks disease outbreak. Poster at Infection 2009, Birmingham, UK (2009)
13. Twitter, http://www.twitter.com
14. Williams, D.: API Overview, http://apiwiki.twitter.com/API-Overview
15. http://www.crunchbase.com/company/twitter
16. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes Twitter users: real-time event detection by social sensors. In: WWW 2010: Proceedings of the 19th International Conference on World Wide Web, pp. 851–860. Raleigh, North Carolina (2010)
17. http://www.fastcompany.com/blog/kiteaton/technomix/facebook-twitter-turn-charity-efforts-11
18. http://www.hpa.org.uk/
Identifying Breast Cancer Concepts in SNOMED-CT Using Large Text Corpus
Zharko Aleksovski and Merlijn Sevenster
Philips Research Europe, High Tech Campus 37 (room 2.044), 5656 AE Eindhoven, The Netherlands
[email protected]
Abstract. Large medical ontologies can be of great help in building a specialized clinical information system. The first step in their use is to identify the subset of concepts that are relevant to the specialty. In this paper we present a method to automatically identify the breast cancer concepts from the SNOMED-CT ontology using a large text corpus as a source of knowledge. In addition to being identified, the concepts are also assigned relevance values. In our experiments the method produced results of an overall high quality. The precision was high and the recall was relatively low, but the concepts that were not found are complex and arguably ambiguous, which limits their applicability in practice. This research was application driven, and the breast cancer concepts found have been applied in a real oncology information system.
Keywords: ontology, SNOMED-CT, breast cancer, term frequency.
1 Introduction
M. Szomszor and P. Kostkova (Eds.): E-Health 2010, LNICST 69, pp. 27–34, 2011. © Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2011
Large medical ontologies such as SNOMED-CT contain hundreds of thousands of clinical concepts, usually organized in a hierarchy and interconnected by domain-specific relations, together representing the explicit semantic knowledge describing a medical field. For a given application it is often desirable to restrict oneself to a smaller subontology. But the relevant concepts are rarely found under one sub-branch of the large ontology; instead they are usually scattered over multiple high-level categories, e.g. clinical findings, procedures, body locations, etc. In this paper we describe a study on the identification of breast cancer concepts in SNOMED-CT. We use a large text corpus of medical documents, a portion of which is dedicated to breast cancer, and by analyzing how frequently SNOMED-CT concepts occur in different parts of the corpus we measure how relevant they are to breast cancer. Our experiments show that large text corpora of medical literature can be successfully used to identify the concepts relevant to a clinical setting, in our case breast cancer. The concepts are also assigned a relevance score, such that concepts that are key to breast cancer receive the highest score. Evaluating the
ranked list of concepts revealed that our method exhibits high accuracy: all the top concepts are relevant to breast cancer and the relevancy decreases as we move down the list. In addition to identifying the breast cancer concepts, we examine how complete the resulting list is. First, we find that there are breast cancer concepts in SNOMED-CT that we did not find, most of them because they could not be extracted from the text due to their complex names. This characteristic makes these breast cancer concepts somewhat difficult to use in practice, and it is therefore not a serious drawback of our method. Second, we test whether SNOMED-CT is rich enough to cover the breast cancer field. Using a different experimental setup we show that SNOMED-CT does not seem to lack any key breast cancer concepts. Having identified these breast cancer concepts enables a variety of applications. The most basic ones are highlighting important parts in medical documents, providing more relevant autocompletion suggestions, and building a full-text search engine specialized for oncology-related documents.
Related Work. Identifying disease-centric subdomains in medical ontologies has been studied in [2,6]. The first study uses a predefined set of queries as a source of knowledge to find the concepts, and the second uses formalized clinical guidelines. They expand these initially found sets using the ontology structure, and also the UMLS¹ meta-ontology. Importantly, these studies define a subdomain as a set of concepts, which excludes the possibility that concepts can be relevant to a varying degree. The fact that we allow a gradual relevance measure is a major advantage of this work. The TFIDF measuring scheme is well established in the field of information retrieval. Its basic use is to assign representative keywords to individual documents in a collection, but it has been modified and extended in various ways [1]. Perhaps the most similar application of TFIDF to our work is reported in [5], where it is used to find topic-relevant keywords. Finally, the problem of identifying topic-relevant concepts in an ontology can be compared to ontology modularization [3,4]. Ontology modules are autonomous parts of ontologies intended for reuse, and identifying these modules to some extent resembles our problem.
The paper is structured as follows: in the next section we describe our method in detail together with the experimental results. Sections 3 and 4 report on experiments to test the completeness of the results: in Section 3 we try to find concepts in SNOMED-CT that we might have missed, and in Section 4 we test whether SNOMED-CT is complete enough and contains all the important breast cancer concepts. Section 5 presents the concluding remarks.
2 Methodology: Finding Breast Cancer Concepts in SNOMED-CT
¹ http://www.nlm.nih.gov/research/umls/
Concepts that are frequently used in the context of breast cancer are relevant to the topic: the more frequently used, the more relevant the concept is. But breast
cancer concepts should also have "exclusivity". When they are used primarily to talk about breast cancer and nothing else, that makes them more relevant. Terms like Patient or Disease are frequently used in breast cancer, but due to their general meaning they carry little information in a medical context, so they are not among the most relevant breast cancer concepts. Our method uses a ranking scheme that makes a trade-off between these two requirements. Before presenting the details of the method, we first describe the data that was used in the experiments.
2.1 Experimental Data
We use a large medical text corpus as a source of knowledge to identify the breast cancer concepts in SNOMED-CT. Because part of the corpus consisted of documents about breast cancer and the rest of documents about other medical topics, we were able to observe how often individual SNOMED-CT concepts occur in breast cancer documents, and how often in other medical documents.
SNOMED-CT (Systematized Nomenclature of Medicine - Clinical Terms) is a systematically organized, computer-processable collection of medical terminology covering most areas of medicine such as diseases, findings, procedures, etc. [9]. The version used in this study is from July 2009 and consists of 307,754 concepts interconnected with over 100 relation types. The concepts are organized in 19 mutually exclusive hierarchies called SNOMED categories. Some of these categories are Body structure, Clinical finding, etc. SNOMED-CT has been under active development for more than 35 years, and is constantly evolving.
Text Corpus of Medical Documents. The text corpus comprises 103 medical digital documents, mostly books. We divided the documents in two: the breast cancer corpus of 11 documents about breast cancer, and the general medical corpus of the remaining 92 documents about other medical topics. The breast cancer corpus consisted of 10 million characters, and the general medical corpus consisted of 512 million characters. The texts were extracted from electronic documents in formats like PDF and PS.
2.2 The Method
The method proceeds in two steps: annotation and TFIDF ranking. In the annotation step we extracted SNOMED-CT concepts from the text corpus. We call a single occurrence of a SNOMED-CT concept in a document an annotation. After an exhaustive annotation process we counted how many times each concept was annotated. Then we fed these numbers into an adapted TFIDF ranking scheme, which produces the resulting ranked list of breast cancer concepts.
Annotation. In the annotation step we lexically searched for SNOMED-CT concepts in the text corpus. We considered a concept to be annotated when its label or one of its synonyms was found in the text. The search itself allowed for some flexibility: the case of the words was ignored, stopwords were ignored, the order of the words was ignored as long as they appeared in a consecutive sequence, and the words were stemmed using the Porter stemmer algorithm [7] before being compared.
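The flexible lexical search described above can be sketched as follows. This is an illustration under several assumptions rather than the implementation used in the study: the stopword list is a tiny made-up sample, `stem` is a crude plural stripper standing in for the Porter stemmer [7], stopwords are removed before the consecutive-sequence check, overlapping matches are all counted, and every function name is hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"of", "the", "a", "an", "in", "and"}  # illustrative list only


def stem(word):
    """Crude plural stripper; a stand-in for the Porter stemmer, not the real one."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(text):
    """Lowercase, tokenize, drop stopwords, and stem the remaining words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def build_index(labels_by_concept):
    """Map an order-insensitive key (sorted stems of a label) to a concept id."""
    index = {}
    for concept_id, labels in labels_by_concept.items():
        for label in labels:
            key = tuple(sorted(normalize(label)))
            if key:
                index[key] = concept_id
    return index


def annotate(text, index):
    """Count annotations: every matching run of consecutive tokens counts once."""
    tokens = normalize(text)
    max_len = max((len(key) for key in index), default=0)
    counts = Counter()
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            window = tokens[start:start + length]
            if len(window) < length:
                break  # ran past the end of the text
            concept_id = index.get(tuple(sorted(window)))
            if concept_id is not None:
                counts[concept_id] += 1
    return counts


# Two concepts with their labels and synonyms (codes as listed in Table 1).
concepts = {
    "254837009": ["Breast cancer", "Malignant tumor of breast"],
    "76752008": ["Breast"],
}
index = build_index(concepts)
text = "Cancer of the breast is common. Malignant tumors of the breast were found."
counts = annotate(text, index)
```

Sorting the stems of each label gives a key that is independent of word order, so "cancer of the breast" and "breast cancer" map to the same concept.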
Results of the Annotation. The breast cancer corpus yielded 1,259,844 annotations to 12,647 different concepts, that is, 99.6 annotations per annotated concept on average. These annotations were distributed very unequally over the concepts. After ranking the concepts by number of annotations, the top 5 had more than 10,000 annotations each, the top 100 had more than 2,100 annotations each, and the top 4,560, which is 36% of the annotated concepts, had 10 or more annotations. The general medical corpus yielded 64,248,152 annotations to 30,092 different concepts, and again the annotations were very unevenly distributed over the concepts, comparable to the situation with the breast cancer corpus.
TFIDF. The TFIDF (term frequency - inverse document frequency) ranking measure [8] is used in information retrieval to estimate the importance of terms to a particular document in a collection of documents. It is calculated as
TFIDF(t, d) = TF(t, d) \times IDF(t)
where TF(t, d) is the relative term frequency of the term t within the document d, which is the number of times t occurs in d divided by the number of all term occurrences in d. IDF(t) is a measure of how general the meaning of the term t is. It is obtained by dividing the number of documents in the collection by the number of documents containing the term, and then taking the logarithm of that quotient. If D is the collection of documents, then IDF is calculated as
IDF(t) = \log \frac{|D|}{|\{d \in D : t \in d\}|}
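The TFIDF definition above can be written out directly. The sketch below represents each document as a mapping from concept to annotation count; the natural logarithm, the inclusion of the scored document in the collection D, the function names and the toy counts are all assumptions made for illustration.

```python
import math


def tf(concept, doc_counts):
    """Relative frequency of a concept among all annotations in one document."""
    total = sum(doc_counts.values())
    return doc_counts.get(concept, 0) / total if total else 0.0


def idf(concept, collection):
    """log(|D| / number of documents in D that contain the concept)."""
    containing = sum(1 for doc in collection if doc.get(concept, 0) > 0)
    return math.log(len(collection) / containing) if containing else 0.0


def tfidf(concept, doc_counts, collection):
    return tf(concept, doc_counts) * idf(concept, collection)


# Made-up annotation counts: one breast cancer document, three other documents.
breast_cancer_doc = {"breast cancer": 50, "patient": 40, "tamoxifen": 10}
other_docs = [
    {"patient": 30, "heart": 20},
    {"patient": 10, "lung": 5, "breast cancer": 1},
    {"patient": 25},
]
collection = [breast_cancer_doc] + other_docs
ranking = sorted(
    breast_cancer_doc,
    key=lambda c: tfidf(c, breast_cancer_doc, collection),
    reverse=True,
)
```

In this toy collection "patient" occurs in every document, so its IDF, and hence its TFIDF score, is zero however often it appears in the breast cancer document.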
The intuition behind the TFIDF measure is that a term is descriptive of a document if it occurs frequently in the document and is infrequent in the other documents in the collection. These properties fit the requirements for the relevance ranking scheme that we discussed at the beginning of this section. We used the scheme differently than it was originally intended, as we counted occurrences of concepts and not terms. For instance, when breast cancer or malignant tumor of breast were found in the text they were counted as two occurrences of the same concept, even though they are different terms. Being interested in how important individual SNOMED-CT concepts are to breast cancer, we concatenated the breast cancer documents into one single document. We also cut the general medical documents into smaller chunks of a predefined maximum size of 50,000 characters and considered each of them as a separate document. This was needed because the general medical documents were mainly large books with an average of 5 million characters, and even though they were written on other medical topics, breast cancer specific concepts occurred in most of these documents, which was not a desired property for our ranking scheme. (A good solution to this problem would be to extract the separate sections of these books, but because of the overwhelming manual effort required we turned to this less appealing automatic solution.) After conducting several experiments we chose a size of 50,000 characters per chunk (and
also include the leftover of smaller size as a separate document), though finding the optimal size can be a subject of further investigation. Finally, we restricted our focus to concepts that are annotated at least 10 times in the corpus, discarding the ones with fewer annotations. They can neither achieve a high TFIDF value nor change the ranking order of the other concepts.
2.3 Results
The method produced an ordered list of 4,560 concepts. The top 10 concepts in the list, i.e. the concepts with the highest TFIDF scores, are shown in Table 1. Table 2 shows other parts of the list: the first 3 concepts starting at the 100th, 200th and 500th positions, respectively.

Table 1. Top 10 most relevant breast cancer concepts found by the method

| Rank | Concept's label | SNOMED-CT code | TFIDF |
|------|-----------------|----------------|-------|
| 1 | Breast cancer | 254837009 | 0.0002434223 |
| 2 | Mamma | 181131000 | 0.0002404850 |
| 3 | Breast | 76752008 | 0.0002389841 |
| 4 | Malignancy | 363346000 | 0.0001399625 |
| 5 | Cancer | 86049000 | 0.0000966287 |
| 6 | Mammogram | 71651007 | 0.0000854837 |
| 7 | DCIS | 86616005 | 0.0000617422 |
| 8 | Mastectomy | 172043006 | 0.0000486403 |
| 9 | Excision of breast tissue | 69031006 | 0.0000462173 |
| 10 | Tamoxifen | 373345002 | 0.0000365901 |
Table 2. Selected concepts with their ranking in the breast cancer list of concepts

| Rank | Concept's label | SNOMED-CT code | TFIDF |
|------|-----------------|----------------|-------|
| 100 | PET - Positron emission tomography | 82918005 | 0.0000091759 |
| 101 | FH - Family history | 57177007 | 0.0000091479 |
| 102 | Development of the breasts | 364375002 | 0.0000089859 |
| 200 | Atypical hyperplasia | 32416003 | 0.0000062063 |
| 201 | Specimen | 123038009 | 0.0000061673 |
| 202 | Has specimen | 116686009 | 0.0000061673 |
| 500 | Dense | 255596001 | 0.0000032862 |
| 501 | Phenotype finding | 8116006 | 0.0000032774 |
| 502 | Interested | 225469004 | 0.0000032748 |
Precision. The first 10 concepts in the list are clearly key breast cancer concepts. As we go down the list, it becomes harder to evaluate how the concepts are ranked. We manually inspected the first 100 concepts and another set of 100 concepts drawn randomly from the whole list. The ranking of each individual concept can of course be debated, but there was no concept in the evaluation sets that we could call wrongly ranked. This evaluation suggests that the results are very precise.

When we rank the concepts by the TF component only, most of the top concepts are, interestingly, still key breast cancer concepts like Breast cancer or Breast, but among the first 30 concepts we also find generic concepts like Patient, Study and Clinical. On the other hand, when we rank by IDF only, core concepts like Breast cancer or Breast are further down in the ranking, and the list is topped by concepts that are very specific to breast cancer, like Nipple preserving subcutaneous mastectomy or Mammographic breast density.
3 Evaluation Experiment I: Completeness of the Results
In the previous section we briefly discussed the precision of our results; now we look at the recall, that is, we test whether breast cancer concepts in SNOMED-CT were missed. To make sure that we found all the breast cancer concepts, we would have to check every SNOMED-CT concept. This is unrealistic given the size of SNOMED-CT (over 300,000 concepts), so we chose an alternative. The so-called "seed queries" method reported in [2] also extracts a subdomain from an ontology, but in a very different way. It is reported to have high precision. Comparing against the seed queries can give an indication of the recall of our method.

For this comparison we used a simplified version of the seed queries method. We searched for concepts in SNOMED-CT using six queries: breast cancer, breast carcinoma, breast neoplasm, breast tumor, ductal carcinoma and mastectomy. The search returns every concept that contains all the words of at least one of the queries. It found 355 concepts.

Comparing with the Seed Queries. Of the 355 concepts found by the seed query method, our method finds 24, which puts the recall of our method relative to the seed queries at only 7%, which is very low. We analyzed the seed queries results to find the reasons for this low recall. The majority of the missed concepts have very precise meanings, which is reflected in their linguistically complex labels. Below are some representative examples:

94964004   Neoplasm of uncertain behavior of nipple of female breast
373182002  pT2: Tumor > 2 cm but ≤ 5 cm in greatest dimension (breast)
94182000   Metastatic malignant neoplasm to axillary tail of female breast
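The simplified seed-query search and the recall estimate described above can be sketched as follows. The six queries, the three example labels and the counts (24 of 355) come from the text; the function names, the lowercase word tokenization and the extra non-matching label are illustrative assumptions, and the real search ran over all of SNOMED-CT rather than a short list.

```python
import re

# The six seed queries listed in the text.
SEED_QUERIES = [
    "breast cancer", "breast carcinoma", "breast neoplasm",
    "breast tumor", "ductal carcinoma", "mastectomy",
]


def words(text):
    """Lowercased set of alphanumeric words (tokenization is an assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_seed_query(label, queries=SEED_QUERIES):
    """True if the label contains all the words of at least one query."""
    label_words = words(label)
    return any(words(q) <= label_words for q in queries)


labels = [
    "Neoplasm of uncertain behavior of nipple of female breast",
    "pT2: Tumor > 2 cm but ≤ 5 cm in greatest dimension (breast)",
    "Metastatic malignant neoplasm to axillary tail of female breast",
    "Family history",  # included only as a non-matching example
]
found = [label for label in labels if matches_seed_query(label)]

# Recall relative to the seed queries, using the counts reported in the text.
recall = 24 / 355
```

The three example labels match because word order and intervening words are ignored, which is also why such long labels rarely appear verbatim in free text.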
One possibility is that our annotation technique was not good enough to annotate these concepts in the text corpus. To check this, we used a state-of-the-art tool called MetaMap, which is specialized for concept annotation in free text.³ We ran the MetaMap tool on the breast cancer corpus, and it managed to annotate only 7 of the concepts found by the seed queries, which is even fewer than our annotation found.⁴ Hence, not finding these concepts in the free text is not necessarily a weakness of our annotation method. Since annotation tools fail to find these concepts in free text, the concepts cannot be used in applications that require their automated discovery in text, such as highlighting important parts of a medical record. So, having missed these concepts is not a serious drawback of our method.

³ MetaMap is developed as part of the UMLS project: http://mmtx.nlm.nih.gov/
⁴ This comparison does not reflect the quality of the MetaMap tool, because it is a general-purpose annotation tool not tailored to the requirements of our study.
4 Evaluation Experiment II: SNOMED-CT Coverage of Breast Cancer
In the previous section we assessed the completeness of our method, i.e. whether we found all the breast cancer concepts that are in SNOMED-CT. In this section we assess the completeness of SNOMED-CT, i.e. whether it contains all the important breast cancer concepts. We now analyze the text corpus alone and construct a list of terms from the text that are important to breast cancer. If an important breast cancer term is not present in SNOMED-CT, we hope to find it in this list.

The method of Section 2.2 finds the breast cancer concepts by looking at which concepts are representative of the breast cancer part of the text corpus. Now we look at which terms are representative of it. We calculate the TFIDF importance of every term that occurs at least 10 times in the breast cancer corpus. The same setup as in Section 2.2 was used: the breast cancer documents are put together in a single document, the general medical documents are chopped into chunks of at most 50,000 characters, and when deciding whether two terms are equal we were as flexible as in Section 2.2: we ignore stopwords, word case and word order, and use stemming.

Results. This experiment produced 21,519 terms. Because of the size of this list, we restricted the evaluation to the top 1,000 terms. For each of these terms we searched SNOMED-CT for a concept that has it as a label, and for the terms not found we investigated why. If an important breast cancer term had no SNOMED-CT concept, we would expect to find it here. Among the first 1,000 terms we did not identify any meaningful term without an appropriate concept in SNOMED-CT. This means that SNOMED-CT is very complete in describing the most important terms used to communicate about breast cancer. Most of the terms not found in SNOMED-CT were not valid noun terms, for example breast and ovarian or early breast. There were other artifacts as well, like 2005, riskof or Clin Oncol, which might have occurred due to imperfections in the preparation of the test data, or simply because they are only used in informal communication and hence have no corresponding concepts in SNOMED-CT.
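The setup reused from Section 2.2 (fixed-size chunking and flexible term comparison) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the stopword list is a tiny hypothetical one, `crude_stem` is a toy stand-in for a real stemmer such as Porter's [7], and the chunking cuts at exact character offsets because the text does not say whether word boundaries were respected.

```python
import re

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"of", "the", "a", "an", "and", "in", "to"}


def chunk(text, size=50_000):
    """Split text into pieces of at most `size` characters.

    The shorter leftover at the end is kept as its own piece.
    """
    return [text[i:i + size] for i in range(0, len(text), size)]


def crude_stem(word):
    """Toy plural stripper; a stand-in for a real stemming algorithm."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(term):
    """Lowercase, drop stopwords, stem, and forget word order.

    Using a frozenset also ignores repeated words, which is a simplification.
    """
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    return frozenset(crude_stem(t) for t in tokens if t not in STOPWORDS)


def same_term(a, b):
    """True if two terms are equal under the flexible comparison."""
    return normalize(a) == normalize(b)


pieces = chunk("a" * 120_000)
equal = same_term("Malignant tumor of breast", "breast tumors, malignant")
```

With this normalization, terms that differ only in case, word order, stopwords or a plural ending are treated as the same term.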
5 Conclusions
We presented a novel method to automatically extract topic-based concepts from an ontology using a large text corpus. The method was applied to extract the breast cancer concepts from the SNOMED-CT ontology. It produced a ranked list of concepts with good enough quality to be applied in practice. We conducted three rounds of evaluation of the quality of the results. First, testing the precision showed that the concepts at the top of the list are all related to breast cancer. Second, evaluation experiment I showed that the recall of the method is low, but we concluded that this is not a serious drawback of the method. The missed concepts are complex and have limited usefulness in practice because even state-of-the-art tools cannot automatically find them in free text. Third, evaluation experiment II showed that SNOMED-CT is complete in covering the terminology used in breast cancer, and can be used in real clinical information systems designed to support the breast cancer care cycle.
References

1. Aizawa, A.: An information-theoretic perspective of tf-idf measures. Information Processing & Management 39(1), 45–65 (2003)
2. Aleksovski, Z., Vdovjak, R.: Overlap of selected ontologies in the context of the breast cancer domain. In: Proceedings of SIIM Annual Meeting (2009)
3. Clark, K., Parsia, B.: Modularity and owl. Literature survey (2008)
4. Cuenca Grau, B., Horrocks, I., Kazakov, Y., Sattler, U.: Just the right amount: extracting modules from ontologies. In: Proceedings of WWW, pp. 717–726 (2007)
5. Lawrie, D., Croft, W.B., Rosenberg, A.: Finding topic words for hierarchical summarization. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 349–357. ACM, New York (2001)
6. Milian, K., Aleksovski, Z., Vdovjak, R., ten Teije, A., van Harmelen, F.: Identifying disease-centric subdomains in very large medical ontologies: A case-study on breast cancer concepts in snomed ct. or: Finding 2500 out of 300.000. In: Riaño, D., ten Teije, A., Miksch, S., Peleg, M. (eds.) KR4HC 2009. LNCS, vol. 5943, pp. 50–63. Springer, Heidelberg (2010)
7. Porter, M.F.: An algorithm for suffix stripping, pp. 313–316. Morgan Kaufmann Publishers Inc., San Francisco (1997)
8. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5), 513–523 (1988)
9. Stearns, M.Q., Price, C., Spackman, K.A., Wang, A.Y.: Snomed clinical terms: overview of the development process and project status. In: Proceedings of the AMIA Symposium, p. 662. American Medical Informatics Association (2001)
Towards Knowledge Oriented Personal Health Systems

Juha Puustjärvi¹ and Leena Puustjärvi²

¹ Helsinki University of Technology, Box 9210, 02015 TKK, Finland
[email protected]
² The Pharmacy of Kaivopuisto, Neitsytpolku 10, Helsinki 00140, Finland
[email protected]

Abstract. Although current e-health tools such as personal health records have proven to be useful, they still have many shortcomings. In particular, they do not provide effective means for querying personal health information. They also fail in information therapy, i.e., in providing the right information to the right people at the right time. In addition, they are totally passive, although by automating the control of patients' health information we could significantly and cost-effectively improve the quality of patient-centered healthcare. In this paper, we describe a knowledge oriented active personal health system that we have designed and that does not suffer from these shortcomings. Its key components are the personal health ontology and the alerts. Through the ontology we can provide expressive data queries and avoid the problem of limited information supply, and through the alerts we can provide active elements for controlling the patient's health information. We present the ontology in a graphical form and in OWL, and give rules for transforming the ontology into the relational model. We also present how the alerts can be implemented by the triggers supported by relational database systems.

Keywords: E-health tools, Patient-centered healthcare, Personal health records, Information therapy, Ontologies, OWL, RDF, SOA, Relational database system, triggers.
M. Szomszor and P. Kostkova (Eds.): E-Health 2010, LNICST 69, pp. 35–43, 2011. © Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2011

1 Introduction

Patient-centered healthcare is a widely studied (e.g., in [1, 2, 3, 4]) emerging e-health model that contributes to preventive medical care. It optimizes the healthcare system to focus on patient experience and outcomes for better health and well-being [5]. A key point in patient-centered healthcare is that patients and their families have the ability to obtain and understand health information and services, and to make appropriate health decisions [6]. This requires that a patient's health information, as well as other relevant medical information, is presented in an appropriate format according to the individual's understanding and abilities.

Our argument is that current XML-based e-health tools developed for managing personal health information do not satisfy the requirements of patient-centered healthcare well enough; instead, knowledge oriented technologies are required for controlling and representing personal health information. In particular, we have discovered that existing personal health records (PHRs) [7] and e-health tools suffer from at least the following three weaknesses.

First, the data of XML-based PHRs is document-centric, i.e., PHRs are collections of documents, such as documents describing lab tests, prescribed medications and illnesses. By contrast, PHR usage is often data-centric, meaning that data must be extracted from various documents and then integrated according to certain criteria. For example, a patient may want to know the average blood pressure and/or blood sugar concentration (glucose level) during the periods in which he or she was using a blood pressure drug, or the cholesterol values while he or she was on a diet. Unfortunately, the computation required by such queries is not provided by the query languages (e.g., XPath [8] and XQuery [9]) that are designed to address XML documents.

Second, current e-health tools suffer from limited information supply. Our argument is that a personal health system should support information therapy (Ix) [10], information based medicine [11], and the management of user-specific physical exercises, as these provide the information that patients need to make appropriate health decisions. This is important because many studies have indicated that most patients are not satisfied with the medical treatment information on the Web, even though many e-health tools provide links to materials or other websites that have information about the patient's health conditions or medications [11, 12]. In particular, patients have regarded many sites as overly commercial, or could not determine the source of the information [13].

Third, existing PHRs are passive in the sense that they do not contain any active elements. By an active element we mean an expression or statement that is stored in the PHR (or in any e-health tool) and is expected to execute at appropriate times. The time of action might be the occurrence of a certain event, such as the insertion of a blood test result. Depending on the inserted values, an action can then be taken, such as generating an email to the patient's personal physician.

In this paper, we describe our work on developing a knowledge oriented personal health system that does not suffer from these three shortcomings. Its key components are the personal health ontology and the alerts. Through the ontology we can provide data-centric queries and automate information therapy, and through the alerts we can configure appropriate active elements for each patient. Further, because the features of many e-health tools are captured in one system, a user does not have to use a variety of e-health tools (which usually have their own heterogeneous interfaces). In addition, through the shared ontology we can achieve synergy in developing more sophisticated services for the patient and avoid the problems of replicated data.

The rest of the paper is organized as follows. In Section 2, we describe the personal health ontology. We first characterize the nature of ontologies and present the core components of the personal health ontology in a graphical form. Then we present the personal health ontology in OWL and give examples of querying the ontology with the RQL query language. In Section 3, we present the architecture of our personal health system. In particular, we describe how the personal health ontology can be transformed into the relational model, and how the alerts can be implemented by the triggers that are supported by relational database systems. Finally, Section 5 concludes the paper by discussing the advantages and limitations of our solutions.
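The data-centric query and the active element described in this introduction can be illustrated with an ordinary relational trigger. The sketch below is only an illustration under assumptions: the table and column names, the threshold of 140, the sample rows and the use of SQLite through Python's standard `sqlite3` module are all hypothetical and are not the schema of the system described in this paper. Instead of sending an email, the trigger writes a row to a hypothetical outbox table that a mail component could process.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blood_pressure_test (
    patient_id INTEGER, test_date TEXT, systolic INTEGER, diastolic INTEGER
);
CREATE TABLE medication_period (
    patient_id INTEGER, drug TEXT, start_date TEXT, end_date TEXT
);
CREATE TABLE alert_outbox (
    patient_id INTEGER, test_date TEXT, message TEXT
);
-- Active element: checked on every inserted blood pressure result,
-- acts only when the (hypothetical) threshold is reached.
CREATE TRIGGER high_bp_alert
AFTER INSERT ON blood_pressure_test
WHEN NEW.systolic >= 140
BEGIN
    INSERT INTO alert_outbox
    VALUES (NEW.patient_id, NEW.test_date,
            'Systolic pressure ' || NEW.systolic || ' reported');
END;
""")

# Hypothetical sample data for one patient.
conn.execute(
    "INSERT INTO medication_period VALUES (1, 'drug A', '2010-01-01', '2010-03-31')"
)
conn.executemany(
    "INSERT INTO blood_pressure_test VALUES (?, ?, ?, ?)",
    [
        (1, "2010-01-15", 150, 95),  # during medication, fires the trigger
        (1, "2010-02-15", 130, 85),  # during medication, no alert
        (1, "2010-05-01", 120, 80),  # after the medication period
    ],
)

# Rows produced by the trigger.
alerts = conn.execute("SELECT message FROM alert_outbox").fetchall()

# Data-centric query: average systolic pressure while the drug was in use.
avg_systolic = conn.execute("""
    SELECT AVG(t.systolic)
    FROM blood_pressure_test AS t
    JOIN medication_period AS m
      ON m.patient_id = t.patient_id
     AND t.test_date BETWEEN m.start_date AND m.end_date
    WHERE m.drug = 'drug A'
""").fetchone()[0]
```

Storing dates as ISO 8601 strings lets the `BETWEEN` comparison work lexicographically; a production schema would more likely use proper date types and keys.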
2 Personal Health Ontology

Originally, ontology is the philosophical study of the nature of being, existence or reality in general, as well as of the basic categories of being and their relations [14]. In the context of computer science, an ontology is a general vocabulary of a certain domain, and it can be defined as "an explicit specification of a conceptualization" [15]. It tries to characterize the meaning of the domain in terms of concepts and their relationships. It is typically represented as classes, properties, attributes and values. As an example, consider the subset of the personal health ontology presented in Fig. 1.
[Figure: ontology diagram. Labels recoverable from the diagram include Person, Patient, Physician, Alert, Allergy, Vaccination, MedicalTest, BloodPressureTest, Disease, Medication, Product, Blog, BlogItem, InformationEntity, InformationSource, SubclassOf, SuffersFrom, Uses, CaresFor, VaccinatedBy, Activated, Father, Mother and Hobby.]

Fig. 1. A subset of the personal health ontology in a graphical form
In this graphical representation ellipses represent classes and subclasses, and rectangles represent datatype and object properties. Classes, subclasses, datatype properties and object properties are modeling primitives in OWL (Web Ontology Language) [16]. Object properties (e.g., SuffersFrom) relate objects to other objects, while datatype properties (e.g., Hobby) relate objects to datatype values. In Fig. 1 we have presented only a few of the objects' datatype properties.

The personal health ontology comprises the vocabulary that the patient can use in describing his or her personal health information. For example, the object properties Father and Mother are included in the vocabulary, but the patient does not have to give values for these properties. Note also that the property Activated connects the classes Patient and Alert. By giving values for the datatype properties of the class Alert, the patient indicates which alerts he or she has activated. As we will see in the next section, active elements query these values and function accordingly. A subset of the graphical ontology of Fig. 1 is presented in OWL in Fig. 2.