Challenges And Opportunities of Healthgrids: Proceedings of Healthgrid 2006 (Studies in Health Technology and Informatics)


Author: Vicente Hernandez | Ignacio Blanquer | Tony Solomonides | Vincent Breton and Yannick Legre




CHALLENGES AND OPPORTUNITIES OF HEALTHGRIDS

Studies in Health Technology and Informatics

This book series was started in 1990 to promote research conducted under the auspices of the EC programmes’ Advanced Informatics in Medicine (AIM) and Biomedical and Health Research (BHR) bioengineering branch. A driving aspect of international health informatics is that telecommunication technology, rehabilitative technology, intelligent home technology and many other components are moving together and form one integrated world of information and communication media. The complete series has been accepted in Medline. Volumes from 2005 onwards are available online.

Series Editors: Dr. J.P. Christensen, Prof. G. de Moor, Prof. A. Famili, Prof. A. Hasman, Prof. L. Hunter, Dr. I. Iakovidis, Dr. Z. Kolitsi, Mr. O. Le Dour, Dr. A. Lymberis, Prof. P.F. Niederer, Prof. A. Pedotti, Prof. O. Rienhoff, Prof. F.H. Roger France, Dr. N. Rossing, Prof. N. Saranummi, Dr. E.R. Siegel, Dr. P. Wilson, Prof. E.J.S. Hovenga, Prof. M.A. Musen and Prof. J. Mantas

Volume 120

Recently published in this series

Vol. 119. J.D. Westwood, R.S. Haluck, H.M. Hoffman, G.T. Mogel, R. Phillips, R.A. Robb and K.G. Vosburgh (Eds.), Medicine Meets Virtual Reality 14 – Accelerating Change in Healthcare: Next Medical Toolkit
Vol. 118. R.G. Bushko (Ed.), Future of Intelligent and Extelligent Health Environment
Vol. 117. C.D. Nugent, P.J. McCullagh, E.T. McAdams and A. Lymberis (Eds.), Personalised Health Management Systems – The Integration of Innovative Sensing, Textile, Information and Communication Technologies
Vol. 116. R. Engelbrecht, A. Geissbuhler, C. Lovis and G. Mihalas (Eds.), Connecting Medical Informatics and Bio-Informatics – Proceedings of MIE2005
Vol. 115. N. Saranummi, D. Piggott, D.G. Katehakis, M. Tsiknakis and K. Bernstein (Eds.), Regional Health Economies and ICT Services
Vol. 114. L. Bos, S. Laxminarayan and A. Marsh (Eds.), Medical and Care Compunetics 2
Vol. 113. J.S. Suri, C. Yuan, D.L. Wilson and S. Laxminarayan (Eds.), Plaque Imaging: Pixel to Molecular Level
Vol. 112. T. Solomonides, R. McClatchey, V. Breton, Y. Legré and S. Nørager (Eds.), From Grid to Healthgrid
Vol. 111. J.D. Westwood, R.S. Haluck, H.M. Hoffman, G.T. Mogel, R. Phillips, R.A. Robb and K.G. Vosburgh (Eds.), Medicine Meets Virtual Reality 13
Vol. 110. F.H. Roger France, E. De Clercq, G. De Moor and J. van der Lei (Eds.), Health Continuum and Data Exchange in Belgium and in the Netherlands – Proceedings of Medical Informatics Congress (MIC 2004) & 5th Belgian e-Health Conference

ISSN 0926-9630

Challenges and Opportunities of HealthGrids Proceedings of Healthgrid 2006

Edited by

Vicente Hernández and Ignacio Blanquer

With Tony Solomonides, Vincent Breton and Yannick Legré

Amsterdam • Berlin • Oxford • Tokyo • Washington, DC

© 2006 The authors. All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without prior written permission from the publisher. ISBN 1-58603-617-3 Library of Congress Control Number: 2006925644 Publisher IOS Press Nieuwe Hemweg 6B 1013 BG Amsterdam Netherlands fax: +31 20 687 0019 e-mail: [email protected] Distributor in the UK and Ireland Gazelle Books Services Ltd. White Cross Mills Hightown Lancaster LA1 4XS United Kingdom fax: +44 1524 63232 e-mail: [email protected]

Distributor in the USA and Canada IOS Press, Inc. 4502 Rachael Manor Drive Fairfax, VA 22032 USA fax: +1 703 323 3668 e-mail: [email protected]

LEGAL NOTICE The publisher is not responsible for the use which might be made of the following information. PRINTED IN THE NETHERLANDS

Challenges and Opportunities of HealthGrids V. Hernández et al. (Eds.) IOS Press, 2006 © 2006 The authors. All rights reserved.


Introduction

HealthGrid 2006 (http://valencia2006.healthgrid.org) is the fourth edition of this open forum for the integration of Grid technologies and their applications in the biomedical, medical and biological domains, intended to pave the way to an international research area in HealthGrid. The main objective of the HealthGrid conference and the HealthGrid Association is the exchange and discussion of ideas, technologies, solutions and requirements of interest to the Grid and life-sciences communities, in order to foster the integration of Grids into health. Grid middleware and Grid application developers, biomedical and health informatics users, and security and policy makers are encouraged to participate in a set of multidisciplinary sessions with a common focus on applications to health.

HealthGrid conferences have been organized on an annual basis. The first conference, held in 2003 in Lyon (http://lyon2003.healthgrid.org), reflected the need to involve all actors (physicians, scientists and technologists) who might play a part in the application of Grid technology to health, whether health care or biomedical research. The second conference, held in Clermont-Ferrand in January 2004 (http://clermont2004.healthgrid.org), reported research and work in progress from a large number of projects. The third conference, held in Oxford (http://oxford2005.healthgrid.org), was mainly concerned with results and deployment strategies in healthcare. Finally, this edition aims at consolidating the collaboration among biologists, healthcare professionals and Grid technology experts.

The conference includes a number of high-profile keynote presentations complemented by a set of high-quality refereed papers. The number of contributions has increased with respect to previous editions, reaching 44 submissions of papers and demos from principal authors in 14 countries (in order of number of contributions: France, United Kingdom, Spain, Italy, Germany, Greece, The Netherlands, Belgium, Czech Republic, Cuba, Japan, Romania, Russia and Taiwan). Considering the affiliations of all the authors of the papers, the number of contributing countries rises to 18, adding Switzerland, Austria, Turkey and the USA.

The contributions to this edition fall mainly into five topics: Medical Imaging on the Grid; Ethical, Legal and Privacy Issues on HealthGrids; Bioinformatics on the Grid; Knowledge Discovery on HealthGrids; and Medical Assessment and HealthGrid Applications. The maturity of the discipline of HealthGrids is clearly reflected in these subjects. A larger share of the contributions relates to two main application areas (Medical Imaging and Bioinformatics), confirming the analysis of the HealthGrid White Paper published last year, which identified them as the two most promising areas for HealthGrids. Along with these two areas, the assessment of the results of HealthGrid applications, which is also the focus of several contributions, further indicates the maturity of HealthGrids. Finally, the other two areas (Knowledge Discovery and Ethical, Legal and Privacy Issues) focus on basic technologies that are very relevant for HealthGrids.

In Medical Imaging, the contributions cover the problems of medical image processing and virtual distributed storage. In this topic there are contributions


focusing on the structuring of medical information through semantic classifications, as in the case of the NeuroBase project presented by Barillot et al. or the TRENCADIS software architecture presented by Blanquer et al. The problem of encryption and data sharing is a very important topic addressed in contributions such as the Medical Data Manager (Montagnat et al.) and in other contributions related to privacy. In the area of medical image processing, several papers describe experiences in providing services for neuroimaging. The work of Olabarriaga et al. covers image processing services for fMRI (functional Magnetic Resonance Imaging); Bagnasco et al. describe the application of the Grid to the early diagnosis of Alzheimer’s disease, assisting diagnosis on PET/SPECT through Statistical Parametric Mapping; and Bucur et al. address the computationally intensive problem of Fibre Tracking. In the area of modelling processes related to medical images, Bellet et al. propose a web interface for the simulation of MRI devices, and Blanquer et al. propose a Grid implementation of processing services for the co-registration of medical images to support a quantitative diagnosis of liver cancer. On this precise topic of image co-registration, Montagnat et al. propose a mechanism to evaluate the quality of co-registration methods using the Grid, in a methodology called “Bronze Standard”. Finally, the problem of the interactive use of Grids for medical image processing is tackled in the work of Germain-Renaud et al.

In the area of Ethical, Legal and Privacy Issues on HealthGrids, on the one hand, contributions focus on ethical and legal issues, such as the problem of medical consent (Herveg et al.) and the organisation of Virtual Organisations for clinical trials in epidemiology (Sinnott et al.). On the other hand, different technical solutions for privacy enhancement are presented. In the work of Torres et al., a solution for sharing an encrypted and distributed storage of medical images is presented. A similar approach is used by Blanchet et al. to propose a mechanism for encrypting genetic information. Other approaches for sharing and linking distributed repositories of epidemiological data are presented in Ainsworth et al. and Tashiro et al.

The area of Bioinformatics is a very active one in HealthGrids. The increasing size and complexity of genomic databases and protein modelling are opening the door to new Grid applications. Results in large-scale in-silico docking for malaria are presented in Jacq et al., and a grid-enabled protein structure prediction system named Rokky-G is presented in the work of Masuda et al. Another important activity in this topic is the integration of bioinformatics information, where the complexity of browsing data is also considered by Schroeder et al. in the frame of the Sealife project. Another approach, based on data mediation, is presented by Colonna et al. for the discovery of predisposition genes. The integration of OGSA-DAI technologies for distributed biochemical data is proposed in Tverdokhlebov et al. The development of genomic processing services and their interfacing to the Grid is presented in the porting of the GPS@ portal (Blanchet et al.) and in the work of Segrelles et al., in which an MPIBlast processing Grid service is developed and integrated in a gene annotation tool (Blast2GO). The early results of the BIOINFOGRID project are presented in the work of Milanesi et al. More consolidated results on bioprofiling are presented in the work of Sun et al. in the frame of the BIOPATTERN project. Finally, an application of HealthGrid to SARS is described in the work of Hung et al.

In the area of Knowledge Discovery on HealthGrids, contributions focus on the semantic integration of medical information. The work of Boniface et al., in the frame of the ARTEMIS project, focuses on healthcare data, whereas the work of Koutkias et al.


focuses on the semantic integration of bioinformatics data. Semantic integration is the key to knowledge discovery in large databases, to which techniques such as data mining are applied. Tsiknakis et al. propose the use of these techniques for cancer studies in the ACGT IP, and McClatchey et al. apply them to the integration of paediatric information in the frame of the Health-e-child project.

The area of Medical Assessment and HealthGrid Applications covers, on the one hand, medical results of the application of Grid technologies to health and, on the other hand, applications related to biomedical simulation and clinical environments. The application of Grids to radiotherapy is also a classic topic owing to the maturity of High Energy Physics, revealing new applications of Monte Carlo simulation to Intensity-Modulated Radiation Therapy (Gómez et al.) and interfacing to well-known environments such as GATE (Thiam et al.). Other applications of P2P and Grid technologies show their potential for emergency management (Harrison et al.) and collaborative environments (Kuba et al.). Finally, contributions also focus on the needs of hospital management systems for Grids (Graschew et al.), the success stories of the e-DiaMoND and NeuroGrid projects (Ure et al.) and the exploitation of successful projects on medical imaging and Grids, such as the MAMMOGRID project (del Frate et al.).

ACKNOWLEDGEMENTS

The editors would like to express their gratitude to the Programme Committee and the reviewers; each paper was read by at least two reviewers, including the editors. The editors also wish to thank the staff of the HealthGrid association, especially Yannick Legré, for the remarkable work they have invested in these conference proceedings and in the organisation of the conference. Opinions expressed in these proceedings are those of individual authors and editors, and not necessarily those of their institutions.


Healthgrid 2006 Programme Committee

Vicente Hernández, Universidad Politécnica de Valencia, Spain
Ignacio Blanquer, Universidad Politécnica de Valencia, Spain
Vincent Breton, Centre National de la Recherche Scientifique, France
Jose Maria Carazo, Centro Nacional de Biotecnología, Spain
Andres Santos, Universidad Politécnica de Madrid, Spain
Antonio Sousa, Instituto de Engenharia Electrónica e Telemática de Aveiro, Portugal
Armando Padhila, Faculdade de Engenharia da Universidade do Porto, Portugal
Emmanuelle Ifeachor, University of Plymouth, United Kingdom
Ferran Sanz, Universitat Pompeu Fabra, Spain
Alfonso Jaramillo, École Polytechnique, France
Fabrizio Gagliardi, Microsoft, Switzerland
Carlos Martinez, Generalitat Valenciana, Spain
Johan Montagnat, Institut National de Recherche en Informatique et Automatique, France
Tony Solomonides, University of the West of England, United Kingdom
Richard McClatchey, University of the West of England, United Kingdom
Martin Hofmann, Fraunhofer Institut für Algorithmen und Wissenschaftliches Rechnen SCAI, Germany
Howard Bilofsky, University of Pennsylvania, USA
Petra Wilson, CISCO, Belgium
Simon Robinson, Empirica GmbH, Germany
Paulo Bisch, Universidade Federal do Rio de Janeiro, Brasil
Luis Núñez de Villavicencio, Universidad de los Andes, Venezuela
Chun-Hsi Huang, University of Connecticut, USA
Mary Kratz, University of Michigan, USA


Contents

Introduction
Healthgrid 2006 Programme Committee

Part I. Medical Imaging on the Grid

Federating Distributed and Heterogeneous Information Sources in Neuroimaging: The NeuroBase Project
C. Barillot, H. Benali, M. Dojat, A. Gaignard, B. Gibaud, S. Kinkingnéhun, J.-P. Matsumoto, M. Pélégrini-Issac, E. Simon and L. Temal

Bridging Clinical Information Systems and Grid Middleware: A Medical Data Manager
Johan Montagnat, Daniel Jouvenot, Christophe Pera, Ákos Frohner, Peter Kunszt, Birger Koblitz, Nuno Santos and Cal Loomis

Grid Scheduling for Interactive Analysis
Cécile Germain-Renaud, Romain Texier, Angel Osorio and Charles Loomis

Magnetic Resonance Imaging (MRI) Simulation on EGEE Grid Architecture: A Web Portal Design
F. Bellet, I. Nistoreanu, C. Pera and H. Benoit-Cattin

Towards a Virtual Laboratory for fMRI Data Management and Analysis
Silvia D. Olabarriaga, Aart J. Nederveen, Jeroen G. Snel and Robert G. Belleman

Service-Oriented Architecture for Grid-Enabling Medical Applications
Anca Bucur, René Kootstra, Jasper van Leeuwen and Henk Obbink

Early Diagnosis of Alzheimer’s Disease Using a Grid Implementation of Statistical Parametric Mapping Analysis
S. Bagnasco, F. Beltrame, B. Canesi, I. Castiglioni, P. Cerello, S.C. Cheran, M.C. Gilardi, E. Lopez Torres, E. Molinari, A. Schenone and L. Torterolo

Using the Grid to Analyze the Pharmacokinetic Modelling After Contrast Administration in Dynamic MRI
Ignacio Blanquer, Vicente Hernández, Daniel Monleón, José Carbonell, David Moratal, Bernardo Celda, Montse Robles and Luis Martí-Bonmatí

Medical Image Registration Algorithms Assessment: Bronze Standard Application Enactment on Grids Using the MOTEUR Workflow Engine
Tristan Glatard, Johan Montagnat and Xavier Pennec


Part II. Ethical, Legal and Privacy Issues on HealthGrids

The Ban on Processing Medical Data in European Law: Consent and Alternative Solutions to Legitimate Processing of Medical Data in HealthGrid
Jean Herveg

Development of Grid Frameworks for Clinical Trials and Epidemiological Studies
Richard Sinnott, Anthony Stell and Oluwafemi Ajayi

Privacy Protection in HealthGrid: Distributing Encryption Management over the VO
Erik Torres, Carlos de Alfonso, Ignacio Blanquer and Vicente Hernández

Secured Distributed Service to Manage Biological Data on EGEE Grid
Christophe Blanchet, Rémi Mollon and Gilbert Deléage

Part III. Bioinformatics on the Grid

Demonstration of In Silico Docking at a Large Scale on Grid Infrastructure
Nicolas Jacq, Jean Salzemann, Yannick Legré, Matthieu Reichstadt, Florence Jacq, Marc Zimmermann, Astrid Maaß, Mahendrakar Sridhar, Kasam Vinod-Kusam, Horst Schwichtenberg, Martin Hofmann and Vincent Breton

A Gridified Protein Structure Prediction System “Rokky-G” and Its Implementation Issues
Shingo Masuda, Minoru Ikebe, Kazutoshi Fujikawa and Hideki Sunahara

Sealife: A Semantic Grid Browser for the Life Sciences Applied to the Study of Infectious Diseases
Michael Schroeder, Albert Burger, Patty Kostkova, Robert Stevens, Bianca Habermann and Rose Dieng-Kuntz

Advancing of Russian ChemBioGrid by Bringing Data Management Tools into Collaborative Environment
Alexey Zhuchkov, Nikolay Tverdokhlebov and Alexander Kravchenko

GPS@ Bioinformatics Portal: From Network to EGEE Grid
Christophe Blanchet, Vincent Lefort, Christophe Combet and Gilbert Deléage

Blast2GO Goes Grid: Developing a Grid-Enabled Prototype for Functional Genomics Analysis
G. Aparicio, S. Götz, A. Conesa, D. Segrelles, I. Blanquer, J.M. García, V. Hernández, M. Robles and M. Talon

Bioprofiling over Grid for eHealthcare
L. Sun, P. Hu, C. Goh, B. Hamadicharef, E. Ifeachor, I. Barbounakis, M. Zervakis, N. Nurminen, A. Varri, R. Fontanelli, S. Di Bona, D. Guerri, S. La Manna, K. Cerbioni, E. Palanca and A. Starita

SARS Grid – An AG-Based Disease Management and Collaborative Platform
Shu-Hui Hung, Tsung-Chieh Hung and Jer-Nan Juang


Part IV. Knowledge Discovery on HealthGrids

A Secure Semantic Interoperability Infrastructure for Inter-Enterprise Sharing of Electronic Healthcare Records
Mike Boniface, E. Rowland Watkins, Ahmed Saleh, Asuman Dogac and Marco Eichelberg

Constructing a Semantically Enriched Biomedical Service Space: A Paradigm with Bioinformatics Resources
Vassilis Koutkias, Andigoni Malousi, Ioanna Chouvarda and Nicos Maglaveras

Building a European Biomedical Grid on Cancer: The ACGT Integrated Project
M. Tsiknakis, D. Kafetzopoulos, G. Potamias, A. Analyti, K. Marias and A. Manganas

Health-e-Child: An Integrated Biomedical Platform for Grid-Based Paediatric Applications
Joerg Freund, Dorin Comaniciu, Yannis Ioannis, Peiya Liu, Richard McClatchey, Edwin Morley-Fletcher, Xavier Pennec, Giacomo Pongiglione and Xiang (Sean) Zhou

Part V. Medical Assessment and HealthGrid Applications

Grid Empowered Sharing of Medical Expertise
Martin Kuba, Ondřej Krajíček, Petr Lesný, Jan Vejvalka and Tomáš Holeček

Mobile Peer-to-Grid Architecture for Paramedical Emergency Operations
Andrew Harrison, Ian Kelley, Emil Mieilica, Adina Riposan and Ian Taylor

Virtual Hospital and Digital Medicine – Why Is the GRID Needed?
Georgi Graschew, Theo A. Roelofs, Stefan Rakowsky, Peter M. Schlag, Paul Heinzlreiter, Dieter Kranzlmüller and Jens Volkert

Final Results and Exploitation Plans for MammoGrid
Chiara del Frate, Jose Galvez, Tamas Hauer, David Manset, Richard McClatchey, Mohammed Odeh, Dmitry Rogulin, Tony Solomonides and Ruth Warren

Part VI. Posters and Short Contributions

Proposing a Roadmap for HealthGrids
Vincent Breton, Ignacio Blanquer, Vicente Hernández, Yannick Legré and Tony Solomonides

Remote Radiotherapy Planning: The eIMRT Project
Andrés Gómez, Carlos Fernández Sánchez, José Carlos Mouriño Gallego, Francisco J. González Castaño, Daniel Rodríguez-Silva, Javier Pena García, Faustino Gómez Rodríguez, Diego González Castaño and Miguel Pombar Cameán

Designing for e-Health: Recurring Scenarios in Developing Grid-Based Medical Imaging Systems
John Geddes, Clare Mackay, Sharon Lloyd, Andrew Simpson, David Power, Douglas Russell, Marina Jirotka, Mila Katzarova, Martin Rossor, Nick Fox, Jonathon Fletcher, Derek Hill, Kate McLeish, Yu Chen, Joseph V. Hajnal, Stephen Lawrie, Dominic Job, Andrew McIntosh, Joanna Wardlaw, Peter Sandercock, Jeb Palmer, Dave Perry, Rob Procter, Jenny Ure, Mark Hartswood, Roger Slack, Alex Voss, Kate Ho, Philip Bath, Wim Clarke and Graham Watson

Design and Implementation of Security in a Data Collection System for Epidemiology
John Ainsworth, Robert Harper, Ismael Juma and Iain Buchan

Architecture of Authorization Mechanism for Medical Data Sharing on the Grid
Takahito Tashiro, Susume Date, Singo Takeda, Ichiro Hasegawa and Shinji Shimojo

Database Integration for Predisposition Genes Discovery
François-Marie Colonna, Yacine Sam and Omar Boucelma

High Performance GRID Based Implementation for Genomics and Protein Analysis
L. Milanesi and I. Merelli

TRENCADIS – A WSRF Grid MiddleWare for Managing DICOM Structured Reporting Objects
Ignacio Blanquer, Vicente Hernández and Damià Segrelles

GATE Simulation for Medical Physics with Genius Web Portal
C.O. Thiam, L. Maigne, V. Breton, D. Donnarieix, R. Barbera and A. Falzone

Biomedical Applications in EELA
Miguel Cardenas, Vicente Hernández, Rafael Mayo, Ignacio Blanquer, Javier Perez-Griffo, Raul Isea, Luis Nuñez, Henry Ricardo Mora and Manuel Fernández

Outlook for Grid Service Technologies Within the @neurIST eHealth Environment
A. Arbona, S. Benkner, J. Fingberg, A.F. Frangi, M. Hofmann, D.R. Hose, G. Lonsdale, D. Ruefenacht and M. Viceconti

Author Index

Part I Medical Imaging on the Grid


Federating Distributed and Heterogeneous Information Sources in Neuroimaging: The NeuroBase Project

C. Barillot (a), H. Benali (b), M. Dojat (c), A. Gaignard (a), B. Gibaud (a), S. Kinkingnéhun (b), J.-P. Matsumoto (d), M. Pélégrini-Issac (b), E. Simon (d), L. Temal (a)

(a) Visages U746, INSERM-INRIA-CNRS-Univ-Rennes1, IRISA, Rennes, France
(b) IFR 49, CHR La Pitié Salpetrière/CEA-SHFJ, Paris, Orsay, France
(c) Unité INSERM U594, Grenoble, France
(d) Business Objects/Médience, Levallois-Perret, France

Abstract. The NeuroBase project aims at studying the requirements for federating, through the Internet, information sources in neuroimaging. These sources are distributed across different experimental sites, hospitals or research centers in cognitive neurosciences, and contain heterogeneous data and image processing programs. More precisely, this project consists of creating a shared ontology, suitable for supporting various neuroimaging applications, and a computer architecture for accessing and sharing relevant distributed information. We briefly describe the semantic model and report in more detail the architecture we chose, based on a mediator/wrapper approach. To give a flavor of the future deployment of our architecture, we describe a demonstrator that implements the comparison of distributed image processing tools applied to distributed neuroimaging data.

Keywords. Medical Image Databases, Mediation Systems, Mediator/Wrappers, Neuroimaging, Semantic Web, Medical Ontology

1. Introduction

One objective of neuroscientists is the construction of functional cerebral maps under normal and pathological conditions. Research is currently being carried out to find correlations between anatomical structures, essentially sulci and gyri, where neuronal activation takes place, and cerebral functions, as assessed by recordings obtained by means of various neuroimaging modalities, such as PET (Positron Emission Tomography), fMRI (functional Magnetic Resonance Imaging), EEG (ElectroEncephaloGraphy) and MEG (Magneto-EncephaloGraphy). Building such correlation maps requires the development of sophisticated image processing techniques, such as segmentation and modeling of anatomical structures, registration and multi-modality fusion, and specific methods for longitudinal data analysis.

Two of the major concerns of researchers and clinicians involved in neuroimaging experiments are, on the one hand, to manage internally the huge quantity of data produced (1 Gb per subject) and, on the other hand, to be able to compare their experiments and the programs they develop with those existing in other centers or with those described in publications. Furthermore, and this is particularly true for medium-size centers (with limited staff capabilities), or even small ones (as is mostly the case in clinical centers), researchers and clinicians have great difficulty setting up large-scale experiments, mainly because of the lack of manpower and of capacity to recruit subjects. Besides, the statistical validity of the results is sometimes insufficient (the rate of "false negatives" is probably not negligible).

For all these reasons, we believe that pooling experimental results, through a network of collaborating centers, will widen the scientific achievement of the experimental studies conducted. Through distributed neuroimaging databases, the search for similar results, the search for images containing singularities, or transverse searches via data mining techniques could highlight possible regularities. Moreover, this will also broaden the panel of people involved in neuroimaging studies, while protecting the excellence of the supplied work.

In this context, NeuroBase is a cooperative project aimed at establishing the conditions that allow, through the Internet, the federation of distributed information sources in neuroimaging, these sources being located in various experimental centers, clinical neurology departments, or research centers in cognitive neurosciences. This requires that users can disseminate, exchange or reach neuroimaging information with appropriate access means, so as to retrieve information almost as easily as if it were stored locally.

1.1. Background

Owing to the explosion of data generated by the neurosciences community, the need for innovative techniques for data and knowledge sharing and reuse became imperative in the early 1990s [1,2]. This led to the launch of the ambitious North American "Human Brain Mapping" project. An objective recently added to this project is the development of data analysis and data processing software that operates on various data repository systems for data mining and knowledge discovery purposes.
In parallel, the development of web applications has stimulated researchers' interest in distributed databases and information sharing. Four research topics are particularly relevant to our project:

1. Digital and probabilistic atlases of the brain. To gather and share neuroimaging information in a common reference space, various research efforts are under way to construct digital atlases: based on the labeling of post-mortem brains to quantify individual anatomical variability of cortical regions [3], for the anatomy and brain functions of rats [4] or of the primate visual system [5], or to associate symbolic data and graphical data about the nervous system [6]. Some atlases are developed to support interpretation of functional data [7], image processing instantiation in a specific context [8] or training [9]. For probabilistic atlases, some 300 MRI brain scans plus post-mortem data of 30 subjects have been combined in a common reference frame by the International Consortium for Brain Mapping [10]. Several image processing tools have been added to allow segmentation and mapping of brain images to this brain reference.

2. Design of image processing tools. The BRAID project (BRAin Image Database, http://braid.uphs.upenn.edu/websbia/braid/) at Hopkins University is relevant here. It explores the anatomy-function relationship based on activation-response experiments and deficit-lesion analysis. The proposed system integrates mechanisms for complex queries, combining selection with multiple criteria, image quantification, and statistical tests to calculate correlations between deficits and lesions. Group studies rely on matching all brains to a target (reference) by means of linear or non-linear (deformable elastic model) matching methods, each with its own pros and cons. Several participants in this project have well-known experience in the design of such robust image processing tools.

3. Multi-center databases. Several laboratories belonging to the Illinois University participate in building a shared database devoted to recordings of neuronal patterns. This work, oriented towards animal recordings, is close to our project. The database is used, for instance, to find time series specific to neuron populations under various stimulus conditions. A common data model has been developed to organize the experimental data. An atlas is available to enter, search and analyze heterogeneous data in a common reference frame. Ontology sharing and facilities for updating data schemata have also been explored in the context of cooperative federated databases [11].

4. Infrastructures for sharing data and processing tools. Several projects, such as IXI [12] or Mammogrid [13], explore how grid technology can be applied to the field of medical image analysis by using large collections of computer resources to facilitate and scale processing across sites. The proposed architectures allow image processing algorithms to be exposed as Grid services, with the ability to compose these services into complex workflows executed across distributed resources. The notion of pipelines for the sequencing of image processing algorithms is also present in the LONI [14] and BrainVISA [15] frameworks.

2. The NeuroBase Approach

Instead of gathering all data in a central database [16], NeuroBase promotes a federated system for the management of distributed and heterogeneous sources of information. The goal of the system is to allow the sharing of two types of information: on the one hand, neuroimaging data, typically results from neuroimaging experiments; on the other hand, data processing programs, typically image processing programs or statistical tools, which are applied to the data available in the distributed system. Data can be stored in relational databases or simply in local files (wrappers will find their own way to the information). Image processing programs are modeled as dataflows. A dataflow specifies the inputs and the parameters required to complete a given processing method, and the outputs of this procedure.

One of the most important aspects of this project is therefore to identify the main concepts shared by the different information centers in order to define a common semantic model to which every site can subscribe (see Figure 1). From this baseline, each site participating in the federated system can map its own concepts, data, image processing programs and ontology to this semantic reference [11]. For this purpose, we rely on a mediator/wrapper approach [17], in which the integration of both (i) anatomical and functional images and related data (e.g. experimental protocols or subjects, pathology) and (ii) image processing programs that can be applied to the images (e.g. segmentation, registration, statistical analysis, …) can be expressed.
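To make the notion of a dataflow more concrete, the following is a minimal sketch in Python, not the NeuroBase implementation. It assumes a dataflow can be described by the names of its inputs, its parameters with default values, and the names of its outputs; the `Dataflow` class and the toy `threshold_segmentation` program are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Dataflow:
    """Hypothetical description of one processing method: inputs, parameters, outputs."""
    name: str
    inputs: List[str]                   # names of the required input data
    parameters: Dict[str, Any]          # parameter names with default values
    outputs: List[str]                  # names of the results the program must produce
    run: Callable[..., Dict[str, Any]]  # the wrapped processing program

    def execute(self, data: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
        # Check that every declared input is available before running.
        missing = [n for n in self.inputs if n not in data]
        if missing:
            raise ValueError(f"missing inputs: {missing}")
        # Only declared parameters may be overridden.
        unknown = [k for k in overrides if k not in self.parameters]
        if unknown:
            raise ValueError(f"unknown parameters: {unknown}")
        params = {**self.parameters, **overrides}
        result = self.run(**{n: data[n] for n in self.inputs}, **params)
        # Check that the program produced every declared output.
        absent = [n for n in self.outputs if n not in result]
        if absent:
            raise ValueError(f"missing outputs: {absent}")
        return result


def threshold_segmentation(image, threshold):
    # Toy stand-in for a segmentation program: mark values above the threshold.
    return {"mask": [[1 if v > threshold else 0 for v in row] for row in image]}


segmentation = Dataflow(
    name="threshold_segmentation",
    inputs=["image"],
    parameters={"threshold": 0.5},
    outputs=["mask"],
    run=threshold_segmentation,
)

result = segmentation.execute({"image": [[0.2, 0.9], [0.7, 0.1]]}, threshold=0.6)
print(result["mask"])  # [[0, 1], [1, 0]]
```

Declaring inputs, parameters and outputs separately from the program itself is what would let a remote site decide whether a given tool can be applied to the data it holds, without inspecting the program's code.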



Figure 1: The NeuroBase system architecture for managing distributed information sources in neuroimaging. Mediation services are used to map and retrieve local information stored in heterogeneous and distributed databases following user queries expressed using concepts from the shared ontological model.

2.1. The Mediator/Wrapper approach

Mediators are information mediation systems that were introduced to allow the virtual integration of heterogeneous distributed information sources in cooperative federated database systems. Mediators differ from standard database management systems in several respects. Firstly, they do not supply mechanisms for updating several sources of information simultaneously. They only support queries to information sources, in order to preserve the autonomy of these sources and the fact that they are locally managed. Secondly, to reinforce interoperability and to adapt to the data structures encountered in databases, mediators support various data models, from standard structured data, such as relational, object or multi-dimensional models, to semi-structured models, such as XML. The architecture of mediators is also different. It is based on a "mediator/wrapper" concept [17], in which a mediator offers a central view of all sources of information and the associated wrappers, each dedicated to one source, hide their heterogeneity. Using the corresponding wrappers, a mediator rewrites the user query into source-dependent queries, then recomposes the various responses and formats the final response for the user. The decomposition of the query into sub-queries is optimized by means of a cost-based model to obtain the most efficient execution plan. This architecture clearly specifies the respective roles of the mediator, which processes the user queries, and of the wrappers, which translate the sub-queries into the relevant format for the associated source of information. The practical benefit of such an architecture is that the work needed to introduce a new source of information is reduced to the creation of the corresponding wrapper. Several mediators have been developed already (for instance DISCO [18] and Mocha [19]). Since 1998, one of the project's participants has been developing a new generation of mediators, called Le Select², which allows one to share distributed, heterogeneous and autonomous data and programs via a high-level query language. Le Select is the cornerstone of the NeuroBase approach.
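The division of roles between mediator and wrappers described above can be illustrated with a minimal Python sketch. It is not Le Select code: the `Wrapper` and `Mediator` classes, the two toy sources and the `modality` attribute are hypothetical, and the cost-based optimization of sub-queries is deliberately left out (the mediator here simply filters the rows each wrapper returns).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class Wrapper:
    """Hides one local source behind the shared vocabulary (hypothetical interface)."""
    site: str
    fetch: Callable[[str], List[Row]]  # concept name -> rows expressed in shared terms


class Mediator:
    """Read-only mediator: sends the query to every wrapper and recomposes the answers."""

    def __init__(self, wrappers: List[Wrapper]) -> None:
        self.wrappers = wrappers

    def query(self, concept: str, **criteria: str) -> List[Row]:
        answers: List[Row] = []
        for wrapper in self.wrappers:
            for row in wrapper.fetch(concept):  # source-dependent sub-query
                if all(row.get(k) == v for k, v in criteria.items()):
                    answers.append({**row, "site": wrapper.site})
        return answers


def fetch_c1(concept: str) -> List[Row]:
    # Toy source 1: a file hierarchy, keyed by path.
    files = {"s01/anat.img": "anatomical", "s01/func.img": "functional"}
    if concept != "Dataset":
        return []
    return [{"ID": path, "modality": m} for path, m in files.items()]


def fetch_c2(concept: str) -> List[Row]:
    # Toy source 2: a relational table with its own local codes.
    table = [(42, "ANAT"), (43, "FUNC")]
    names = {"ANAT": "anatomical", "FUNC": "functional"}
    if concept != "Dataset":
        return []
    return [{"ID": str(i), "modality": names[code]} for i, code in table]


mediator = Mediator([Wrapper("C1", fetch_c1), Wrapper("C2", fetch_c2)])
hits = mediator.query("Dataset", modality="anatomical")
```

Adding a third source in this sketch only requires writing one more `fetch_*` function, which mirrors the point made above about the cost of introducing a new source.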

Figure 2: Excerpt of the NeuroBase ontology. Some concepts that appear in the text are shown in italic.

2.2. The Semantic Model

This semantic model, or ontology, has to be defined by a collaborative community. This requires a considerable amount of work, since there is no fully defined common ontology from which we can derive our semantic referential. We have to build it in a domain that is complex and not well defined. Some existing works can provide valuable inputs, such as the fMRI data center ontology³ and medical thesauri such as the "Neuronames" terminology [20]. Part of our effort was devoted to the design of this ontology (see Figure 2). Briefly, the ontology is made of concepts that represent the relevant entities and their associated properties and supply the search criteria needed to support user queries, such as a Subject or a GroupOfSubjects with or without a specific PersistentPathologyAssessment involved in a Study. The corresponding Dataset of Anatomical and Functional images, with their AcquisitionProtocol, DataProcessing methods and InterpretationOfDatasetComponent (e.g. labels corresponding to anatomical entities, meshes, probabilistic information), are also described. Concepts have been introduced to cover at least the specific applications addressed by the NeuroBase contributors (epilepsy, visual cortex exploration and Alzheimer's disease).

² http://www-caravel.inria.fr/~leselect/
³ http://www.fmridc.org/f/fmridc/aboutus/index.html



3. The NeuroBase Demonstrator

In order to evaluate our architecture, we have recently built a demonstrator based on existing modules such as Le Select, BrainVISA/Anatomist⁴, MRIcro⁵, FSL⁶ and Vistal⁷. It can be extended to modules widely used in the neuroscience community, such as the SPM⁸ software.

3.1. The test-bed application

The purpose of the test-bed application is to demonstrate that the NeuroBase architecture can support, via the Internet, the testing and comparison of image processing modules, for instance in order to select the most robust one. These modules are distributed across several centers and applied to distributed data. Presently, two test centers are involved. The first center, C1, located in Grenoble (FR), has developed an image processing chain for the delineation of visual cortical areas, including cortex segmentation and unfolding. Image data are acquired on a 3T Bruker scanner in the context of cognitive experiments for visual cortex exploration. They are stored in the Analyze format. The second center, C2, located in Rennes (FR), has developed image processing tools for restoration (denoising and debiasing) and segmentation. Data are mainly acquired in the context of epilepsy on a 1.5T GE scanner and stored in the GIS format. Figure 3 illustrates the application. First, C2 queries for an anatomical image available at C1; after the required format transformation, the image is restored locally, i.e. at C2 (anisotropic filtering). Then, C2 launches a specific tool for brain extraction: the Bet/FSL algorithm is executed at C1 on the input (the restored image) and provides the corresponding outputs, a brain image and a brain mask (binary image). After format conversion, C2 launches the tissue segmentation locally. C2 also launches a similar image processing tool available at C1 (MA_segmentation), whose execution is performed at C1. The two segmented images are then compared at C2 using the required tool (difference). Results are displayed at C2 or at C1 with the local 3D viewer.

The same dataflow can be executed on data either from C1 (as in the example) or from C2. The end user does not need to know where the data are stored or where the methods are executed.
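The test-bed dataflow can be written down as a list of steps, each with its execution site, inputs and outputs. The Python sketch below is only an illustration under stated assumptions: the `Step` structure, the data names and the choice of the brain image as input to both segmentations are hypothetical, and no remote execution takes place. Each step runs as soon as its inputs are available, which mirrors the absence of an explicit synchronization process.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Step:
    tool: str
    site: str                  # center where the tool is executed
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


# Steps of the test-bed application; data names are illustrative.
DATAFLOW = [
    Step("restoration", "C2", ("anatomical_image",), ("restored_image",)),
    Step("Bet", "C1", ("restored_image",), ("brain_image", "brain_mask")),
    Step("Vistal_segmentation", "C2", ("brain_image",), ("segmentation_C2",)),
    Step("MA_segmentation", "C1", ("brain_image",), ("segmentation_C1",)),
    Step("difference", "C2", ("segmentation_C1", "segmentation_C2"), ("difference_image",)),
]


def run(steps: Iterable[Step], initial: Iterable[str]) -> List[str]:
    """Run each step once its inputs exist; return the order as 'tool@site'."""
    available = set(initial)
    pending = list(steps)
    trace: List[str] = []
    while pending:
        ready = [s for s in pending if set(s.inputs) <= available]
        if not ready:
            missing = sorted({i for s in pending for i in s.inputs} - available)
            raise ValueError(f"dataflow is stuck, missing inputs: {missing}")
        for step in ready:
            available.update(step.outputs)  # outputs become usable by later steps
            trace.append(f"{step.tool}@{step.site}")
            pending.remove(step)
    return trace


trace = run(DATAFLOW, ["anatomical_image"])
```

Because a step only names its inputs and outputs, the same list can be replayed on an image that originates from C2 without changing the steps themselves.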

⁴ http://brainvisa.info
⁵ http://www.psychology.nottingham.ac.uk/staff/cr1/mricro.html
⁶ http://www.fmrib.ox.ac.uk/fsl/
⁷ http://www.irisa.fr/visages/software-fra.html
⁸ http://www.fil.ion.ucl.ac.uk/spm/



Figure 3: The test-bed application: two research centers C1 and C2, physically separated, share via the Internet data and processing tools in order to compare two segmentation methods (Vistal in C2 and MA_segmentation in C1) on an anatomical image previously restored at C2. Segmentation processes are executed separately at each center. No synchronization process is implemented and difference is calculated when data are available. Concepts present in our ontology that correspond to inputs and outputs of our image processing tools are shown in italic.

3.2. The Architecture

The overall architecture is shown in Figure 4. The Le Select middleware is installed at each center. It is a generic server that includes data and image processing wrappers. Wrappers are site specific. Shared image processing tools are executed in each site's local software library environment. Local 2D/3D viewers (here Anatomist⁹ and MRIcro¹⁰) can be used. All distributed queries are performed via a common application developed in a Tomcat servlet server environment and accessible through a standard web browser.

3.3. The Working Principles

Shared anatomical images are stored in local data repositories (a local file hierarchy for C1 and a PostgreSQL database for C2). To support queries that use the services available on our distributed system, wrappers were designed to map the local data organization in C1 and C2 to the semantic referential. Figure 5 highlights the main mappings between the local file hierarchy in C1 and some concepts.
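As an illustration of what such data wrappers do, the following Python sketch maps a file path (C1 style) and a relational row (C2 style) to the same set of shared concept names. The directory layout, the column order and the `DatasetKind` key are assumptions made for the example; the actual mappings are those of Figure 5.

```python
from pathlib import PurePosixPath
from typing import Dict, Tuple

# Hypothetical local directory names mapped to ontology concepts.
KINDS = {"anat": "Anatomical", "func": "Functional"}


def c1_wrapper(path: str) -> Dict[str, str]:
    """Map an assumed '<study>/<subject>/<kind>/<file>' path to shared concepts."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4:
        raise ValueError(f"unexpected layout: {path}")
    study, subject, kind, _filename = parts
    return {"Study": study, "Subject": subject,
            "DatasetKind": KINDS[kind], "ID": path}


def c2_wrapper(row: Tuple[int, str, str, str]) -> Dict[str, str]:
    """Map an assumed (id, study, subject, kind) database row to the same concepts."""
    row_id, study, subject, kind = row
    return {"Study": study, "Subject": subject,
            "DatasetKind": kind.capitalize(), "ID": str(row_id)}


from_c1 = c1_wrapper("retino/s01/anat/t1.img")
from_c2 = c2_wrapper((7, "epilepsy", "p12", "anatomical"))
```

Both records expose the same keys, so a query expressed with the shared concepts does not need to know which storage layout lies behind each site.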

⁹ http://brainvisa.info/
¹⁰ http://www.psychology.nottingham.ac.uk/staff/cr1/mricro.html


Similarly, wrappers were introduced to execute programs on published data. As Le Select uses a relational format, program wrappers take relational data as input and produce relational data as output. In the following example, the skull stripping program is executed on images referred to as Dataset in the ontology. This command is executed by Tomcat.

    job execute //$host/fsl/Bet                              ← execution of the bet program; host is set to the C1 hostname
    input a is select Vol_Bin1 as img, Vol_Bin2 as hdr " +   ← SQL query to retrieve the input files
    /ontology/AllDataset " +                                   in the AllDataset table of the ontology
    where ID = '$DatasetID'                                  ← identifier for all Dataset entities


[Figure 4, bottom diagram. Labels: INTERNET; Firewall; Secure ports; MWS server at each of C1, C2, C3 and C4; Local_NET; Client Demo WebApp C1 (w3ext); TomCat Apache; port 8080; "Connect thru https and passwd".]

Figure 4: (Top) the NeuroBase demonstrator architecture deployment between two distant centers C1 and C2. WD: data wrapper, WP: image processing program wrapper; (Bottom) The current network implementation of the system between 4 different partners, this underlines the generality of the proposed approach.

Figure 5: Mapping the BALC concepts hierarchy, i.e. the local database in center C1, to the semantic referential.



4. Discussion

Our preliminary demonstrator shows that the principles and the technology we propose can be used in the context of neuroimaging data. Clearly, it should be extended. Presently, only a small part of the ontology is mapped to local databases. The call to processing tools is hard-coded: the selection of inputs and the tuning of parameters are still limited. The application developed in the Tomcat environment should be extended to allow the selection, through a standard web browser, of the available processing tools. Finally, the outputs of the image processing tools available at each site are not reintroduced into the corresponding file hierarchy or PostgreSQL database. Such extensions are under development.

Neuroimaging is a relatively new scientific discipline that is evolving rapidly. Many concepts currently used did not exist a few years ago. In this moving area, where no consensus has been reached on several concepts, defining a centralized database for sharing data and processing methods seems rather complex and, if successful, would require substantial manpower for its maintenance. Moreover, there is a legitimate desire for autonomy that conflicts with the centralized approach. Information sources do exist in different centers, but they have generally been set up for purely local needs and are accessible to only a very small user community. In this context, the NeuroBase architecture we propose, based on a mediator/wrapper approach, seems attractive. Our architecture can accommodate the evolution of information sources, or even the arrival of new ones, simply by updating wrappers or creating new ones (this amounts to changing or adding views on the semantic referential). In our approach, the semantic referential is central. It should be flexible enough to accept the introduction of new concepts while remaining consistent. The AI community, from knowledge engineering to the semantic grid, has developed strong expertise in this field through the construction of controlled vocabularies and thesauri, which will provide valuable hints. The extensive use and the evolution of our demonstrator will allow us to confront it with different real situations.

5. References

[1] Mazziotta J.C., Toga A.W., Evans A.C., Fox P. and Lancaster J.L.: A Probabilistic Atlas of the Human Brain: Theory and Rationale for its development. Neuroimage 2 (1995) 89-101.
[2] Roland P.E., Zilles K.: Brain atlases - a new research tool. Trends in neurosciences 17 (1994) 458-467.
[3] Graf von Keyserlingk, D., Niemann K., and Wasel, J.: A quantitative approach to spatial variation of human cerebral sulci. Acta Anatomica 131 (1988) 127-131.
[4] Toga, W.: A Three-Dimensional atlas of structure/function relationships. Journal of chemical neuroanatomy 4 (1991) 313-318.
[5] Van Essen, D.C., Drury, H.A., Joshi, S. and Miller, M.I.: Functional and structural mapping of human cerebral cortex: solutions are in the surfaces. Proc Natl Acad Sci U S A 95 (1998) 788-795.
[6] Bloom, F.E.: The multidimensional database and neuroinformatics requirements for molecular and cellular neuroscience. Neuroimage 4 (1996) S12.
[7] Seitz, R.J., Bohm, C., Greitz, T., Roland, P.E., and Erikson, L.: Accuracy and precision of the computerized brain atlas programme for the localization and quantification in positron emission tomography. J. Cerebr Blood Flow Metab. 10 (1990) 443-457.
[8] Lehmann, E.D., Hawkes, D.J., Hill, D.L., Bird, C.F., Robinson, G.P., Colchester, A.C. and Maisey M.N.: Computer-aided interpretation of SPECT images of the brain using an MRI-derived 3D neuro-anatomical atlas. Medical Informatics 16 (1991) 151-66.
[9] Höhne, K.H., Bomans, M., Riemer, M., Schubert, R., Tiede, U. and Lierse, W.: A volume based anatomical atlas. IEEE Comp. Graphics and Application 12 (1992) 72-78.
[10] Tiede, U., Schiemann, T. and Höhne, K.H.: Visualizing the Visible Human. IEEE computer Graphics and Applications 16 (1996) 7-9.
[11] Kahng J. and McLeod D.: Dynamic Classificational Ontologies: Mediation of information sharing in Cooperative Federated Database Systems. In: Papazoglou M.P., Schlageter G. (eds): Cooperative Information Systems: Trends and Directions. Academic Press, London, U.K., (1998) 179-203.
[12] Rowland, A., Burns, M., Hartkens, T., Hajnal, J., et al.: Information eXtraction from Images (IXI): Image Processing Workflows Using A Grid Enabled Image Database. In: Dojat M., Gibaud, B. (eds): Proceedings of DiDaMIC'04 Workshop. MICCAI conference (St Malo, 26-29 Sept. 04), Rennes (2004) 55-64.
[13] Brady, M.: Grid-based Federated Databases of Mammograms: Mammogrid and eDiamond experiences. In: Dojat M., Gibaud, B. (eds): Proceedings of DiDaMIC'04 Workshop. MICCAI conference (St Malo, 26-29 Sept. 04), Rennes (2004) 84.
[14] Rex, D. E., Ma, J. Q. and Toga, A. W.: The LONI Pipeline Processing Environment. NeuroImage 19 (2003) 1033-48.
[15] Cointepas, Y., Mangin, J., Garnero, L., Poline, J., et al.: BrainVISA: Software platform for visualization and analysis of multi-modality brain data. In: Proceedings of 7th Human Brain Mapping Conf, Brighton (UK) (2001) S98.
[16] Koslow, S.H.: Should the neuroscience community make a paradigm shift to sharing primary data? Nature Neuroscience 3 (2000) 863-865.
[17] Wiederhold G. and Genesereth M.: The Conceptual Basis for Mediation Services. IEEE Expert 12 (1997) 38-47.
[18] Tomasic A. et al.: The distributed Information Search Component (Disco) and the World Wide Web. In: Proc. of the ACM SIGMOD, Tucson, Arizona, May (1997) 546-548.
[19] Rodriguez-Martinez, M., Roussopoulos, N.: MOCHA: a Self-extensible Middleware Substrate for Distributed Data Sources. In: Proc. of the ACM SIGMOD Int Conf on Management of Data, Houston, May 2000.
[20] Bowden, D. M., Martin, R. F.: NeuroNames Brain Hierarchy. NeuroImage 2 (1995) 63-83.


Bridging clinical information systems and grid middleware: a Medical Data Manager

Johan Montagnat (a), Daniel Jouvenot (b), Christophe Pera (c), Ákos Frohner (d), Peter Kunszt (d), Birger Koblitz (d), Nuno Santos (d), and Cal Loomis (b)

(a) CNRS, I3S laboratory
(b) CNRS, LAL laboratory
(c) CNRS, CREATIS laboratory
(d) CERN

Abstract. This paper describes the effort to deploy a Medical Data Management service on top of the EGEE grid infrastructure. The most widely accepted medical image standard, DICOM, was developed to meet the needs of clinical practice. It is implemented in most medical image acquisition and analysis devices. The EGEE middleware uses the SRM standard for handling grid files. Our prototype exposes an SRM-compliant interface to the grid middleware, transforming SRM requests into DICOM transactions on the fly. The prototype ensures user identification, strict file access control and data protection through the use of relevant grid services. This Medical Data Manager eases access to the medical databases needed by many medical data analysis applications deployed today. It offers a high-level data management service, compatible with clinical practices, which encourages the migration of medical applications towards grid infrastructures. A limited-scale testbed has been deployed as a proof of concept of this new service. The service is expected to be put into production with the next EGEE middleware generation.

1. Medical data management in hospitals and grid data management

The medical community routinely uses clinical images and associated medical data for diagnosis, intervention planning and therapy follow-up. Medical imagers are producing an increasing number of digital images for which computerized archiving, processing and analysis are needed [8,12]. Indeed, image networks have become a critical component of daily clinical practice over the years. With their emergence, the need for standardized medical data formats and exchange procedures has grown [2]. For this reason, the Digital Imaging and Communications in Medicine (DICOM) standard [6] was adopted by a large consortium of medical device vendors. Picture Archiving and Communication Systems (PACS) [10], which manipulate DICOM images and often other medical data in proprietary formats, are offered by medical device vendors for managing clinical data. PACS are often proprietary, weakly standardized solutions. PACS may be more or less connected to the Hospital Information System (HIS), holding administrative information about patients, and Radiological Information Systems (RIS), holding additional information for the radiological departments.

The DICOM standard, PACS, RIS and HIS have been developed with clinical needs in mind. They ease the daily care of patients and medical administrative procedures. However, their usage in other areas is very limited. The interface with computing infrastructures, for instance, is almost completely lacking. In addition, current PACS hardly address medical data management needs beyond the administrative boundaries of clinical centers, while patient medical folders are often spread over the many medical sites that have been involved in the patient's healthcare. Many medical image acquisition devices also conform only weakly to the DICOM standard, so that the standard hardly hides the heterogeneity of these systems.

In the last decades, with the growing availability of digital medical data, many medical data processing and analysis algorithms were developed, enabling computerized medical applications for the benefit of patients and healthcare practitioners. Although it shares the same data sources, the medical image analysis community has different requirements for medical systems than the healthcare community. Many algorithms are developed for processing and producing image files. A common procedure for accessing all medical data sources is needed.

Given the enormous amount of medical data produced inside hospitals and the cost of medical data computing (especially image analysis algorithms), grids have proved to be very useful infrastructures for a large variety of medical applications [11]. Grids provide computing resources and workload systems that ease application code deployment and usage. Moreover, grids provide distributed data management services that are well suited for handling medical data geographically spread throughout various medical centers [5,7,4,9,3]. However, existing grid middlewares often deal only with data files and do not provide higher-level services for manipulating medical data. Medical data often have to be manually transferred and transformed from hospital sources to grid storage before being processed and analyzed. Such manual interventions are tedious and often limit the systematic use of grid infrastructures. In some cases, they may even prevent the use of grids, e.g. when the amount of data to transfer is too large. As a consequence, the first key to the systematic deployment of medical image processing algorithms is to provide a data manager that:

• Provides access to medical data sources for computing without interfering with clinical practice.
• Ensures transparency, so that accessing medical data does not require any specific user intervention.
• Ensures a high level of data protection to respect patients' privacy.

The Medical Data Manager (MDM) service described in this paper was designed to fulfill these constraints. It was developed with the support of the EGEE¹ European IST project. The remainder of this paper describes the technical requirements to be addressed by such a service and details the service design.

¹ Enabling Grids for E-sciencE, http://www.eu-egee.org



2. Clinical usage of medical data

The DICOM standard introduced earlier encompasses, among other things, an image format and an image communication protocol. A DICOM image usually contains one slice (a 2D image) acquired using any medical imaging modality (MRI, CT-scan, PET, SPECT, ultrasound, X-ray... [1]). A DICOM image may contain a multi-slice data set, but this is rarely encountered. A DICOM image contains both the image data itself and a set of additional information (or metadata) related to the image, the patient, the acquisition parameters and the radiology department. DICOM metadata are stored in fields. Each field is identified by a unique tag defined in the DICOM standard. A given field may be present or absent depending on the imager that produced the image. The standard is open, and imaging device manufacturers tend to use their own fields for various information. A few fields (such as image size) are mandatory, but experience has shown that surprises should be expected when analyzing a DICOM image. The image itself is usually stored as raw data. Most imaging devices produce one intensity value per image pixel, coded in a 12-bit format. Other formats may be encountered, such as 16-bit data or lossless JPEG.

2.1. DICOM protocol, storage, and security

Most (reasonably modern) medical image acquisition devices are DICOM clients. DICOM servers are computers with on-disk and/or tape back-ends able to store and retrieve DICOM images. The DICOM protocol defines the communication between DICOM servers and clients. There is no standardization of DICOM storage: DICOM servers implement their own data storage policy. One should not see DICOM data sets as a set of files. As stated above, a single DICOM image usually contains only one image slice. In practice, during a medical examination (a DICOM study), a radiologist acquires several 2D and 3D images, representing hundreds to thousands of slices. A study is divided into one or several series, and each series is composed of a set of slices (which can be stacked to assemble a volume when they belong to the same 3D image). Note that there is often no notion of 3D image encoded in the DICOM format: a series may contain a set of slices composing several 3D images. The way a DICOM server stores these data sets on disk is irrelevant, just as the way a database stores its tables is usually not known to its users: the medical user is never exposed to the DICOM storage and does not need to know whether different files are used for each DICOM slice, series, study, etc. Metadata are included in DICOM image headers, which makes them difficult to manipulate. A DICOM server will often extract these metadata and store them in a database to ease data search.

The DICOM security model is rather weak. DICOM files are stored and transported unencrypted. Files contain patient data. The DICOM server security model works on a per-application basis: all users having access to some DICOM client application can access the information that the server returns to this specific application. DICOM servers use random file names without any connection to the patient information, and a proprietary data storage policy. To cope with these data protection limitations, security is often implemented in hospitals by isolating the image network from the outside world.

2.2. Access to medical images

Each image acquisition device is a potential DICOM-compliant medical image source. In a radiological department, one or several DICOM servers can be set up to centralize the data acquired on this site. Medical data are naturally distributed over the different acquisition sites. In clinical practice, physicians do not access image files directly. They identify data by associated metadata such as patient name, acquisition date, radiologist name, etc. The data are transferred mainly for visualization purposes. The physician quickly scans the stack of slices in the DICOM study and focuses on the slices he or she is interested in.
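The study, series and slice hierarchy of Section 2.1 can be sketched in Python as follows. This is an illustration, not a DICOM parser: the `Slice` record and in particular its `volume` field are hypothetical. Since DICOM often encodes no notion of a 3D image, an application has to choose some attribute to tell apart the volumes that share a series, and that choice is an assumption here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Slice(NamedTuple):
    study: str
    series: str
    volume: int       # hypothetical discriminator between 3D images of one series
    position: float   # location of the slice along the stacking axis
    pixels: bytes     # raw 2D image data


VolumeKey = Tuple[str, str, int]


def stack_volumes(slices: Iterable[Slice]) -> Dict[VolumeKey, List[bytes]]:
    """Group slices by (study, series, volume) and order each stack by position."""
    groups: Dict[VolumeKey, List[Slice]] = defaultdict(list)
    for s in slices:
        groups[(s.study, s.series, s.volume)].append(s)
    return {
        key: [s.pixels for s in sorted(group, key=lambda s: s.position)]
        for key, group in groups.items()
    }


# Slices arrive in arbitrary order; one series holds two distinct volumes.
received = [
    Slice("st1", "se1", 1, 2.0, b"b"),
    Slice("st1", "se1", 0, 1.0, b"x"),
    Slice("st1", "se1", 1, 1.0, b"a"),
]
volumes = stack_volumes(received)
```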

3. Medical image analysis

In the medical image analysis community, the needs are quite different. One often needs to identify images through metadata too, although the searches are not necessarily on nominative data but are often related to the acquisition type or body region. 3D images are exported to disk files for post-processing and ease of use. Various 3D medical image formats may be used to stack different DICOM slices into a single image volume (the most common being the Analyze file format).

3.1. Enforcing medical data and security

All medical data should be considered sensitive in order to preserve patient privacy. Nominative medical data are of course the most critical data, and therefore no binding between nominative data and images should be possible for non-accredited users. In clinical practice, this result is often obtained by isolating the image network. Only physicians participating in the healthcare of a patient should have access to the data of this patient. On a grid, the distribution of data makes the security problem very sensitive. To ensure patient privacy, the header of all DICOM images sent by a DICOM server should be wiped out, at least partially, to ensure anonymity. All images that are stored outside the source center should be encrypted to ensure that non-accredited users cannot read the image content.
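A minimal Python sketch of partial header blanking is given below. The header is modeled as a plain dictionary; `PatientName`, `PatientID` and `PatientBirthDate` are standard DICOM attribute keywords, but the choice of exactly these three fields as the nominative set is an assumption made for the example, and a real deployment would need a more complete list. Encryption of the image content is not shown, since it would require a cryptographic library beyond the scope of this sketch.

```python
from typing import Dict

# Assumed nominative fields; a real policy would cover more attributes.
NOMINATIVE_FIELDS = frozenset({"PatientName", "PatientID", "PatientBirthDate"})


def blank_header(header: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the header with nominative fields emptied."""
    return {
        field: ("" if field in NOMINATIVE_FIELDS else value)
        for field, value in header.items()
    }


original = {
    "PatientName": "DOE^JANE",
    "PatientID": "123456",
    "Modality": "MR",
    "Rows": "256",
}
anonymous = blank_header(original)
```

The function returns a new dictionary and leaves its argument untouched, which matches the requirement that the copy kept inside the hospital is not altered.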

4. Medical Data Management Service

4.1. EGEE grid middleware

The EGEE project is currently deploying the LCG2 middleware² on its production infrastructure. LCG2 is based on GLOBUS2, Condor, and the other services developed in the European DataGrid project³. A new-generation middleware, gLite⁴, is under testing and should be deployed in spring 2006. Our Medical Data Manager service (MDM) is based on gLite.

The gLite middleware provides workload management services for submitting computing tasks to the grid infrastructure and data management services for managing distributed files. The data management is based on a set of Storage Elements, which are storage resources distributed over the various sites participating in the infrastructure (currently, more than 180 sites distributed all over Europe and beyond). All storage elements expose the same interface for interacting with the other middleware services: the Storage Resource Manager interface (SRM), which is standardized in the context of the Global Grid Forum⁵. The SRM handles local data at file level. It offers an interface to create, fetch, pin, or destroy files, among other things. It does not implement data transfer by itself. Additional services such as GridFTP or gLiteIO coexist on storage elements to provide transfer capabilities.

In addition to storage resources, the gLite data management system includes a File Catalog (Fireman) offering a unique entry point for files distributed over all grid storage elements. Each file is uniquely identified through a Global Unique IDentifier (GUID). The file catalog contains tables associating each GUID with file locations. For efficiency and fault tolerance reasons, files may be replicated on different sites. Thus, each GUID may be associated with several locations. To ease manipulation by users, human-readable Logical File Names (LFN) can be associated with each file (each GUID).

4.2. Medical Data Management service design

The Medical Data Management service architecture is diagrammed in Figure 1. A clinical site is represented on the left: various imagers in a hospital push the images they produce to a DICOM server. Inside the hospital, clinicians can access the DICOM server content through DICOM clients. The MDM internal logic is represented in the center of Figure 1. The grid services interfacing with the MDM are shown on the right.

All middleware services requiring access to data storage do so through SRM requests sent to storage elements. To remain compatible with the rest of the grid infrastructure, our MDM service is based on SRM-DICOM interface software. The SRM-DICOM core receives SRM requests and transforms them into DICOM transactions addressed to the medical servers. Thus, medical data servers can be shared between clinicians (using the classical DICOM interface inside hospitals) and image analysis scientists (using the SRM-DICOM interface to access the same databases) without interfering with clinical practice. An internal scratch space is used to transform DICOM data into files that are accessible through data transfer services (GridFTP or gLiteIO).

² LCG2: Large hadron collider Computing Grid middleware, http://lcg-web.cern.ch
³ European DataGrid project, http://www.edg.org
⁴ gLite middleware, http://www.glite.org
⁵ Global Grid Forum, http://www.ggf.org
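The relationship between GUIDs, replica locations and Logical File Names described in Section 4.1 can be sketched with a small in-memory catalog. This Python example is not the Fireman API: the class, its method names and the storage URLs are hypothetical and only illustrate the data model.

```python
import uuid
from typing import Dict, List, Optional


class FileCatalog:
    """Toy catalog: one GUID per file, several locations per GUID, optional LFN."""

    def __init__(self) -> None:
        self._replicas: Dict[str, List[str]] = {}  # GUID -> storage locations
        self._lfns: Dict[str, str] = {}            # LFN -> GUID

    def register(self, location: str, lfn: Optional[str] = None) -> str:
        guid = str(uuid.uuid4())
        self._replicas[guid] = [location]
        if lfn is not None:
            if lfn in self._lfns:
                raise ValueError(f"LFN already in use: {lfn}")
            self._lfns[lfn] = guid
        return guid

    def add_replica(self, guid: str, location: str) -> None:
        self._replicas[guid].append(location)

    def locations(self, name: str) -> List[str]:
        """Resolve either an LFN or a GUID to the list of replica locations."""
        guid = self._lfns.get(name, name)
        return list(self._replicas[guid])


catalog = FileCatalog()
guid = catalog.register("srm://se1.example.org/data/f1", lfn="/grid/demo/brain01")
catalog.add_replica(guid, "srm://se2.example.org/data/f1")
```

Looking a file up by its LFN or by its GUID returns the same two locations, so a client can pick whichever replica is closest or still reachable.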

[Figure 1 diagram. Hospital side: imagers (DICOM push), DICOM server, DICOM client (DICOM get). Medical Data Manager: SRM-DICOM interface (read only), header blanking, encryption, abstraction layer, scratch space, gLiteI/O server, secondary SRM (read-write). Grid middleware side: SRM request, gLiteI/O interface, Hydra key store, AMGA metadata manager (key, DICOM, patient, acquisition, hospital), Fireman file catalog (GUID, LFN, key), any grid service using the MDM client library and the gLiteIO client.]

Figure 1. Overview of the medical data manager

A metadata manager is also used to extract DICOM header information and ease data search. The AMGA service [13] (footnote 6) ensures secure storage of these very sensitive data. The AMGA server holds a relation between each DICOM slice and the image metadata.

This specialized SRM does not provide a classical Read/Write interface to a storage element. A classical R/W storage element can symmetrically receive grid files to be stored or deliver archived files to the grid on request. In the MDM, the SRM interface only accepts registration requests coming internally from the hospital. To avoid interfering with the clinical data, external grid files cannot be registered on the MDM storage space: only get requests are authorized from the grid side. If classical grid storage (with write capability) is desired, a classical secondary SRM can be installed on the same host.

For data encryption, a secured encryption key catalog is also used. It is named the hydra catalog because it uses a split key storage strategy to improve security and fault tolerance [15,14].

An abstraction layer, currently being prototyped and tested, is also depicted on the diagram. Its role is to offer a higher-level abstraction for accessing 3D images by associating all DICOM slices corresponding to a single volume. Indeed, most medical image processing applications do not manipulate 2D images independently but rather consider complete volumes. The abstraction layer associates a single GUID with each volume. On a request for the volume associated with this GUID, all corresponding slices are transferred from the DICOM server and assembled into a single volume in scratch space.

Footnote 6: ARDA metadata catalog project, http://project-arda-dev.web.cern.ch/project-arda-dev/metadata/
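The idea of split key storage mentioned above can be illustrated with a minimal sketch. It assumes the simplest possible scheme, an n-of-n XOR split in which every share is needed to rebuild the key; the text does not say which scheme the hydra catalog uses, and this one does not provide the fault tolerance that a threshold scheme would. The function names `split_key` and `join_key` are hypothetical.

```python
import secrets


def split_key(key, n):
    """Split `key` (bytes) into n shares; all n are needed to rebuild it.

    Illustrative n-of-n XOR split, NOT the scheme used by the hydra catalog.
    """
    if n < 2:
        raise ValueError("need at least two shares")
    # n-1 shares are pure random bytes of the same length as the key.
    shares = [secrets.token_bytes(len(key)) for _ in range(n - 1)]
    # The last share is the key XORed with all random shares.
    last = key
    for share in shares:
        last = bytes(a ^ b for a, b in zip(last, share))
    return shares + [last]


def join_key(shares):
    """Rebuild the key by XORing all shares together."""
    key = bytes(len(shares[0]))  # all-zero start value
    for share in shares:
        key = bytes(a ^ b for a, b in zip(key, share))
    return key
```

Each share taken alone is indistinguishable from random bytes, so a single compromised key server reveals nothing about the key. Tolerating the loss of a server would require a threshold scheme instead.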



Figure 2. Triggered action at image creation. Actors: DICOM imager (client), DICOM server, Fireman File Catalog, AMGA metadata server, hydra key store. Sequence triggered by the PUSH: analyse DICOM header (build SURL), register SURL, write metadata, generate encryption key, store key.

4.3. Internal service interaction patterns

To fulfill its role, the MDM service needs to be notified when files are produced by the imagers and stored into the DICOM server. This notification triggers a file registration procedure that is depicted in figure 2. The DICOM data triggering the operation is first stored into the hospital DICOM server as usual. The DICOM header is then analyzed to extract image-identifying information. This DICOM ID is used to build a Storage URL (SURL), as used by the grid File Catalog to locate files. The SURL is registered into the File Catalog and a GUID is associated with this data on the grid side. The other metadata extracted from the DICOM header are stored into the AMGA metadata server. Finally, the encryption keys that are associated with the file and that will be used for data retrieval are stored into the hydra distributed database.

Once DICOM data sets have been registered into the MDM, the server is able to deliver requested data to the grid as depicted in figure 3. A client library is used for this purpose. To cover all application use cases, the MDM client library provides APIs for requesting files based on their grid identifier (GUID) or on the metadata attached to the file. For a request on the metadata, a database query is first made to the AMGA server and the list of GUIDs of the images matching the query is returned. The SRM-DICOM server can then deliver the images requested through their GUIDs. SRM get requests are translated into DICOM get queries. Data extracted from the DICOM server are first written to an internal scratch space. Their format is transformed into a simple 3D image file format (a human-readable header including image size and encoding, followed by the raw image data). In this transformation, the DICOM header, which contains patient-identifying information, is dropped to preserve anonymity. The files are also encrypted before being sent out, to ensure that no sensitive information is ever transferred or stored on the grid in a readable format. Files are then transferred through the gLiteIO service and returned to the client in encrypted form. The file is only decrypted in the memory of the client host, provided that the client is authorized to access the file encryption keys.
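The registration and retrieval patterns described in this section can be summarized in a minimal in-memory sketch. Everything here is an assumption made for illustration: the class `MiniMDM`, the SURL layout, the use of plain dictionaries in place of the Fireman, AMGA and hydra services, and the function `toy_cipher`, which is a stand-in keystream XOR and not a secure cipher nor the encryption used by the MDM. The header field names (`SOPInstanceUID`, `Rows`, `Columns`) are standard DICOM attribute keywords.

```python
import hashlib
import secrets
import uuid


def toy_cipher(data, key):
    """Symmetric stand-in cipher (SHA-256 keystream XOR). Illustration only."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class MiniMDM:
    """Hypothetical stand-in for the MDM services, all kept in memory."""

    def __init__(self, host):
        self.host = host
        self.dicom_server = {}   # DICOM ID -> (header, pixel data)
        self.file_catalog = {}   # GUID -> SURL
        self.metadata = {}       # GUID -> metadata extracted from the header
        self.key_store = {}      # GUID -> encryption key

    def on_image_stored(self, header, pixels):
        """Registration triggered when an imager pushes a new image."""
        dicom_id = header["SOPInstanceUID"]
        self.dicom_server[dicom_id] = (header, pixels)
        surl = "srm://%s/dicom/%s" % (self.host, dicom_id)  # assumed layout
        guid = str(uuid.uuid4())
        self.file_catalog[guid] = surl
        self.metadata[guid] = {k: v for k, v in header.items()
                               if k != "SOPInstanceUID"}
        self.key_store[guid] = secrets.token_bytes(16)
        return guid

    def query(self, **criteria):
        """Metadata query returning the GUIDs of matching images."""
        return [guid for guid, md in self.metadata.items()
                if all(md.get(k) == v for k, v in criteria.items())]

    def get(self, guid):
        """Deliver an image: DICOM get, drop the header, encrypt."""
        dicom_id = self.file_catalog[guid].rsplit("/", 1)[1]
        header, pixels = self.dicom_server[dicom_id]
        # Simple readable header with the image size only; no patient data.
        plain = ("size=%dx%d\n" % (header["Rows"], header["Columns"])).encode()
        return toy_cipher(plain + pixels, self.key_store[guid])
```

A client holding the key would decrypt in memory with `toy_cipher(encrypted, key)`, since the stand-in cipher is its own inverse. In the real service the key is fetched from the key store subject to the user's access rights.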


Figure 3. Accessing DICOM images. Actors: MDM client library, AMGA metadata, gLiteI/O server, Fireman File Catalog, SRM-DICOM core, DICOM server, hydra keystore, scratch space. Sequence: query on metadata (returns GUIDs), gLiteI/O(GUID), get SURL, prepare(SURL), DICOM GET, get key, convert file, encrypt file, write file to scratch space, file I/O of the encrypted file, get key, decrypt file.

4.4. MDM client

On the client side, three levels of interface are available to access and manipulate the data held by the MDM. First, since the MDM is seen from the middleware as any storage resource exposing a standard SRM interface, the standard data management client interface can be used to access images, provided that their GUID is known. The files retrieved using this standard interface are encrypted. The second interface is an extra middleware layer which encompasses access to the encryption key and the SRM. Thus images can be fetched and decrypted locally. The third and last level of interface is the fully MDM-aware client library represented in figure 3. It provides access to encrypted files and in-memory decryption of the data on the application side, plus access to the metadata through the AMGA client interface.

5. Discussion

5.1. Data security

The security model of the MDM relies on several services: (i) file access control, (ii) file anonymization, (iii) file encryption, and (iv) secured access to metadata. The user is coherently identified through a single X509 certificate and all services involved in security use the same identification procedure. File access control is enforced by the gLiteIO service, which accepts Access Control Lists (ACLs) for fine-grained access control. The hydra key store and the AMGA metadata services also accept ACLs. To read the content of an image, a user needs to be authorized to access both the file and the encryption key. The access rights to the sensitive metadata associated with the files are administered independently. Thus, it is possible to grant access to an encrypted file only (e.g. for replicating a file without accessing the content), to the file content (e.g. for processing the data without revealing the patient identity), or to the full file metadata (e.g. for medical usage). Through ACLs, it is possible to implement complex use cases, granting access rights (for listing, reading, or writing) to patients, physicians, healthcare practitioners, or researchers needing to process medical data, independently from each other.

5.2. Medical metadata schema

A minimal metadata schema is defined in the MDM service for all images stored. It provides basic information on the patient owning the image, the image properties, acquisition parameters, etc. Two main indexes are used: a patient ID, for all nominative information associated with patients, and the image GUID, for all information associated with images. The patient ID is a unique but irreversible field (such as an MD5 sum of the patient name field). Four main relational tables are used:
• The Patient table, indexed on the patient ID, contains the most sensitive identifying data (patient name, sex, date of birth, etc).
• The Image table, indexed on the image GUID, contains technical information about the image (size, encoding, etc). It establishes a relation with the patient ID.
• The Medical table, indexed on the image GUID, contains additional information on the acquisition (image modality, acquisition place and date, radiologists, etc).
• The DICOM table, indexed on the image GUID, contains the image DICOM identifiers used for querying the DICOM server.

To remain extensible, an additional Protocol table associates image GUIDs with medical protocol names. Through AMGA, the user can create as many medical protocols as needed, containing specific information related to a particular acquisition (e.g. a temporal protocol for cardiac acquisitions). AMGA also enables per-table access right control, which allows access to the most sensitive data (e.g. the Patient table) to be restricted to the minimum number of users.
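The table layout of Section 5.2 can be sketched with SQLite standing in for AMGA. This is only an illustration under assumptions: the column names, the helper functions `patient_id`, `build_db`, `register_image` and `guids_by_modality`, and the choice to hash the patient name directly are all hypothetical, and AMGA's own interface and access control are not modelled.

```python
import hashlib
import sqlite3

# Column names are assumptions; only the five tables and their indexes
# (patient ID or image GUID) come from the text.
SCHEMA = """
CREATE TABLE Patient  (patient_id TEXT PRIMARY KEY, name TEXT, sex TEXT,
                       birth_date TEXT);
CREATE TABLE Image    (guid TEXT PRIMARY KEY,
                       patient_id TEXT REFERENCES Patient(patient_id),
                       size TEXT, encoding TEXT);
CREATE TABLE Medical  (guid TEXT PRIMARY KEY REFERENCES Image(guid),
                       modality TEXT, place TEXT, acq_date TEXT);
CREATE TABLE Dicom    (guid TEXT PRIMARY KEY REFERENCES Image(guid),
                       dicom_uid TEXT);
CREATE TABLE Protocol (guid TEXT REFERENCES Image(guid), protocol_name TEXT);
"""


def patient_id(name):
    """One-way patient index; MD5 is used only because the text cites it."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()


def build_db():
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def register_image(db, guid, patient_name, modality, dicom_uid):
    pid = patient_id(patient_name)
    db.execute("INSERT OR IGNORE INTO Patient (patient_id, name) VALUES (?, ?)",
               (pid, patient_name))
    db.execute("INSERT INTO Image (guid, patient_id) VALUES (?, ?)", (guid, pid))
    db.execute("INSERT INTO Medical (guid, modality) VALUES (?, ?)",
               (guid, modality))
    db.execute("INSERT INTO Dicom (guid, dicom_uid) VALUES (?, ?)",
               (guid, dicom_uid))
    return pid


def guids_by_modality(db, modality):
    """Query that never touches the Patient table."""
    rows = db.execute("SELECT guid FROM Medical WHERE modality = ? ORDER BY guid",
                      (modality,))
    return [row[0] for row in rows]
```

Because a query such as `guids_by_modality` reads only the Medical table, a user could be allowed to run it without holding any right on the Patient table, which is the point of per-table access control.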

6. Testbed

The Medical Data Manager has been deployed on several sites for testing purposes. Three sites currently hold data, in three DICOM servers installed at I3S (Sophia Antipolis, France), LAL (Orsay, France) and CREATIS (Lyon, France). In addition to the DICOM servers, these sites have installed the core MDM services: an SRM-DICOM server and its associated database back-end, a gLiteIO service, a GridFTP service, and all dependencies in the gLite middleware. Clients have been deployed on all three sites. To complete the installation, an AMGA catalog has also been set up at CREATIS (Lyon) to hold the metadata of all sites, and a hydra key store is deployed at CERN (Geneva, Switzerland) to keep the file encryption keys.



Given the number of services involved, the installation and configuration procedure is currently complex. It is being simplified to ease the extension of the testbed, since the MDM service is intended to be deployed in hospitals, where little support is available for the informatics infrastructure. The testbed has been used to demonstrate the viability of the service by registering and retrieving DICOM files across sites. For testing purposes, DICOM data registrations are triggered by hand. Registered files could be retrieved and used from EGEE grid nodes transparently, using the standard EGEE data management interface. The next important milestone will be to test the system in connection with hospitals, by registering real clinical data on the fly as they are acquired by the hospital imagers. This step involves entering a more complex clinical protocol with strong guarantees on data privacy protection. Security cannot be neglected at any level at this point.

7. Conclusion and future work

The Medical Data Manager service presented in this paper is an important milestone for enabling medical image processing applications on a grid infrastructure. Its main strengths are:
• It accesses medical databases without interfering with clinical practice. Data are kept on clinical sites and transparently transferred to the grid only when needed.
• It exposes standard interfaces to other grid services. The MDM is fully integrated in the gLite middleware.
• It ensures a high level of security to preserve patients' privacy.

The MDM prototype was successfully deployed and tested in a controlled computing environment. The next step will be interfacing to medical imagers inside hospitals. This will require simplifying the installation and configuration procedures as much as possible. The core MDM development is not finished yet and additional functionalities will be included to enrich the service. In particular, the abstraction layer depicted in figure 1 will soon be available. Applications will then be able to retrieve 3D volume files rather than single slices. In addition, metadata are expected to be distributed over the different clinical sites where data are acquired, rather than being centralized as is the case in our testbed. This configuration will be more acceptable to the clinical world, as it keeps control of the hospital data on site. It will require deploying several AMGA servers on different sites and exposing a centralized query service able to retrieve data from these different servers.

Acknowledgments

We are grateful to the EGEE European IST project for providing resources and support to this service development.



References

[1] R. Acharya, R. Wasserman, J. Sevens, and C. Hinojosa. Biomedical Imaging Modalities: a Tutorial. Computerized Medical Imaging and Graphics, 19(1):3–25, 1995.
[2] K.P. Andriole, R.L. Morin, R.L. Arenson, J.A. Carrino, B.J. Erickson, S.C. Horii, D.W. Piraino, B.I. Reiner, J.A. Seibert, and E. Siegel. Addressing the Coming Radiology Crisis: The Society for Computer Applications in Radiology SCAR Transforming the Radiological Interpretation Process (TRIP) initiative. Journal of Digital Imaging, 17(4):235–243, December 2004.
[3] C. Barillot, R. Valabregue, J.P. Matsumoto, F. Aubry, H. Benali, Y. Cointepas, O. Dameron, M. Dojat, E. Duchesnay, B. Gibaud, S. Kinkingnéhun, D. Papadopoulos, M. Pélégrini-Issac, and E. Simon. NeuroBase: Management of Distributed and Heterogeneous Information Sources in Neuroimaging. In Distributed Database and processing in Medical Image Computing workshop (DiDaMIC’04), Saint Malo, France, September 2004.
[4] I. Blanquer Espert, V. Hernández García, and J.D. Segrelles Quilis. Creating Virtual Storages and Searching DICOM Medical Images through a GRID Middleware based in OGSA. Journal of Clinical Monitoring and Computing, 19(4-5):295–305, October 2005.
[5] D. Budgen, M. Turner, I. Kotsiopoulos, F. Zhu, K. Bennett, P. Brereton, J. Keane, P. Layzell, M. Russell, and M. Rigby. Managing healthcare information: the role of the broker. In Healthgrid’05, Oxford, UK, April 2005.
[6] DICOM: Digital Imaging and COmmunications in Medicine. http://medical.nema.org/.
[7] M.H. Ellisman, C. Baru, J.S. Grethe, A. Gupta, M. James, B. Ludaescher, M.E. Martone, P.M. Papadopoulos, S.T. Peltier, A. Rajasekar, S. Santini, and I.N. Zaslavsky. Biomedical Informatics Research Network: An Overview. In Healthgrid’05, Oxford, UK, April 2005.
[8] C. Germain, V. Breton, P. Clarysse, Y. Gaudeau, T. Glatard, E. Jeannot, Y. Legré, C. Loomis, I.E. Magnin, J. Montagnat, J.-M. Moureaux, A. Osorio, X. Pennec, and R. Texier. Grid-enabling medical image analysis. Journal of Clinical Monitoring and Computing, 19(4-5):339–349, October 2005.
[9] S. Hastings, S. Oster, S. Langella, T.M. Kurc, T. Pan, U.V. Catalyurek, and J.H. Saltz. A Grid-based image archival and analysis system. Journal of the American Medical Informatics Association, 12:286–295, January 2005.
[10] H.K. Huang. PACS: Picture Archiving and Communication Systems in Biomedical Imaging. Hardcover, 1996.
[11] J. Montagnat, F. Bellet, H. Benoit-Cattin, V. Breton, L. Brunie, H. Duque, Y. Legré, I.E. Magnin, L. Maigne, S. Miguet, J.-M. Pierson, L. Seitz, and T. Tweed. Medical images simulation, storage, and processing on the European DataGrid testbed. Journal of Grid Computing, 2(4):387–400, December 2004.
[12] J. Montagnat, V. Breton, and I.E. Magnin. Using grid technologies to face medical image analysis challenges. In Biogrid’03, proceedings of the IEEE CCGrid03, Tokyo, Japan, May 2003.
[13] N. Santos and B. Koblitz. Metadata services on the grid. In Advanced Computing and Analysis Techniques, Berlin, Germany, May 2005.
[14] L. Seitz, J.M. Pierson, and L. Brunie. Key management for encrypted data storage in distributed systems. In IEEE Security in Storage Workshop (SISW), Washington DC, USA, October 2003.
[15] L. Seitz, J.M. Pierson, and L. Brunie. Encrypted storage of medical data on a grid. Methods of Information in Medicine, 44(2), 2005.



Grid Scheduling for Interactive Analysis

Cécile Germain-Renaud a,1, Romain Texier a, Angel Osorio b and Charles Loomis c
a Laboratoire de Recherche en Informatique
b Laboratoire d’informatique et de mécanique pour les sciences de l’ingénieur
c Laboratoire de l’Accélérateur Linéaire

Abstract. Grids are facing the challenge of moving from batch systems to interactive computing. In the 70s, standalone computer systems met this challenge, and this was the starting point of pervasive computing. Meeting this challenge will allow grids to be the infrastructure for ambient intelligence and ubiquitous computing. This paper shows that EGEE, the world's largest grid, does not yet provide the services required for interactive computing, but that it is amenable to this evolution through relatively modest changes to the middleware. A case study on medical image analysis exemplifies the particular needs of ultra-short jobs.

Keywords. Medical Image Analysis, Grid Middleware, Scheduling

1. Introduction

In the 70s, the transition from batch systems to interactive computing was the enabling tool for the widespread diffusion of advances in IC technology. Grids are facing the same challenge. The exponential coefficients in network performance [7] enable the virtualization and pooling of processors and storage. In the field of biomedical applications, widespread diffusion of grid technology might require seamless integration of grid power into everyday use.

In the more specific area of medical image processing, algorithms often involve a visual evaluation or exploration of the results. In some cases (e.g. rigid registration of multimodal images of the same patient), algorithms are sufficiently automatic to be executed remotely without interaction, with the results sent to the user for visualization. In other cases, such as inter-subject registration, it may be necessary to use the anatomical knowledge of the user to better define the expected result (anatomical correspondences between cortical areas in the brain are loosely defined). In such a case, the interaction may be limited to an alternation of independent distant computations and user correction requests, but a soft real-time interaction would be much more interesting. A last class of image processing algorithms, such as pre-operative planning, deeply involves the user and requires at least soft real-time to be really useful.

However, the need for fast turnaround time on the grid is not limited to medical image analysis. It encompasses all situations involving a display-action loop, ranging from a test-and-debug process or the exploration of databases, to computational steering through virtual/augmented reality interfaces, as well as portal access to grid resources, or complex and partially local workflows. A critical system requirement is thus the need to move grids from exclusively batch-oriented processing to general-purpose processing, including interactive tasks. Section 2 of this paper provides experimental evidence of the reality of this need, from the analysis of the activity of a segment of the EGEE grid heavily used by its biomedical community, the biomed VO.

The next question is then a strategy to support interactive jobs on a grid. Virtual machines provide a powerful new layer of abstraction in distributed computing environments [5,2]. The freedom of scheduling and even migrating an entire OS and its associated computations considerably eases the coexistence of deadline-bound short jobs and long-running batch jobs. However, a production grid is a prerequisite for potential biomedical and clinical users. One of the goals of the AGIR project [3] is to interact with production grids in order to define and implement the new grid services required by medical image analysis, with the EGEE grid as an important target. Section 3 of this paper presents some advances towards this goal, in the area of grid scheduling. The EGEE execution model is not based on such virtual machines, so the scheduling issues must be addressed through the standard middleware components, the broker and the local schedulers. We demonstrate that QoS and fast turnaround time can be supported by a production grid.

1 Correspondence to: Cécile Germain-Renaud, LRI, Université Paris-Sud. E-mail: [email protected]

2. EGEE usage

The current use of EGEE makes a strong case for specific support for short jobs. Through the analysis of the LB log of a broker, we can provide quantitative data to support this claim. The broker logged is grid09.lal.in2p3.fr, running successive versions of LCG; the trace covers one year (October 2004 to October 2005), with 66 distinct users and more than 90000 successful jobs, all production. This trace provides both the job's intrinsic execution time t (evaluated as the timestamp of event 10/LRMS minus the timestamp of event 8/LRMS), and the makespan m, that is the time from submission to completion (evaluated as the timestamp of event 10/LogMonitor minus the timestamp of event 17/UI). The intrinsic execution time might be overestimated if the sites where the job is run accept concurrent execution.

Fig. 1 shows the histogram of intrinsic execution times. The striking fact is the very large number of extremely short jobs. We call Short Deadline Jobs (SDJ) those where t < 10 minutes, and Medium Jobs (MJ) those with t between ten minutes and one hour. SDJ consume nearly 20% of the total execution time, in the same range as jobs with t less than one hour (17%).

Fig. 2 plots the overhead ratio o_r as a function of the execution time t. The overhead ratio is formally defined as o_r = (m − t)/t. It is the ratio of the overhead o = m − t (the time spent "in the system") to the actual execution time t. The overhead has two components:
• Queuing time, which depends on the jobs submitted by other grid users. It is a scheduling policy issue.
• Middleware penalty: the various delays incurred along a job's lifecycle because of the job management system, that is, the cost of traversing the middleware protocol stack. Here, the issue is the efficiency of the middleware implementation.
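As a small worked illustration of these definitions, the sketch below computes the overhead ratio from a makespan and an execution time, and classifies a job by its execution time. The function names and the label "long" for jobs above one hour are assumptions; the thresholds (10 minutes, one hour) come from the text.

```python
def overhead_ratio(makespan_s, exec_s):
    """Overhead ratio (m - t) / t, with both times in seconds."""
    if exec_s <= 0:
        raise ValueError("execution time must be positive")
    return (makespan_s - exec_s) / exec_s


def job_class(exec_s):
    """SDJ: t < 10 min; MJ: 10 min to 1 h; 'long' is our own label."""
    if exec_s < 600:
        return "SDJ"
    if exec_s <= 3600:
        return "MJ"
    return "long"
```

For instance, a job that runs for one minute but completes eleven minutes after submission has an overhead ratio of (660 − 60)/60 = 10.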


Figure 1. Distribution of execution times (x-axis: execution time in seconds; y-axis: number of jobs, logarithmic scale)

Figure 2. The overhead ratio as a function of execution time - Execution time in seconds, overhead ratio dimensionless (see text for explanation)

The two components are orthogonal: even with a perfect middleware, if, for instance, the jobs were served on a first-come-first-served basis, a job would be queued (and thus have to wait) until all its predecessors had been served (note that the EGEE basic scheduling scheme is more complicated). Thus, limiting the delays created by these two components must be addressed separately, as shown in the next section. However, the first information provided by fig. 2 is that, for SDJ, the overhead is often many orders of magnitude greater than t. This is a strong deterrent to grid-enabling SDJ. For MJ, the overhead is of the same order of magnitude as t. Thus, the EGEE service for SDJ is seriously insufficient. One could argue that bundling many SDJ into one MJ could lower the overhead. However, interactivity will not be reached, because results will also come in a bundle: for graphical interactivity, the result must obviously be pipelined with visualization; in the test-debug-correct cycle, there may be too few jobs to bundle.



With respect to grid management, an interactivity situation translates into a QoS requirement: just as video rendering or music playing requires special scheduling on a personal computer, or video streaming requires network differentiated services, servicing SDJ requires a specific grid guarantee, namely a small bound on the makespan, which is usually known as a deadline in the framework of QoS.

3. Scheduling for interactivity

3.1. A Scheduling Policy for SDJ

Deadline scheduling usually relies on breaking the allocation of resources into quanta: time slices for a processor, or packet slots for network routing. For job scheduling, the problem is a priori much more difficult, because jobs are not partitionable: except for checkpointable jobs, a job that has started running cannot be suspended and restarted later. Condor [6] has pioneered migration-based environments, which provide such a feature transparently, but deploying constrained suspension in EGEE would be much too invasive with respect to the existing middleware. Thus, SDJ should not be queued at all, which seems to be incompatible with the most basic mechanism of grid scheduling policies.

The EGEE scheduling policy is largely decentralized: all queues are located on the sites, and the actual time scheduling is enacted by the local schedulers. Most often, these schedulers do not allow time-sharing (except for monitoring). The key to servicing SDJ is to allow controlled time-sharing, which transparently applies kernel multiplexing to jobs, through a combination of processor virtualization and permanent slot reservation. The SDJ scheduling system has two components.
• A local component, composed of dedicated queues and a configuration of the local scheduler. Technical details for MAUI can be found at [11]. It ensures that:
∗ SDJ are executed immediately if resources are available.
∗ The delay incurred by batch jobs has a fixed multiplicative bound.
∗ The policy is work-conserving, implying that resource usage is not degraded, e.g. by idling processors.
∗ The policies governing resource sharing (VOs, EGEE and non-EGEE users, ...) are not impacted.
• A global component, composed of job typing and a mapping policy at the broker level. While it is easy to ensure that SDJ are directed to resources accepting SDJ, LCG and gLite do not provide the means to prevent non-SDJ jobs from using the SDJ queues, and this requires a minor modification of the EGEE Workload Management System.

For the local component, the first question is to prove correctness. Extensive experiments have been conducted on the EGEE cluster at LAL. Fig. 3(a) shows a case where three kinds of jobs are allowed to run concurrently: batch, SDJ, and dteam. On a dual-processor machine, at most two jobs of each kind actually run, which ensures bounded delay. Fig. 3(b) gives the overall site view; the intended limitation of SDJ-dedicated resources (10 running jobs maximum) is achieved.
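The "never queue an SDJ" rule of the local component can be sketched as a toy admission function. This is not the MAUI configuration referred to above, only an illustration under assumptions: the class `Site`, its fields and the returned status strings are hypothetical, and only the per-site cap on running SDJ (10 in the experiment) comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Toy model of one site: SDJ run at once or are refused, never queued."""
    sdj_limit: int = 10          # cap on concurrently running SDJ
    running_sdj: int = 0
    batch_queue: list = field(default_factory=list)

    def submit(self, job_id, is_sdj):
        if not is_sdj:
            # Batch jobs follow the usual path and wait in the site queue.
            self.batch_queue.append(job_id)
            return "queued"
        if self.running_sdj < self.sdj_limit:
            # A reserved virtual slot is free: start immediately.
            self.running_sdj += 1
            return "running"
        # No free SDJ slot: reject rather than queue.
        return "rejected"

    def finish_sdj(self):
        self.running_sdj = max(0, self.running_sdj - 1)
```

A rejected SDJ leaves the decision to the submitter, which can resubmit the work as a normal job or fall back to local computation.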

Figure 3. Local scheduling (a) on one machine (dual-processor) (b) on a site (x-axis: time in seconds; y-axis: number of jobs; curves: all jobs, sdj, batch, dteam)

For the global component, the long-term technical solution would be a modification of the Glue Schema. This schema is the information model currently used by EGEE, Open Science Grid, and many other grid projects. In this schema, the target of a job is a Computing Element (CE), which so far is mainly a site queue. Thus, a new CE attribute (e.g. QueueAttribute) should be created with the following function: publishing that this queue accepts SDJ jobs and only SDJ jobs. However, the operational use of the Glue schema as a common ground for interoperability between international grids makes its evolution a long process (even if the request can be expected to be satisfied in the medium term, because the requirement for this category of attribute meets other ones of the same type, for instance for MPI jobs). Thus, a short-term solution has been set up: on one hand, a boolean attribute (SDJ) is created in the JDL; on the other hand, a CE dedicated to SDJ must have a name suffixed by ".sdj"; the user interface translates the boolean attribute into a regular expression in the JDL requirement RegExp("*sdj$",other.GlueCEUniqueID); finally, the WMS interprets the lack of this requirement as a prescription not to direct a job to the sdj-suffixed CEs. These features will be integrated in gLite 3.2.

It must be noticed that no explicit user reservation is required: seamless integration also means that explicit advance reservation is no more applicable than it would be for accessing a personal computer or a video-on-demand service. In the most frequent case, SDJ will run under the best-effort Linux scheduling policy (SCHED_OTHER); however, if hard real-time constraints must be met, this scheme is fully compatible with preemption (SCHED_FIFO or SCHED_RR policies). In any case, the limits on resource usage (e.g. as enforced by Maui) implement access control; thus the job might be rejected. The WMS notifies the application of the rejection, and the application can decide on the most adequate reaction, for instance submission as a normal job or switching to local computation.

3.2. User-level scheduling

Considering the grid middleware penalty for the submission, scheduling and mapping of jobs, one cannot reasonably hope to reach the order of a second, which would be needed for ultra-small jobs such as those considered in the next section. With the most recent and tuned EGEE middleware (gLite 3.0), the middleware penalty remains in the order of minutes. In the gPTM3D project [4], we have shown that an additional layer of user-level scheduling provides a solution which is fully compatible with the EGEE organization of sharing.
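The short-term mapping rule of Section 3.1 (SDJ go only to sdj-suffixed CEs, and other jobs never do) can be sketched as a filter over CE identifiers. This is an illustration, not the WMS matchmaking code: the function `eligible_ces` and the CE identifiers in the usage below are hypothetical. Note that the pattern is written `sdj$` here because Python's `re` module does not accept a leading `*`, unlike the JDL expression quoted in the text.

```python
import re

# Matches CE identifiers that end in "sdj".
SDJ_CE = re.compile(r"sdj$")


def eligible_ces(job_is_sdj, ce_ids):
    """Keep sdj-suffixed CEs for SDJ, and only the other CEs otherwise."""
    return [ce for ce in ce_ids if bool(SDJ_CE.search(ce)) == job_is_sdj]
```

With the made-up identifiers `["ce01.example.org/batch", "ce01.example.org/short.sdj"]`, an SDJ would be matched only against the second CE and a normal job only against the first.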



Figure 4. gPTM3D architecture

4. gPTM3D

4.1. Interactive Volume Reconstruction

PTM3D [9] is a fully featured DICOM image analyzer developed at LIMSI. PTM3D transfers, archives and visualizes DICOM-encoded data; besides moving independently along the usual three axes, the user is able to view the cross-section of the DICOM image along an arbitrary plane and to move it. PTM3D provides computer-aided generation of three-dimensional representations from CT, MRI, PET-scan, or echography 3D data. A reconstructed volume (organ, tumor) is displayed inside the 3D view. The reconstruction also provides the volume measurement required for therapeutic decisions. The system currently runs on standard PC computers and is used online in radiology centres. The clinical motivation for grid-enabled volume reconstruction is described in [4].

The first step in grid-enabling PTM3D (gPTM3D) is to speed up compute-intensive tasks, such as the volume reconstruction of the whole body used in percutaneous nephrolithotomy planning. The volume reconstruction module has been coupled with EGEE with the following results:
• the overall response time is compatible with user requirements (less than 2 minutes), while the sequential time on a 3GHz, 2MB memory PC is typically 20 minutes.
• the local interaction scheme (stop, restart, improve the segmentation) remains strictly unmodified.

This first step has implemented fine-grain parallelism and data-flow execution on top of a large-scale and file-oriented grid system. The architecture based on Application Level Scheduler/Worker agents shown in fig. 4 is fully functional on EGEE. The Interaction Bridge (IB) acts as a proxy between the PTM3D workstation, which is not EGEE-enabled, and the EGEE world. When opening an interactive session, the PTM3D workstation connects to the IB; in turn, the IB launches a scheduler and a set of workers on an EGEE node, through fully standard requests to an EGEE User Interface; a stream is established between the scheduler and the PTM3D front-end through the IB. When the actual volume reconstruction is required, the scheduler receives contours; the Scheduler/Worker agents follow a pull model, each worker computing one slice of the reconstructed volume at a time and sending it back to the scheduler, which forwards the slices to the IB, from where they finally reach the front-end. The next step will be to implement a scheme where the IB and the scheduler cooperate to respectively define and enforce a soft real-time schedule.

User-level scheduling has been proposed in many other contexts, and a case for it has been made in the AppLeS [1] project. In a production grid framework, the Dirac [10] project has proposed a permanent grid overlay where scheduling agents pull work from a central dispatching component. Our work differs from Dirac in two respects. First, the scheduling and execution agents are launched just as any EGEE job, and are thus subject to all regulations related to sharing: typically, they are SDJ, and thus will be aborted if they exceed the limits of this type of job. Second, they work in connected mode, more like glogin-based applications [8].

4.2. Grid-enabling Image Exploration

In the previous section, the grid was used only as a provider of computing power, while the data were located on the front-end. Sharing data is a well-known need for algorithmic research, and this is true for clinical research as well. We have started the process of extending PTM3D toward accepting remote data access. The integration of gPTM3D with the Medical Data Management scheme presented in another paper is the final goal. However, at the present time, we consider a more restrictive scheme, which uses the internal format of PTM3D images, where the slices are bundled in a 3D file (bdi and bdg formats). In this context, the main issue is access latency. The ongoing work targets adaptation to the user activity, mainly through interactive selection of the image resolution and the region of interest.
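The pull model used by the Scheduler/Worker agents in Section 4.1 can be sketched with threads standing in for grid workers. This is a minimal sketch under assumptions: `run_session` and `reconstruct_slice` are hypothetical names, the per-slice computation is a placeholder (a sum over the contour values), and the network streams between scheduler, Interaction Bridge and front-end are not modelled.

```python
import queue
import threading


def reconstruct_slice(contour):
    """Placeholder for the real per-slice reconstruction."""
    return sum(contour)


def run_session(contours, n_workers=4):
    """Scheduler side: publish one task per slice, collect results as they come."""
    tasks = queue.Queue()
    results = queue.Queue()
    for index, contour in enumerate(contours):
        tasks.put((index, contour))

    def worker():
        # Pull model: each worker asks for the next slice when it is free.
        while True:
            try:
                index, contour = tasks.get_nowait()
            except queue.Empty:
                return
            results.put((index, reconstruct_slice(contour)))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()

    # Slices arrive in completion order; the scheduler would forward each
    # one as soon as it is received. Here they are stored by slice index.
    volume = [None] * len(contours)
    for _ in range(len(contours)):
        index, value = results.get()
        volume[index] = value
    for thread in threads:
        thread.join()
    return volume
```

With a pull model, a slow worker simply takes fewer slices, so no static partition of the volume among workers is needed.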

5. An architecture for grid interactivity

The scheme described in the previous sections virtualizes the resources at the coarse grain of batch versus short deadline jobs. An open issue is scheduling across SDJ. Consider for instance a portal, where many users ask for a continuous stream of SDJ executions. This situation can be modelled with the so-called (period, slice) model used in soft real-time scheduling, where a fraction (slice) of each period of time should be allocated to each user in order to keep that user satisfied. To be coherent with a software architecture based on VOs, global regulation of SDJ should be left to the implementation of sharing policies (ultimately implemented by site schedulers). However, it is the responsibility of the provider of a particular service to arbitrate between its users. The Interaction Bridge described in the previous section is the adequate location for this arbitration. Figure 5 describes the resulting architecture.
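As an illustration of how an arbiter such as the Interaction Bridge might use the (period, slice) model, the sketch below admits a new user only if the total requested fraction of time stays within the available capacity. The text does not specify an admission rule; the utilization test used here is a common one for periodic reservations, and the names `utilization` and `admit` as well as the `capacity` parameter are assumptions.

```python
def utilization(reservations):
    """Sum of slice/period over all users; reservations maps user -> (period, slice)."""
    return sum(slice_s / period_s for period_s, slice_s in reservations.values())


def admit(reservations, user, period_s, slice_s, capacity=1.0):
    """Add the reservation if total utilization stays within capacity."""
    if not 0 < slice_s <= period_s:
        raise ValueError("slice must be positive and no longer than the period")
    trial = dict(reservations)
    trial[user] = (period_s, slice_s)
    if utilization(trial) <= capacity:
        reservations[user] = (period_s, slice_s)
        return True
    return False
```

For example, a user asking for 5 seconds out of every 10 and another asking for 1 second out of every 4 together use 0.75 of one processor, so a third request for 5 seconds out of every 10 would be refused with `capacity=1.0`.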

Acknowledgements This work was partially funded by ACI Masses de Données AGIR. We thank Fabrizio Pacini of EGEE JRA1 for his help with the Glue schema specification.



[Figure 5 diagram labels: User Interface, TP, Interaction Bridge, Broker, JSS, CE, Cluster Scheduler, Node. Annotations: task prioritization; matchmaking; permanent reservation on virtual processors, transparent when unused.]

Figure 5. A two-level scheduling architecture

References

[1] H. Casanova, G. Obertelli, F. Berman, and R. Wolski. The AppLeS parameter sweep template: user-level middleware for the grid. In Procs 2000 ACM/IEEE conference on Supercomputing (CDROM), 2000.
[2] A. Bavier et al. Operating Systems Support for Planetary-Scale Network Services. In Procs. 1st Symp. on Networked System Design and Implementation (NSDI '04), 2004.
[3] C. Germain, V. Breton, P. Clarysse, Y. Gaudeau, T. Glatard, E. Jeannot, Y. Legré, C. Loomis, J. Montagnat, J-M Moureaux, A. Osorio, X. Pennec, and R. Texier. Grid-enabling medical image analysis. Journal of Clinical Monitoring and Computing, 19(4-5):339–349, 2005. Extended version of the BioGrid 2005 paper.
[4] C. Germain, R. Texier, and A. Osorio. Exploration of Medical Images on the Grid. Methods of Information in Medicine, 44(2):227–232, 2005.
[5] B. Lin and P. A. Dinda. VSched: Mixing Batch And Interactive Virtual Machines Using Periodic Real-time Scheduling. In SC '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, 2005.
[6] M. J. Litzkow, M. Livny, and M. W. Mutka. Condor: A hunter of idle workstations. In 8th International Conference on Distributed Computing Systems, pages 104–111. IEEE Computer Society Press, 1988.
[7] L. G. Roberts. Beyond Moore's law: Internet growth trends. Computer, 3(1):117–119, 2000.
[8] H. Rosmanith and D. Kranzlmuller. glogin - A Multifunctional, Interactive Tunnel into the Grid. In Procs 5th IEEE/ACM Int. Workshop on Grid Computing (GRID'04), 2004.
[9] V. Servois, A. Osorio, J. Atif et al. A new PC based software for prostatic 3D segmentation and volume measurement. Application to permanent prostate brachytherapy (PPB) evaluation using CT and MR images fusion. InfoRAD 2002 - RSNA'02, 2002.
[10] A. Tsaregorodtsev, V. Garonne, and I. Stokes-Rees. DIRAC: A Scalable Lightweight Architecture for High Throughput Computing. In Procs 5th IEEE/ACM Int. Workshop on Grid Computing (GRID'04), 2004.
[11] SDJ WG wiki site. http://egee-na4.ct.infn.it/wiki/index.php/ShortJobs.



Magnetic Resonance Imaging (MRI) Simulation on EGEE Grid Architecture: A Web Portal Design F. BELLET, I. NISTOREANU, C. PERA, H. BENOIT-CATTIN CREATIS, UMR CNRS #5515, U 630 Inserm Université Claude Bernard Lyon, INSA, Lyon Bât. B. Pascal, 69621 Villeurbanne, FRANCE

Abstract. In this paper, we present a web portal that enables the simulation of MRI images on the grid. The simulations are performed with the SIMRI MRI simulator, which is implemented on the grid using MPI and the LCG2 middleware. MRI simulations are mainly used to study MRI sequences and to validate image processing algorithms. As MRI simulation is computationally very expensive, grid technologies offer real added value for this task. Nevertheless, grid access should be simplified to enable end users to run MRI simulations. That is why we developed this web portal, which offers a user-friendly interface for MRI simulation on the grid. The web portal is designed using a three-layer client/server architecture. Its main component is the process layer, which manages the simulation jobs. This part is mainly based on a Java thread that scans a database of simulation jobs. The thread submits the new jobs to the grid and updates the status of the running jobs. When a job terminates, the thread sends the simulated image to the user. Through a client web interface, the user can submit new simulation jobs, get a detailed status of the running jobs, and view the history of all the terminated jobs together with their status and corresponding simulated image.

Keywords. Web portal, Grid application, MRI simulation

1. Introduction

The simulation of Magnetic Resonance Imaging (MRI) is an important counterpart to MRI acquisitions [1]. Simulation is naturally suited to acquiring a theoretical understanding of the complex MR technology. It is used as an educational tool in medical and technical environments [2]. By offering an analysis independent of the multiple parameters involved in the MR technology, MRI simulation permits the investigation of artifact causes and effects [3]. Simulation may also help in the development and optimization of MR sequences [4]. Finally, an MRI simulator provides a useful tool for assessing image processing techniques [5], since it generates realistic 3D images from perfectly known virtual medical objects. The SIMRI simulator is a recent advanced 3D MRI simulator [1] that integrates in a single simulator most of the simulation features that are offered by different


simulators. It takes into account the main static field value (Figure 1) and enables realistic simulations of the chemical shift artifact, including off-resonance phenomena. It also simulates the artifacts linked to static field inhomogeneity, such as those induced by susceptibility variation within an object (Figure 2). It is implemented in C and distributed under the CeCILL public license. MRI sequences are programmed using high-level C functions with a simple programming interface. To manage large simulations, the magnetization kernel is parallelized, which enables simulation on a PC grid architecture [6] using a standard Message Passing Interface (MPI) API.

Figure 1. 256x256 simulated brain image at 1.5 T with SIMRI: Spin Echo sequence (TE=25 ms, TR=500 ms, BW=25.6 kHz).

As simulation of the MR physics is computationally very expensive, a parallel implementation is mandatory to achieve performance compatible with the target applications [4]. As an example, it takes 12 hours to simulate a 512² image on a recent PC. This time has to be multiplied by 16 for a 1024² image. In 3D, simulation of a 512³ volume would require 100 years of CPU use! Thanks to the linearity of the main computation task, the simulation job can be distributed easily, with little communication between nodes during simulation [6] and consequently good scalability. As a consequence, the computation time is reduced in proportion to the number of available computation nodes. In this context, by offering virtually unlimited computing power, grid technologies offer real added value for the MRI simulation task [7]. Nevertheless, grid access should be simplified to enable end users to run MRI simulations. That is why we developed a specific web portal offering a user-friendly interface for MRI simulation on the grid.
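The linearity property mentioned above can be illustrated with a toy example. The sketch below uses a deliberately simplified 1D signal model (a sum of complex exponentials, one per spin); it is an assumption made for illustration and is not the SIMRI magnetization kernel. It shows that splitting the spins among nodes, computing a partial signal per node, and summing the partial signals gives the same result as a single computation, with only one reduction step at the end.

```python
import cmath


def signal(spins, k_values):
    """Toy 1D signal: S(k) = sum over spins of m * exp(-i * k * x).

    `spins` is a list of (position x, magnetization m) pairs.
    """
    return [sum(m * cmath.exp(-1j * k * x) for x, m in spins) for k in k_values]


def distributed_signal(spins, k_values, n_nodes):
    """Split the spins among n_nodes, compute partial signals, then sum them.

    Each chunk is independent; the only communication is the final sum.
    """
    chunks = [spins[i::n_nodes] for i in range(n_nodes)]
    partials = [signal(chunk, k_values) for chunk in chunks]
    return [sum(column) for column in zip(*partials)]


spins = [(0.1 * i, 1.0 + 0.01 * i) for i in range(20)]
k_values = [0.0, 0.5, 1.0]
full = signal(spins, k_values)
split = distributed_signal(spins, k_values, n_nodes=4)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, split)))
```

In a real MPI run the partial signals would be computed on different nodes and combined with a reduction; here they are computed sequentially to keep the example self-contained.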



Figure 2. Illustration of the susceptibility artifact around an air bubble in water with a static field of 7 T. 256x256 simulated image obtained with SIMRI and a Spin Echo sequence (TE=20 ms, TR=1000 ms, BW=20 kHz).

This paper presents the MRI simulation web portal (Simri@Web) that we developed. We present the general architecture of the portal, and we detail the data layer, the process layer and the client user interface.

2. The Simri@Web Web portal

2.1. Functionalities and technical environment

The aim of the web portal is to hide the grid middleware from end users and to provide new functionalities. We target the following:
- Access to all simulation parameters: at least 10 parameters, including the sequence name, the sequence parameters, the image size, the main field value…
- Access to two simulation targets: the EGEE grid and the local cluster of our lab.
- A personal account with authentication and a history of all the simulation jobs with the corresponding simulated images, the status history of terminated jobs and the status of running jobs.
- Delivery of the simulated images by email.

Concerning the technical environment, the jobs must be submitted to EGEE using the LCG2 grid middleware and to the local cluster using the PBS batch manager. The web portal runs on an Apache v2.0.54 web server with the PHP5 module and the libraries libssh2.so and mysql.so. For the data layer (see section 2.3) we use MySQL v.4, and for the process layer (see section 2.4) Java 1.4.2 with the class libraries jsch.jar and mysql-connector-java-3.jar. Finally, note that the SIMRI code is compiled with the MPI library.


[Figure 3 diagram labels: Client, Web server, Simri JobServer, Java thread, Targeted platform; arrows: job add, job submission.]

Figure 3. Illustration of a job's 'travel' within the three-layer architecture. The Simri job server corresponds to the data layer; the web server and the Java thread are the two parts of the process layer. The client represents the presentation layer.

2.2. Architecture overview

The web portal architecture is a client/server architecture divided into three layers of services (Figure 3):
- The presentation layer, which includes the client graphical interface, some local processing to check the user inputs, and some data display.
- The process layer, which is in charge of the application processes, including simulation job management and dynamic web page generation.
- The data layer, which manages all the data. This layer is called whenever data access is required.

2.3. Data Layer

This layer is a MySQL database server that must guarantee the persistence of all the application data, such as the data relating to the users and to the simulation jobs. This layer is the only communication gateway (Figure 3) between the two parts of the process layer defined below. The corresponding database is defined by five tables (Figure 4):
- The user table contains all the personal user data, such as their email, their labs (…) and their access rights (yes/no) to the cluster and the grid. Access is granted or denied by an administrator user.
- The job tables (one for the cluster and one for the grid) contain a job id, the user id, the associated simulated image name, and the start and stop times of the job, which are updated by the process layer.
- The job parameter table contains all the simulation parameters associated with a job.
- The job status table contains, for each job id, all the statuses reached by the job.
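The five tables listed above can be sketched as follows. This is a minimal illustration only: the portal uses MySQL, whereas the example uses Python's built-in `sqlite3` module so that it runs without a server, and every table and column name below is hypothetical (the actual schema is the one shown in Figure 4).

```python
import sqlite3

# Hypothetical table and column names; the real schema is given in Figure 4.
SCHEMA = """
CREATE TABLE users (
    user_id        INTEGER PRIMARY KEY,
    email          TEXT NOT NULL,
    lab            TEXT,
    cluster_access INTEGER NOT NULL DEFAULT 0,  -- yes/no, set by an administrator
    grid_access    INTEGER NOT NULL DEFAULT 0   -- yes/no, set by an administrator
);
CREATE TABLE cluster_jobs (
    job_id     INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    image_name TEXT,
    start_time TEXT,  -- updated by the process layer
    stop_time  TEXT   -- updated by the process layer
);
CREATE TABLE grid_jobs (
    job_id     INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    image_name TEXT,
    start_time TEXT,
    stop_time  TEXT
);
CREATE TABLE job_parameters (
    job_id INTEGER NOT NULL,
    target TEXT NOT NULL,   -- 'cluster' or 'grid'
    name   TEXT NOT NULL,   -- e.g. sequence name, image size, main field value
    value  TEXT NOT NULL
);
CREATE TABLE job_status (
    job_id     INTEGER NOT NULL,
    target     TEXT NOT NULL,
    status     TEXT NOT NULL,
    reached_at TEXT NOT NULL  -- one row per status reached by the job
);
"""


def create_schema(conn):
    """Create the five tables on an open database connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
print(tables)
```

Keeping one row per status in `job_status` is what allows the portal to show the full status history of a job rather than only its latest state.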

Figure 4. The five tables of the data layer.

2.4. Process Layer

The process layer is the core of the web portal. It contains all the application logic. In our case, the process layer dynamically generates the web pages, collects all the information, manages the connection to the target platform and the job submission, and collects the simulated images. We chose to split this layer into two separate parts: one dedicated to user management and one to job management.

2.4.1. Process Layer: User management

The user management (UM) process layer corresponds to the server side that runs on Apache. It is written in PHP. Each time a user connects to the portal, this layer starts a specific session, checks the user identity in the data layer, and offers the user a personal space composed of three main pages. The first one concerns the submission of a new job (Figure 6). On this page, the UM layer collects the data from the client side to fill the data layer (Figure 3). The two other pages concern the running jobs (Figure 7) and the ended jobs. These two pages are dynamically generated by the UM layer and filled with the data collected from the data layer.



[Figure 5 diagram: the JobGridThread reads the submitted jobs (id, status) from the Simri JobServer. For each submitted job, it queries the job status on the grid UI at in2p3.fr (edg-job-status id). If the status is new, it updates the job status table. If the status is done, it retrieves the output (edg-get-output id), copies the image (scp) to the image folder, mails the user, and updates the grid jobs table. It then closes the connections.]

Figure 5. Illustration of the grid Java thread chronogram. The thread interacts with the data layer to get the new jobs, and it interacts with the grid to submit the jobs, get their status and get the simulated images.

2.4.2. Process Layer: Job Management

The job management (JM) process layer is written in Java. It handles job submission, job status tracking and retrieval of the simulated images. The JM layer is composed of two Java threads: one dedicated to the interaction with the local cluster based on PBS, and one to the interaction with the EGEE grid based on LCG2. Figure 5 gives the chronogram of the grid Java thread. Each time it wakes up, it consults the data layer to get the new jobs and submits any it finds. It then gets the status of all the submitted jobs and updates the job status in the data layer accordingly (Figure 3). For each terminated job, it gets the simulated image and emails the corresponding user a web link to the image.

This split of the process layer is very efficient. PHP's limitations are avoided, as it is only used for the UM, where it is very efficient. Indeed, all possible timeouts while communicating with the grid are handled by the Java thread and not by the PHP web scripts. Consequently, the users are never affected by such problems. The web server is only used for the UM. All the code linked to the simulation process is located in the threads and is easy to maintain.
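The wake-up cycle of the job management thread can be sketched as below. This is an illustration under stated assumptions, not the portal's Java code: jobs are plain dictionaries standing in for rows of the data layer, `FakeGrid` is a hypothetical stand-in for the grid commands (submission, status query, output retrieval), and the status names are invented for the example.

```python
class FakeGrid:
    """Hypothetical stand-in for the grid: a job is 'done' at its second status poll."""

    def __init__(self):
        self._polls = {}

    def submit(self, params):
        grid_id = "job-%d" % len(self._polls)
        self._polls[grid_id] = 0
        return grid_id

    def status(self, grid_id):
        self._polls[grid_id] += 1
        return "running" if self._polls[grid_id] < 2 else "done"

    def fetch_output(self, grid_id):
        return grid_id + ".img"


def poll_once(jobs, grid, notify):
    """One wake-up of the job management thread.

    Submits new jobs, refreshes the status of submitted ones, and for each
    job that has just finished, fetches the image and notifies its owner.
    """
    for job in jobs:
        if job["status"] == "new":
            job["grid_id"] = grid.submit(job["params"])
            job["status"] = "submitted"
            job["history"].append("submitted")
        elif job["status"] not in ("done", "failed"):
            status = grid.status(job["grid_id"])
            if status != job["status"]:
                # Only new statuses are recorded, as in the chronogram.
                job["status"] = status
                job["history"].append(status)
            if status == "done":
                job["image"] = grid.fetch_output(job["grid_id"])
                notify(job["user"], job["image"])


jobs = [{"user": "user@example.org", "params": {"size": 256}, "status": "new", "history": []}]
grid, mails = FakeGrid(), []
for _ in range(4):  # four wake-ups of the thread
    poll_once(jobs, grid, lambda user, image: mails.append((user, image)))
print(jobs[0]["history"], mails)
```

Because grid calls happen only inside `poll_once`, a slow or failing grid command delays the next wake-up but never blocks the code that serves web pages, which is the benefit of the split described above.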



Figure 6. Job submission client interface. The user can choose the target platform (local cluster or EGEE grid).

2.5. Presentation layer: Client user interface

With this architecture of three independent layers, the client side is very light. It manages only the presentation of the application, plus a small amount of application logic to check the user inputs. Consequently, the client side is a simple web browser plus some logic written in JavaScript. The web browser displays only five different pages: the registration page, the connection page, the job submission page (Figure 6), the running jobs page (Figure 7) and the ended jobs page. These pages are dynamically generated by the UM layer.

Figure 7. Running jobs client interface and a window that gives the status history of a terminated job.



3. Conclusion

The Simri@Web portal has been in operation since September 2005. At the moment, it is open only to the 6 people involved in the SIMRI project. After 300 simulations, we observed a job failure rate on the grid of about 20%. This rate is mainly due to inhomogeneous MPI installations on a few computing elements (clusters) of the grid, and probably to some poor scheduling policies at the grid level. These problems have been reported and are under investigation. At the moment, we do not allow the user to choose the number of requested MPI nodes, because this number has a direct impact on the number of available computing elements of the grid. Indeed, within LCG2 an MPI job cannot span several computing elements. We therefore fix the number of nodes to a modest value (12), which leaves more computing elements available and, we hope, gives the user a quicker simulation result.

Such a web interface corresponds well to the type of interface wanted by the end users, who appreciate the masking of the middleware and batch manager, the user account, and the complementary services. Nevertheless, we plan to develop a new web portal architecture that would use web service technologies and the gLite middleware that has recently been chosen for the EGEE grid. We target a versatile and open architecture in order to add new simulation targets to the portal more easily, such as the CINES¹ machines, and to add other MRI simulation codes, such as the one linked to the susceptibility effect [8]. Finally, this architecture will integrate a data management service to store the high-value simulated images with their corresponding simulation parameters.

4. Acknowledgement

This work falls within the scientific topics of the PRC-GdR ISIS research group of the French National Center for Scientific Research (CNRS). This work is supported by the European EGEE Project and by the French ministry for research ACI-GRID MEDIGRID project. This work has also been funded by the INSA Lyon French Engineering School.

5. References

[1] H. Benoit-Cattin, G. Collewet, B. Belaroussi, H. Saint-Jalmes, and C. Odet, "The SIMRI project: A versatile and interactive MRI simulator," Journal of Magnetic Resonance, vol. 173, pp. 97-115, 2005.
[2] G. Torheim, P. A. Rinck, R. A. Jones, and J. Kvaerness, "A simulator for teaching MR image contrast behavior," MAGMA, vol. 2, pp. 515-522, 1994.
[3] M. B. E. Olsson, R. Wirestam, and B. R. R. Persson, "A Computer-Simulation Program For MR Imaging - Application to RF and Static Magnetic-Field Imperfections," Magnetic Resonance in Medicine, vol. 34, pp. 612-617, 1995.
[4] A. R. Brenner, J. Kürsch, and T. G. Noll, "Distributed large-scale simulation of magnetic resonance imaging," Magnetic Resonance Materials in Biology, Physics, and Medicine, vol. 5, pp. 129-138, 1997.
[5] R. K. S. Kwan, A. C. Evans, and G. B. Pike, "MRI simulation-based evaluation of image-processing and classification methods," IEEE Trans. Medical Imaging, vol. 18, pp. 1085-1097, 1999.
[6] H. Benoit-Cattin, F. Bellet, J. Montagnat, and C. Odet, "Magnetic Resonance Imaging (MRI) simulation on a grid computing architecture," Presented at IEEE CGIGRID'03 - BIOGRID'03, Tokyo, 2003.
[7] J. Montagnat, F. Bellet, H. Benoit-Cattin, V. Breton, L. Brunie, H. Duque, Y. Legré, I. E. Magnin, L. Maigne, S. Miguet, J. M. Pierson, L. Seitz, and T. Tweed, "Medical images simulation, storage, and processing on the European DataGrid testbed," Journal of Grid Computing, vol. 2, pp. 387-400, 2004.
[8] S. Balac, H. Benoit-Cattin, T. Lamotte, and C. Odet, "Analytic solution to boundary integral computation of susceptibility induced magnetic field inhomogeneities," Mathematical and Computer Modelling, vol. 39, pp. 437-455, 2004.

¹ www.cines.fr



Towards A Virtual Laboratory for fMRI Data Management and Analysis

Silvia D. Olabarriaga a,*, Aart J. Nederveen b, Jeroen G. Snel b and Robert G. Belleman a
a Informatics Institute, University of Amsterdam
b Academic Medical Center of the University of Amsterdam

Abstract. Functional Magnetic Resonance Imaging (fMRI) is a popular tool used in neuroscience research to study brain activation due to motor or cognitive stimulation. In fMRI studies, large amounts of data are acquired, processed, compared, annotated, shared by many users and archived for future reference. As such, fMRI studies have characteristics of applications that can benefit from grid computation approaches, in which users associated with virtual organizations can share high performance and large capacity computational resources. In the Virtual Laboratory for e-Science (VL-e) Project, initial steps have been taken to build a grid-enabled infrastructure to facilitate data management and analysis for fMRI. This article presents our current efforts for the construction of this infrastructure. We start with a brief overview of fMRI, and proceed with an analysis of the existing problems from a data management perspective. A description of the proposed infrastructure is presented, and the current status of the implementation is described with a few preliminary conclusions.

Keywords. medical image analysis, grid computing, functional MRI, virtual organizations, IT for large population studies

1. Introduction

Functional magnetic resonance imaging (fMRI) is a popular tool used in neuroscience research to study brain function. In fMRI studies, large amounts of data are acquired, processed, compared, annotated and stored for future reference. The users of fMRI (in particular psychologists, psychiatrists, radiologists etc.) typically have limited technical background in computing and, as such, face several difficulties in organizing their workflow comfortably and efficiently based on individual solutions available for personal computers. Computational resources with higher capacity and performance are needed to properly address the needs of these users. The Virtual Laboratory for e-Science (VL-e) Project¹ has taken initial steps to build a grid-enabled infrastructure to facilitate data management and analysis for fMRI studies. This article presents our current efforts for the construction of this infrastructure. We start with a brief overview of fMRI (section 2), and proceed with an analysis of the existing problems from a data perspective (section 3). The proposed infrastructure is described in section 4, followed by a brief discussion and preliminary conclusions in section 5.

* Correspondence to: Silvia D. Olabarriaga, Kruislaan 403, 1068 SJ, Amsterdam. Tel.: +31 20 525 7549; E-mail: [email protected]
¹ http://www.vl-e.nl/

2. fMRI at a Glance

fMRI enables the study of brain activation in a non-invasive manner. The basic idea is to scan a subject while he/she is subjected to brain stimulation through a physical or cognitive activity. Depending on the type of sensory stimulus (visual, auditory, motor, etc.) or cognitive task, the neuronal activity increases in different parts of the brain, and a haemodynamic response occurs. In simple terms, the active region receives oxygen-rich blood, and the changes in the oxygenation level can be measured with MRI. An fMRI scanning session produces a series of 3D datasets (volumetric images) containing measurements along time, some obtained during stimulation and some at "rest". These images are subsequently analysed to determine the location of activated areas (refer to [1] for details). First, the 3D volumes in the time series are aligned to each other to compensate for artefacts introduced by temporal sampling and motion. Additionally, filters are applied to reduce noise and normalise the measurements. Next, statistical analysis is performed to correlate the measured signal with the stimulation pattern. This step generates a statistical map, which is again submitted to statistical analysis to detect activation based on an adaptive threshold. The final result of the analysis is an activation map that can be further analysed to determine the location and size of activation clusters or activated regions. Activation maps are overlaid on an additional high-resolution structural scan for visual inspection of activation with respect to the anatomy. Instead of a structural scan of the same subject, a reference brain can be used, e.g. the Montreal Neurological Institute average brain [2]. fMRI is largely used in neuroscience studies, for example, to characterize brain function in populations.
Often the activated regions detected in different scans are compared, which requires additional image registration steps for their alignment to a common coordinate system. A future perspective is that fMRI will also be used in a broad range of situations in diagnosis, prognosis and treatment planning. Examples include aids to detect anomalies, prediction of functional damage due to trauma, or planning of neurosurgery [3]. Finally, the acquired data and metadata (results, annotations) are typically shared in large multi-centre studies, which are becoming increasingly popular for the characterization of large populations in neurosciences or for the evaluation of new healthcare procedures.
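The statistical step described in this section (correlating the measured signal with the stimulation pattern, then thresholding the resulting map) can be sketched in a few lines. The example below is a strong simplification made for illustration: it uses a plain Pearson correlation per voxel and a fixed threshold, whereas the text above refers to an adaptive threshold, and it skips alignment and filtering entirely. The function names and the sample data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def statistical_map(time_series, stimulus):
    """Correlate each voxel's time series with the stimulation pattern."""
    return {voxel: pearson(series, stimulus) for voxel, series in time_series.items()}


def activation_map(stat_map, threshold=0.5):
    """Mark voxels whose correlation reaches the threshold.

    Simplification: a fixed threshold replaces the adaptive one.
    """
    return {voxel: value >= threshold for voxel, value in stat_map.items()}


stimulus = [0, 0, 1, 1, 0, 0, 1, 1]          # rest / stimulation blocks
time_series = {
    "voxel_a": [10, 11, 20, 21, 10, 9, 19, 22],   # follows the stimulus
    "voxel_b": [10, 10, 10, 10, 10, 10, 10, 10],  # constant signal
    "voxel_c": [20, 21, 10, 11, 20, 21, 10, 9],   # opposite to the stimulus
}
print(activation_map(statistical_map(time_series, stimulus)))
```

Each voxel is processed independently of the others, which is one reason this kind of analysis lends itself to parallel execution.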

3. Data-related Issues in fMRI Studies From a data perspective, fMRI studies involve data acquisition, storage, analysis and shared access. Note that “data” here refers to a large variety of information



and measurements acquired or generated during an fMRI study, being heterogeneous by nature. Examples of data are scanned images (functional, structural), information about the applied stimuli, parameters of the acquisition protocol, subject characterization (e.g., age, gender, pathology), results of the analysis (e.g., statistical and activation maps, locations of activated regions) and interpretation (e.g., annotations). These data are generated by different types of physically dispersed equipment and image analysis utilities, requiring a significant amount of time and effort for adequate management. Such effort is likely to increase as the amount of data grows in response to developments in scanning techniques, analysis methods, and collaborative and multi-centre research. Below we discuss a few of the problems encountered.

Data acquisition in this discussion is restricted to the images and associated signals recorded during an fMRI scanning session, involving a collection of equipment in complex experiments². Experiment design is based on prior knowledge about brain function and imaging protocols, which is typically accumulated and shared informally by researchers and practitioners in the field. One of the difficulties is to have access to resources with structured information about experimental design (e.g., databases of acquisition protocols or stimuli). By keeping documentation of experimental procedures, such resources could facilitate the validation of experiments, the design of new experiments and the standardization of existing ones. Experiment control involves synchronizing image acquisition (by a scanner) with stimulation (by a stimulus computer), as well as recording images and other signals (e.g. electroencephalography, EEG). At the end of an fMRI session, all data (images, stimuli, signals, etc.) are gathered from the acquisition equipment and exported to a remote storage resource.
The acquisition equipment is often dispersed, heterogeneous, and located behind a hospital firewall. In a clinical setting, the images are stored in a Picture Archival and Communication System (PACS), while other data are manually transported using some physical medium (DVD, memory stick) or sftp. In a typical research setting, the images are also manually transported to the external storage resource for further analysis. Data gathering could be facilitated by connecting all data acquisition equipment at the scanning site directly and securely to a (remote) storage resource that can store all the collected data.

Data storage for fMRI studies presents three main difficulties. First, large storage capacity is needed, since studies involve many instances (typically above 20) of large datasets (500MB to 1GB per scanning session). Second, the storage system should be flexible enough to accommodate heterogeneous data types such as images, signals, and metadata. Although the adoption of a PACS would be the natural choice in a medical environment, current systems are still limited with respect to data capacity, image format, and storage of non-pixel data. And finally, when patient data is used in these studies, high demands are imposed on data confidentiality. Not only is a secure connection required, but all identity information must also be stripped from the data before they leave the scanning site.

² In fact, data and metadata are collected during the whole lifetime of a study.

Data analysis in fMRI involves applying complex and computation-intensive image processing methods to large amounts of data. It can take more than one hour to complete the analysis of one fMRI session on the workstations that are typically available to researchers in their home institutions. Normally the analysis is executed as a post-processing step, and the results become available a long time after the scanning session. If a problem is detected during the analysis (e.g., motion artefacts), a new scanning session must be scheduled. The use of high performance computing (HPC) resources could help to reduce latency and enable interactive inspection of data quality while the subject is still at the scanning site. Additional problems are faced when running image analysis at large scale, for example, when the study includes a large number of subjects or when performing parameter optimisation. The complete analysis in these cases can take days on typical workstations. HPC capacity would be beneficial here for achieving higher throughput by parallel execution of independent tasks (e.g., analysis of individual scans). Moreover, the logistics of data and computational resources require much effort to guarantee proper error handling, enough storage space for intermediate results, proper data conversion and transfer, proper parameter settings, etc. The researchers involved in fMRI usually do not have enough technical knowledge to set up an infrastructure to perform reliable, efficient and secure image analysis. These users could benefit from sharing a common IT infrastructure for large-scale fMRI studies, in which the workflow can be automated.

Data access and sharing are challenging issues because fMRI studies are performed by groups of users associated with the same institution or with multiple centres. Moreover, a growing trend in neuroscience is to share the acquired data and generated metadata (results) with other researchers after the study is completed [4]. In this manner data can be reused, experiments can be reproduced or repeated with different settings, or results can be used for meta-analysis.
The following issues characterize the demands for shared data access in this context. First, multiple and physically dispersed sites are involved, requiring remote and secure access to (distributed) data. Second, it is necessary to control and monitor access to data, respecting strict data privacy policies that may be different per site or study. Third, large amounts of data are involved, requiring mechanisms for efficient retrieval such as query based on metadata. And finally, data should be archived for long periods of time, requiring extremely large and permanent storage capacity. The scenario described above indicates that a proper IT infrastructure is fundamental to accomplish fMRI studies successfully. Table 1 summarizes the different problems faced for data management in fMRI studies and the challenges posed to the construction of an adequate infrastructure.
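The confidentiality requirement discussed in this section (identity information must be stripped before data leave the scanning site) can be illustrated with a small sketch. It assumes that the metadata of a scan is available as a dictionary and replaces the identifying fields with a keyed pseudonym, so that scans of the same subject remain linkable within one study. The field names, the `SUBJ-` prefix and the use of HMAC-SHA256 are assumptions made for the example, not a description of any existing tool.

```python
import hashlib
import hmac

# Hypothetical names of the identifying fields in a scan's metadata.
IDENTITY_FIELDS = {"name", "birth_date", "patient_id"}


def pseudonym(patient_id, study_key):
    """Derive a stable pseudonym from the patient id and a secret per-study key."""
    digest = hmac.new(study_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "SUBJ-" + digest[:12]


def strip_identity(record, study_key):
    """Return a copy of the record without identity fields, plus a pseudonym."""
    clean = {key: value for key, value in record.items() if key not in IDENTITY_FIELDS}
    clean["subject"] = pseudonym(record["patient_id"], study_key)
    return clean


record = {
    "name": "Jane Doe",
    "patient_id": "12345",
    "birth_date": "1970-01-01",
    "age": 36,
    "scan": "session1_run1",
}
print(strip_identity(record, study_key=b"secret key kept at the scanning site"))
```

Using a different key per study gives the same subject unrelated pseudonyms in different studies, while whoever holds the key at the scanning site can still recompute the mapping when needed.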

4. A Virtual Laboratory for fMRI: VL-f The construction of an IT infrastructure addressing the challenges in Table 1 obviously requires technical knowledge that is beyond the scope of neuroscience, and perhaps also beyond what could be accomplished with traditional computing paradigms. The characteristics of this application indicate, on the other hand, that it could benefit from grid computing approaches [5] for the following reasons.



Table 1. Difficulties for data management in fMRI and associated IT challenges.

Characteristics                    Challenges
Acquisition:
  Complex experiments              Share and reuse experiment design
  Multiple equipments              Access dispersed and heterogeneous systems
Storage:
  Many large data instances        Large storage capacity
  Heterogeneous data types         Flexible storage system
  Patient data                     Data confidentiality
Analysis:
  Computation-intensive analysis   HPC for throughput
  Interactive response             HPC for real-time computation
  Large scale processing           Logistics of data and resources
Shared Access:
  Multiple centres                 Remote access to distributed data
  Many users                       Controlled access to (confidential) data
  Large amounts of data            Query based on metadata
  Data Archival                    Long term storage/retrieval

First, fMRI studies are data intensive, since large amounts of data are stored, analysed and manipulated. Second, they require high throughput computation on demand for real-time image analysis and for large scale studies. Finally, collaboration and distributed computing are essential, in particular for multi-centre studies, in which data is acquired, analysed and shared from different locations. This application therefore requires coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations (VOs), which is the goal of grid technologies [6].

A virtual laboratory for fMRI (VL-f) is under development in the scope of the VL-e project to address some of the challenges listed in Table 1. The goal is to construct a shared computational infrastructure with hardware, software and services to efficiently, reliably and securely perform (large scale) fMRI studies. The following specific goals will be pursued:
1. Facilitate data gathering at the scanning site, by providing homogeneous access to the acquisition equipment.
2. Facilitate data storage and archival, by providing access to large capacity and long-term storage resources.
3. Enable high data analysis throughput, by providing access to HPC resources to perform parallel analysis of mutually independent data.
4. Facilitate the data logistics in (large scale) fMRI studies, by providing tools to automate the workflow (data gathering, removal of subject identity, data format conversion, and image analysis).
5. Provide remote data access via an interactive interface to the storage resource from workstations located anywhere.
6. Enable secure data sharing, by providing mechanisms for controlling access to the data for users and groups.
7. Facilitate data retrieval, by providing infrastructure for generation of metadata and query mechanisms based on metadata.

Figure 1. Computational resources of the VL-f.

Below we present a description of the resources (section 4.1), the ideal use scenario pursued by VL-f (section 4.2), the plans for the first pilot implementation (section 4.3) and its current status (section 4.4).

4.1. VL-f Resources

The simplified scheme in Figure 1 presents the hardware and software resources of VL-f, which are distributed among scanning, research and services sites.

Scanning sites are the locations where a scanner and other acquisition devices are installed, normally in the radiology department of a hospital. These devices are connected to each other directly, and possibly also to others, via an internal network (intranet) protected by a firewall. They are accessible only from workstations located in their physical vicinity (e.g., examination room). Some workstations have access to public networks (e.g., the internet); these are called grid access nodes (GANs) in the proposed scheme. In the first phase of VL-f, scanning is performed at the radiology department of the Amsterdam Medical Center (AMC), which includes a Philips 3T Intera MRI scanner, a stimulus computer, and a GAN.

Research sites are locations where the (neuro)scientist interacts with the data from a workstation. In the first phase of VL-f, the research sites are located at several departments of the University of Amsterdam (UvA), the Free University (VU), and private computing facilities (e.g. at home). Workstations based on Windows or Linux platforms will be supported.



Services sites provide compute and storage resources. Although an open grid-oriented architecture will be adopted, in the first phase of VL-f only the grid resources provided by the VL-e Proof-of-Concept Environment (VL-e PoC) are used. These resources are provided by SARA Computing and Networking Services³ and consist of computing elements and on- and near-line storage elements. Access to the hardware facilities and software services will be granted to authorized users associated with one or more VOs.

The computing elements are accessed via a user interface machine linked to the European DataGrid (EDG). In the first phase, only the Matrix cluster at SARA will be available for the fMRI VO. This cluster consists of 36 IBM x335 nodes equipped with dual Xeon 3.06GHz processors, 2GB memory, and 2 local EIDE disks of 120GB. The nodes are connected with 1 Gb/s Ethernet. Other computers and clusters located at other VL-e partners will be added in the future (e.g., National Institute for Nuclear Physics and High Energy Physics, NIKHEF).

Storage resources include on-line storage and tape silos for off-line, permanent and unlimited storage space. The data is transported automatically between the on- and off-line storage systems based on usage patterns. Data integrity and accessibility are provided by automatic and periodic back-ups. The on-line storage resource consists of the Storage Resource Broker (SRB⁴) system, which provides a seamless interface to store and retrieve data and metadata across a wide area network [7].

4.2. Ideal Scenario

The ideal functional scenario pursued by VL-f is illustrated in Figure 2. When a scanning session is completed, data from the scanner and stimulus computer are gathered into a single workstation at the scanning site. Identity information is removed from the images with an application that additionally provides aids to control pseudonyms and real identities in the context of individual or multiple studies.
The user schedules the transfer of identity-free data to the SRB, performing SRB authentication with a Grid Security Infrastructure (GSI⁵) certification protocol. Metadata encapsulated by the file format (e.g., DICOM) is automatically associated with the data upon upload. The data are transferred to the SRB, and the user is notified via e-mail when and where (URI) the data have been successfully stored.

The user schedules data conversion (images, stimuli data) from a workstation at the research site, indicating that the source and destination files are stored in the SRB. Grid and SRB authentication are used to enable access to the storage and the VL-e PoC computational resources, where the data conversion job is performed. The user is notified via e-mail when the conversion is complete and where the results have been stored.

Several software packages can be chosen for image analysis, for example FSL (FMRIB Software Library [8]) and SPM (Statistical Parametric Mapping [9]). The user configures the image analysis parameters from a workstation at the research site, indicating that the input and output files are stored in the SRB. Authentication (SRB, Grid) takes place, and the analysis job is scheduled to run at the VL-e PoC computing resources. The user is notified when the analysis is completed, with a link to the results in the SRB.

³ http://www.sara.nl/
⁴ http://www.sdsc.edu/srb
⁵ http://www.globus.org/security/overview.html

Figure 2. Overview of VL-f functional components.

The LCG-2 grid middleware⁶ is used for running jobs to perform data conversion and image analysis on the VL-e PoC. Job submission is performed via a web service that encapsulates the functionality of EDG command-line utilities. Jobs are retried a given number of times, and permanent fault conditions are reported both to the user (researcher) and to technical support (image analysis or grid specialist, depending on the problem).

At any time, and from any workstation at the research sites, the user can perform the following operations via interactive and intuitive clients and/or web portals:

• monitor and control the status of scheduled jobs for data transfer, data conversion and image analysis,
• browse data and inspect their content with html, text, image, and other specialized viewers,
• transfer data between the SRB and the local workstation,
• manage files and control data access permissions for individuals and groups, on a permanent or temporary basis,
• edit metadata stored in the SRB metadata catalogue, and
• retrieve data with queries based on metadata.

⁶ http://lcg.web.cern.ch/LCG/activities/middleware.html

Finally, tools for workflow automation enable the user to combine and schedule at once tasks such as data pseudonymisation, data transfer to the SRB, data conversion, image analysis and metadata generation.

4.3. First Pilot

A minimal but complete subset of the ideal functionality described in section 4.2 was selected for implementation in the first pilot. Existing software is used as much as possible to enable rapid development. Issues such as optimisation, development of intuitive GUIs, cross-platform functionality and workflow automation were left for a later phase.

Data gathering is performed in a workstation that has access to the file system of the scanner and the stimulus computer as a remote drive or directory. Pseudonymisation, if necessary, is performed by an application that simply removes identity information from 4-D images stored in the PAR/REC Philips Medical Systems proprietary file format.

The data is transferred into the SRB using inQ⁷, a browser for Windows platforms, which only supports password authentication. inQ is also used for browsing data on the SRB, controlling access for groups and individuals, transferring data between the SRB and the workstation, general file management, metadata management and query.

Image analysis is performed with FSL and consists of a sequence of customized steps implemented by command-line utilities (binary code and scripts). These steps have been encapsulated in FSL by a tcl script that takes parameters from a single configuration file. Some parameters are used to control image analysis, while others indicate the location of input data (complete file paths to images and stimuli data) and output results. The analysis results consist of several files (images, text) that are stored in a single directory with a given name. A summary report in html generated by FSL facilitates browsing the results.
For proper execution in the VL-e PoC, the FSL script needs to be wrapped into a higher-level component that handles files stored in the SRB and provides adequate user notification. The input files are automatically downloaded to the local file system of the computing node prior to running the original script, and the results are uploaded when it completes. Error handling, which in the original script is limited to writing messages to stderr, must be extended to also notify the user and the technical support. The analysis is started manually, using a job submission client running on the workstation at the research site. Lists of jobs (e.g., for several scans) can be submitted at once, in which case the analysis is performed in parallel on the available computing elements.

Data conversion facilities are limited to the formats required by the FSL utilities. For images, conversion is performed from PAR/REC to NIFTI-1⁸ format.

⁷ http://www.sdsc.edu/srb/inQ/inQ.html
⁸ See http://www.fmrib.ox.ac.uk/fsl/fsl/formats.html.
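The wrapping component described in section 4.3 (stage the inputs, run the original script, upload the results, notify on failure) can be illustrated with a minimal Python sketch. This is not the VL-f implementation: the function name `run_wrapped_analysis` and the callables `fetch`, `store` and `notify` are hypothetical stand-ins for the SRB download, the SRB upload and the e-mail notification, and the analysis command is passed in rather than assumed.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def run_wrapped_analysis(
    command: Sequence[str],
    inputs: Sequence[str],
    fetch: Callable[[str, Path], None],   # hypothetical: download one input into a directory
    store: Callable[[Path], None],        # hypothetical: upload the result directory
    notify: Callable[[str, str], None],   # hypothetical: notify(recipient, message)
) -> bool:
    """Stage inputs locally, run the analysis command, store results, notify."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        try:
            # 1. Download the input files to the local file system of the node.
            for name in inputs:
                fetch(name, workdir)
            # 2. Run the original script unchanged, capturing its output streams.
            result = subprocess.run(
                list(command), cwd=workdir, capture_output=True, text=True
            )
            if result.returncode != 0:
                tail = result.stderr.strip()[-500:]
                raise RuntimeError(f"exit code {result.returncode}: {tail}")
            # 3. Upload the results once the script has completed.
            store(workdir)
        except Exception as exc:
            # 4. Errors go to the user and, with details, to technical support.
            notify("user", "analysis failed; technical support has been informed")
            notify("support", f"analysis failed: {exc}")
            return False
    notify("user", "analysis completed; results stored")
    return True
```

Keeping the storage and notification steps behind plain callables means the original script stays untouched, which matches the pilot's aim of reusing existing software as much as possible.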



4.4. Current Status

The implementation of this pilot is in its early stage. Images can be transferred to the SRB directly, but the data from the stimulus equipment still need manual intervention. Only MS Windows-based workstations are supported for interactive access to the SRB with inQ. Scommands⁹ are used to upload and download data from the scripts that perform data manipulation autonomously. The data conversion and image analysis tasks have been combined into one script that is executed as a single job. Jobs are scheduled and monitored via existing utilities that offer an interactive GUI for EDG job submission. These utilities must be executed on the EDG user interface machine. Finally, no explicit metadata facilities or specialized data viewers are present, except for those already implemented by the SRB and inQ.

5. Discussion and Conclusions

Other attempts have been reported to provide grid-enabled infrastructures for medical applications. Montagnat et al. present in [10] several medical applications that can benefit from grid technology by using parallel and distributed computing for higher throughput and lower latency. Rogulin et al. describe in [11] the Mammogrid project, which uses grid technology to integrate and provide access to databases of mammograms for computer-aided diagnosis. Barillot et al. describe in [12] the Neurobase project, which uses grid technology for the integration and sharing of heterogeneous sources of information in neuroimaging. Rex et al. present in [13] the LONI Pipeline Processing Environment, a cross-platform, distributed environment for the design, distribution, and execution of image analysis for neuroimaging applications. It has a visual programming interface where a large repertoire of components can be combined to perform the desired image analysis steps. A dataflow model is adopted to support a parallel processing architecture, which enables simultaneous execution of multiple tasks.
These attempts have emphasised either information or computation aspects. Our efforts, however, are focused on the construction of an infrastructure that addresses both aspects transparently, efficiently and robustly. The infrastructure proposed in section 4 has the potential to alleviate to a large extent the problems presented in section 3 because it provides large and long term storage capacity, remote and controlled access to distributed and heterogeneous data, facilities for metadata storage and query, access to HPC resources, and workflow automation. This potential remains to be confirmed when the implementation is completed and evaluated from the perspective of the end users.

The few experiences with the pilot already indicate that the implementation of VL-f will be a challenging task, and that several issues should be properly addressed before the infrastructure can be considered useful. First, error detection and notification are typically poor in legacy software, being limited to messages written to files (stdout, stderr) or a return code. It would be desirable to clearly notify failure and success in a compact manner to more efficiently guide the user in the inspection of results. The relevance of such a feature will increase proportionally with the scale and automation level of workflows. Second, more intuitive tools are needed to submit, monitor and control job execution, in particular for large numbers of jobs. We are currently investigating Nimrod-G [14] as an alternative for large scale job submission. Third, it is important to provide simple means to request grid services (for job submission) from any workstation at the research sites. We are planning to develop platform-independent clients that will communicate with the EDG web service implemented at the VL-e PoC to submit and control/monitor large numbers of jobs. This service is compliant with the Web Services Resource Framework (WSRF) architecture, which is also under consideration for implementing functionality such as data conversion and image analysis. And finally, workflow automation must be improved, for example, by integrating the VL-f functionality into the Distributed Workflow Management System currently in use at the AMC [15].

Note that the proposed infrastructure does not include explicit mechanisms for strict management of data confidentiality, besides removing identity information and controlling access to the data via SRB authentication. Although the current strategy may be insufficient for handling patient data, it seems adequate for a large number of research studies in which the subjects are volunteers. Constructing an adequate and useful IT infrastructure, even if for a limited scope of fMRI studies, is the goal of our current efforts in VL-f.

⁹ http://www.sdsc.edu/srb/scommands/index.html
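As an illustration of the compact success and failure notification called for in the discussion above, the following Python sketch condenses the return codes and error streams of a batch of jobs into a short report. It is only a sketch under assumptions: the `JobOutcome` record and the `summarise` function are hypothetical names, not part of VL-f or of any grid middleware.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class JobOutcome:
    name: str          # hypothetical job label, e.g. one per scan
    return_code: int   # exit status reported by the legacy tool
    stderr: str = ""   # captured error stream, possibly empty


def summarise(outcomes: Iterable[JobOutcome]) -> str:
    """Build a one-line-per-failure report from raw job outcomes."""
    jobs = list(outcomes)
    failed = [job for job in jobs if job.return_code != 0]
    lines = [f"{len(jobs) - len(failed)} of {len(jobs)} jobs succeeded"]
    for job in failed:
        text = job.stderr.strip()
        # Only the last error line is reported, to keep the summary compact.
        last = text.splitlines()[-1] if text else "no message"
        lines.append(f"FAILED {job.name} (exit code {job.return_code}): {last}")
    return "\n".join(lines)
```

With such a summary the user inspects only the failed jobs, which becomes more valuable as the number of jobs in a workflow grows.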

Acknowledgements

This work was carried out in the context of the VL-e project. This project is supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W) and is part of the ICT innovation program of the Ministry of Economic Affairs (EZ).

References

[1] S.M. Smith. Overview of fMRI analysis. The British Journal of Radiology, 77:S167–S175, 2004.
[2] A.C. Evans, D.L. Collins, R.R. Mills, E.D. Brown, R.L. Kelly, and T.M. Peters. 3D statistical neuroanatomical models from 305 MRI volumes. In Proceedings of the Nuclear Science Symposium and Medical Imaging Conference, volume 3, pages 1813–1817. IEEE, 1993.
[3] E.J. Vlieger et al. Functional magnetic resonance imaging for neurosurgical planning in neuro-oncology. Eur Radiology, 14:1143–1153, 2004.
[4] J.D. Van Horn, S.T. Grafton, D. Rockmore, and M.S. Gazzaniga. Sharing neuroimaging studies of human cognition. Nature Neuroscience, 7(5):473–481, 2004.
[5] I. Foster and C. Kesselman. Computational grids. Communications of the ACM, 35(6):44–52, 1998.
[6] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. International J. Supercomputer Applications, 15(3), 2001.
[7] C. Baru, R. Moore, A. Rajasekar, and M. Wan. The SDSC storage resource broker. In CASCON’98 Conference, 1998.
[8] S.M. Smith et al. Advances in functional and structural MR image analysis and implementation as FSL. NeuroImage, 23:208–219, 2004.
[9] K.J. Friston. Statistical parametric mapping and other analysis of functional imaging data. In Brain Mapping: The Methods, pages 363–385. Academic Press, 1996.
[10] J. Montagnat et al. Medical Images Simulation, Storage, and Processing on the European DataGrid Testbed. Journal of Grid Computing, 2:387–400, 2004.
[11] D. Rogulin et al. A grid information infrastructure for medical image analysis. In Proceedings of DiDaMIC Workshop (MICCAI’2004), 2004.
[12] C. Barillot et al. Neurobase: Management of distributed and heterogeneous information sources in neuroimaging. In Proceedings of DiDaMIC Workshop (MICCAI’2004), 2004.
[13] D.E. Rex, J.Q. Ma, and A.W. Toga. The LONI pipeline processing environment. NeuroImage, 19:1033–1048, 2003.
[14] R. Buyya, D. Abramson, and J. Giddy. Nimrod-G Resource Broker for Service-Oriented Grid Computing. IEEE Distributed Systems Online, 2(7), 2001.
[15] J.G. Snel, S.D. Olabarriaga, J. Alkemade, H.G. van Andel, A.J. Nederveen, C.B. Majoie, G.J. den Heeten, M. van Straten, and R.G. Belleman. A distributed workflow management system for automated medical image analysis and logistics. Accepted to IEEE-CMBS, special track on Grids for Biomedical Informatics, 2006.



Service-oriented Architecture for Grid-enabling Medical Applications

Anca BUCUR, René KOOTSTRA, Jasper van LEEUWEN and Henk OBBINK
Philips Research, High Tech Campus 31, 5656 AE, Eindhoven, the Netherlands
{anca.bucur, rene.kootstra, jasper.van.leeuwen, henk.obbink}@philips.com

Abstract. Grid technologies have the potential to enable healthcare organizations to efficiently use powerful tools, applications and resources, many of which were so far inaccessible to them. This paper introduces a service-oriented architecture meant to Grid-enable several classes of computationally intensive medical applications for improved performance and cost-effective access to resources. We apply this architecture to fiber tracking [1,2], a computationally intensive medical application suited for parallelization through decomposition, and carry out experiments with various sets of parameters, in realistic environments and with standard network solutions. Furthermore, we deploy and assess our solution in a hospital environment, at the Amsterdam Medical Center, as part of our cooperation in the Dutch VL-e project. Our results show that parallelization and Grid execution may bring significant performance improvements and that the overhead introduced by making use of remote, distributed resources is relatively small.

Keywords. Grid computing, service-oriented architecture, fiber tracking, performance, speedup

Introduction

The Grid provides transparent, ubiquitous, scalable and secure access to large amounts of various resources anywhere and at any time. Grid technology may provide organizations like medical institutions with powerful tools through which they can gain coordinated access to resources otherwise inaccessible to them. This technology also has the potential to enable new applications that were not possible before, for example applications that require high-performance or high-throughput computational power, or a large number of various resources usually not available at one site. Healthcare organizations may use Grids to share resources, to access remote resources, but also to become service providers.

We propose a service-oriented architecture that enables computationally intensive medical applications to use Grid technologies for improved performance and cost-effective access to computational resources. We first assessed different types of applications that could benefit from Grid technology, and distinguished three distinct classes of such applications for which we developed a generic architecture [3]. Next, we choose several relevant medical applications fitting the identified classes and apply our architecture to “gridify” these applications (parallelize them and enable them to use grid resources and technology).

As a first case study, we apply our GAMA architecture to fiber tracking, a computationally intensive medical application suited for parallelization through decomposition. We run this application in a Grid environment and carry out performance measurements for various sets of parameters. Our end-to-end solution allows for the remote exploitation of powerful Grid resources while preserving the current way of working in the hospital, i.e. the use of Grid resources is transparent for the users of the application.

In fiber tracking algorithms, one of the computationally intensive elements is the number of starting points used to “search” for fibers that satisfy the multi-ROI (region of interest) search criterion. In this Grid-enabled application we set the algorithm to simply use all voxels of the brain as starting points (full volume fiber tracking). Our results described in this paper show that for this application parallelization and Grid execution bring significant performance improvements. The additional communication overhead introduced by making use of distributed, remote resources is relatively small, with standard network solutions, and does not preclude the cost and performance benefits of deploying Grid-based applications.

1. GAMA Overview

In this section we first briefly introduce the application types targeted by our architecture. Next, we describe the general architecture and reflect on its potential benefits and on the reasons behind our choices.

1.1. Target applications

In previous research [3], we have selected a number of computationally challenging medical applications. Their clinical use is currently hampered by the unavailability of sufficient computational power. The medical imaging applications that we have analysed are all suited for parallelization through decomposition and fit into three distinct classes of decomposition patterns that allow them to exploit parallelism in three ways: computational, domain and functional decomposition. Their algorithms exhibit a significant degree of spatial locality in the way they access memory as well as time locality in the sequence of operations that are performed. These applications manipulate large amounts of data, but the communication entailed by their parallel solution is low enough not to preclude (Grid-based) distributed execution. We assume that once gridified their bandwidth requirements should allow for the use of standard network solutions available to healthcare organizations.

It is expected that the sizes of the data sets will increase in time, requiring increasingly powerful computational resources. Furthermore, with the availability of increasing computational power and wide access to (geographically) distributed resources it may also be expected that new applications, not possible before, will emerge and some existing applications will become more relevant.

1.2. Architecture

In this context, our goal is to provide a generic architecture suitable to host a wide range of applications. Next to improving the performance of each individual application, the architecture should be scalable relative to the data volume, and should allow changes in the computational algorithm with a minimum of changes in the algorithmic structure and preferably no change at all in the architecture itself.

By studying the common and differentiating features of the application classes that we selected, we were able to design a generic Grid architecture that can be applied to applications fitting at least one of the above mentioned decomposition patterns, and that enables us to gridify these applications, i.e. parallelize them and let them make use of resources available on the Grid. The three decomposition patterns we selected can all be modeled within the general framework depicted in Figure 1. The underlying architecture is designed to simultaneously support distinct applications fitting at least one of the decomposition patterns.

This architecture is service-oriented, in the sense that it supports the transition from software licensing to services: the application provider may install a thin client interface in the hospital to access remote (Grid) resources where the actual application would run, while the cost for the end-user would be based on usage.

As Figure 1 shows, we chose a basic two-tier client-server architecture. The server can simultaneously provide different sets of services for each of the application clients, which normally reside in medical workspots somewhere in the clinical workflow. Our primary intention is to make this framework minimally invasive, in the sense that the use of external remote resources is transparent to the end-user. As an initial alternative, we considered offering the Grid-based solution as an option instead of enforcing it: depending on the computational requirements, on the quality of service needed and on the availability of Grid resources, the client application may automatically choose whether to use external Grid resources or not.
In the event that insufficient external resources are available, the applications would automatically fall back to locally available ones. In the future, when enough Grid providers are available and they can reliably ensure the required quality of service, such that sufficient remote resources are always obtainable when needed, the local version can be safely removed.

Currently, the medical applications that are part of the clinical workflow are hosted on Windows workstations at the client side. At the other end, Grid technology is centered around Globus [4], a software interface that provides the Grid “middleware”. Since Globus is Unix-based, in order to enable such applications to use the Grid for their execution while maintaining the standard way of working in hospitals, the compute-intensive part of the application has to be “removed” from the rest of the application and ported to the Grid environment.

To provide an interface from the Windows environment to a Grid infrastructure that is designed around Globus, we designed a Unix-based “Grid Access Point” (GAP) module that receives the requests from the client side and allocates the needed processors on the Grid, passing on the requests and returning the results to the client. We opted for a lightweight client providing user interfaces and visualization. The GAP uses Globus for submitting the requests to the Grid nodes, thereby exploiting the security and execution facilities offered by it. Globus is entirely Unix-based, so it could not be used if we chose to connect the client side directly to the Grid nodes.
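The optional use of Grid resources with an automatic fall-back to local execution, as described above, can be sketched in Python as follows. This is an illustration under assumptions rather than the GAMA client: `GridStatus`, `Request`, `choose_backend` and `execute` are hypothetical names, and the decision criteria (free processors and expected waiting time) merely stand in for the computational requirements, the quality of service and the resource availability mentioned in the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class GridStatus:
    free_processors: int          # processors the access point can currently allocate
    expected_wait_seconds: float  # estimated delay before the job would start


@dataclass
class Request:
    processors_needed: int
    max_wait_seconds: float       # quality-of-service bound set by the application


def choose_backend(request: Request, status: Optional[GridStatus]) -> str:
    """Return "grid" only when the Grid can meet the request, else "local"."""
    if status is None:  # access point unreachable
        return "local"
    if status.free_processors < request.processors_needed:
        return "local"
    if status.expected_wait_seconds > request.max_wait_seconds:
        return "local"
    return "grid"


def execute(
    request: Request,
    status: Optional[GridStatus],
    run_on_grid: Callable[[Request], object],
    run_locally: Callable[[Request], object],
) -> Tuple[str, object]:
    """Run remotely when possible; fall back to the local version otherwise."""
    if choose_backend(request, status) == "grid":
        try:
            return "grid", run_on_grid(request)
        except ConnectionError:
            pass  # remote submission failed, so use the local resources instead
    return "local", run_locally(request)
```

Because the choice is made inside `execute`, the calling application sees the same interface in both cases, which is one way to keep the use of remote resources transparent to the end-user.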

Figure 1. GAMA overview. The diagram shows three medical workspots (Application 1, Application 2, Application 3) on the client side, connected to the Grid Access Point on the server side, which gives access to the Grid resources.

Often the client side, which may be a regular clinical site, has slow external network connections. Therefore, the overall performance of the application benefits from keeping the client side as little involved as possible in the computational algorithm and from restricting the communication with the rest of the application to job-submission requests only. For high throughput, the GAP should be connected to the Grid infrastructure via a fast network.

2. Case Study: Fiber Tracking

In this section we describe our results of applying the GAMA architecture to a compute-intensive medical application suitable for computational decomposition, the fiber tracking application [5, 6].

2.1. Description

White fiber tracking is an indirect medical imaging technique, based on Diffusion Weighted Imaging, that allows for the extraction of the connecting pathways between brain structures. There are various solutions to fiber tracking, but the common feature is that starting from a number of points, white matter fibers have to be tracked in the entire domain. The more starting points are considered in a certain area, the more fibers can be detected that cross that area. For areas with a high concentration of fibers, too many detected fibers may also lead to a clogged, indistinguishable image. Extending the region in which the starting points are distributed also yields a larger number of detected fibers. Considering too few starting points or too small a region may result in low accuracy, i.e. relevant fibers being missed. The best selection method for the relevant fibers is therefore not to consider few starting points, but to specify a number of regions of interest (ROI) that the tracked fibers should cross. The time to run such an application depends on the number of starting points, the algorithm, and the size of the data set, and can amount to several hours.

When tracking the fibers crossing one or more ROIs, a common approach is to start tracking from all the voxels in one or more of the ROIs. Another approach is to start tracking from the entire domain, either from every voxel, or from a selection of voxels (uniformly distributed or not) spread over the entire domain. The advantage of full volume tracking is that it detects a larger number of fibers. It can also detect crossing, splitting and touching fibers. It is however slower, the runtime of the algorithm amounting to several hours (depending on the voxel size and on the number of voxels selected as starting points).

This type of algorithm is the ideal case for parallelization: a distributed solution can increase the throughput without decreasing the accuracy of the result. In order to pay off, the parallel solution should be scalable, and the communication overhead among the processors performing the algorithm should be small. For this application, fibers are tracked in the entire data domain, so directly decomposing the domain among the participating processors would not be viable because of the large need for communication and synchronization among the processors. The starting points however can be distributed among the processors without generating additional synchronization or communication needs.
Therefore, this problem is well suited for computational decomposition, meaning that each processor taking part in the computation receives the entire data domain, but the computation domain, i.e. the starting points, is divided among the processors. Instead of tracking fibers starting from the entire domain, starting from the voxels in one or more ROIs is faster, but fewer fibers are detected. The algorithm may miss crossing or splitting fibers when the region of interest does not include the areas where fibers split or cross. This case is also not suited for parallelization for small ROIs, because the communication overhead may be larger than the decrease in computation time due to parallelism. For large ROIs a parallel solution may still provide a reasonable speedup. The threshold for which a parallel solution starts being advantageous can only be detected through experiments. When the region of interest is a single voxel, at most one fiber, crossing that voxel, can be detected. For multiple ROIs, fibers crossing all ROIs have to be detected. There are two options, either all fibers crossing one ROI are detected and then it is checked whether those fibers also cross the other ROIs, or for each ROI the fibers are tracked and then the intersection of the set of fibers for all ROIs is computed. For a sequential solution the first approach, which is inherently sequential, should provide better throughput. The second approach may be parallelized by splitting the ROIs among the participating processes, but the intersection of the fibers has to be done sequentially, after all the processors finish their computation, which requires a large communication overhead and a strict synchronization. This solution is not scalable. 
Also in this case experiments can provide the numbers of starting points and regions of interest for which a parallel solution that splits the ROIs among participating processes improves the performance, but probably parallelization does not pay off for a realistic (small) number of regions of interest.
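The computational decomposition discussed above, in which every processor receives the entire data domain while the starting points are divided among the processors, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the `track` function is a toy stand-in for a real fiber tracking algorithm (it follows a per-voxel step on a 2-D grid), the names `split`, `track_chunk` and `parallel_track` are hypothetical, and threads merely play the role of the Grid processors.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Set, Tuple

Voxel = Tuple[int, int]
Fiber = Tuple[Voxel, ...]
Field = Dict[Voxel, Voxel]  # toy data domain: one step direction per voxel


def track(field: Field, seed: Voxel, max_steps: int = 100) -> Fiber:
    """Toy tracker: follow the per-voxel step until leaving the field."""
    path = [seed]
    current = seed
    for _ in range(max_steps):
        step = field.get(current)
        if step is None:
            break
        current = (current[0] + step[0], current[1] + step[1])
        if current not in field or current in path:
            break
        path.append(current)
    return tuple(path)


def crosses_all(fiber: Fiber, rois: Sequence[Set[Voxel]]) -> bool:
    """A fiber is relevant only if it crosses every region of interest."""
    voxels = set(fiber)
    return all(voxels & roi for roi in rois)


def track_chunk(field: Field, seeds: Sequence[Voxel],
                rois: Sequence[Set[Voxel]]) -> Set[Fiber]:
    """Work of one processor: the full data domain, but only its own seeds."""
    return {f for f in (track(field, s) for s in seeds) if crosses_all(f, rois)}


def split(seeds: Sequence[Voxel], parts: int) -> List[List[Voxel]]:
    """Divide the starting points round-robin among the workers."""
    return [list(seeds[i::parts]) for i in range(parts)]


def parallel_track(field: Field, seeds: Sequence[Voxel],
                   rois: Sequence[Set[Voxel]], workers: int = 4) -> Set[Fiber]:
    """Track every chunk independently and merge the selected fibers."""
    merged: Set[Fiber] = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = split(seeds, workers)
        for fibers in pool.map(lambda c: track_chunk(field, c, rois), chunks):
            merged |= fibers
    return merged
```

The workers never exchange data while tracking; the only communication is the final merge of the selected fibers, which is why this decomposition adds no synchronization among the processors.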



An algorithm with multiple regions of interest can be extended to use starting points from the entire domain, in order to detect crossing, splitting and touching fibers. It can be parallelized efficiently by computational decomposition, splitting the starting points. Similarly to what we explained above, a second level of parallelization that would split the regions of interest among processors does not seem to be advantageous. Separately detecting for each set of starting points the set of fibers crossing each of the ROIs and at the end computing an intersection of the sets of fibers introduces both communication and computation overhead. Only a large number of ROIs and a large number of fibers crossing all (most of) the ROIs may compensate for this overhead. We can investigate the existence of such a threshold, but the lack of scalability makes such a solution unsuited for parallelism.

Our Grid-enabled fiber tracking application described in this paper implements the full volume approach, and uses ROIs to select the relevant fibers.

2.2. Use Scenarios for the Fiber Tracking Application

In this section we present several scenarios describing possible uses of the fiber tracking application in healthcare, addressing as well the main requirements in those scenarios and the benefits of a distributed solution. Table 1 presents the results of this analysis.

2.2.1. The screening scenario

One of the applications of fiber tracking in healthcare is the processing of large numbers of data sets to obtain information regarding the geometry of the brain or to detect neurological afflictions. For this type of medical application there is a need for high accuracy and high throughput. Full volume fiber tracking has a large execution time in the sequential version. With a parallel solution on the Grid, the execution time can be significantly reduced, despite the extra communication overhead.

2.2.1.1. Scientific medical research

Applications of this type need to perform fiber tracking on a large number of data sets, in order to obtain the full geometry of the brain and to extract information relevant for the medical research, such as a brain map, or changes in the brain with age or due to various neurological illnesses. Such research receives much attention in healthcare, but it is not yet fully enabled by the existing applications and technology. Fiber tracking could provide important information about brain activity, connections among brain regions, and brain modifications as a consequence of aging, of a learning process, or of diseases. However, the current (sequential) application cannot fulfill the requirements of a research tool, because it is either too slow, or it does not reach the desired accuracy.

For this use scenario the results must be very accurate in order to be relevant. The performance is also important since typically such research needs many (batch) computations of the geometry on many data sets in order to be able to reach conclusions. The execution time for individual data sets and application runs is not critical, and often batch processing is used. But due to the large number of application runs on many data sets that are necessary in order to make the results conclusive, a large throughput becomes essential.



2.2.1.2. Screening for detecting neurological disease and modifications of the brain

The existing fiber tracking application is not yet suited for a screening procedure, in which a large number of patients would be scanned for detecting neurological problems, or just for a follow up of the changes in the brains of recovering patients (e.g. patients re-acquiring speech abilities lost as a consequence of accidents or brain surgery). These are still research issues, but we may expect that they will become current practice in healthcare. For this medical application to be relevant a large number of patients need to be scanned in a relatively short time, therefore the execution time of the algorithm has to be short to provide a good quality of service for individual patients and a high throughput for the screening process. Again high accuracy and high performance are required, which brings along the need for a parallel solution.

2.2.2. Intra-Operative Fiber Tracking Assisted Neurosurgery and Preoperative Planning

In the case of lesions or tumors in the brain, surgical intervention is used as a last resort when other types of treatment have failed. The ability to distinguish between different tissue types during surgery, and more importantly during cutting, is of utmost importance to the surgeon as injury to healthy brain tissue may result in nerve or muscle paralysis and loss of mental functions. In addition to this, connectivity between different brain areas is of increasing interest as it provides additional information to infer the brain functional organization. White matter fiber bundles form the foundation for this connectivity.

Various methods are in use to guide the surgeon during surgical intervention. Before the procedure begins, imaging techniques such as MRI and CT are used to determine the location and shape of the malignant tissue and its orientation with respect to healthy tissue. White fiber tracking can be used to image the connectivity between different brain regions. In lesion or tumor extraction, an opening is made in the skull and the malignant tissue is carefully removed without harming healthy tissue. To aid in the extraction, intra-operative ultrasound or X-ray is used to guide surgical instruments while in some cases laser projection on the brain and other guidance systems are used to direct the surgical instruments. Unfortunately, the brain changes shape both when the initial opening is made in the skull as well as during the extraction of tissue. This is especially a problem with "deep" tumors that are not located near the surface of the skull. Intra-operative MRI and Diffusion Tensor Imaging (DTI) techniques such as fiber tracking can aid the surgeon in determining the changed morphology of the brain.

In this scenario, surgery takes place in an MRI facilitated operating theater. During surgery, the surgeon can decide to acquire new images of the brain by sliding the patient in-and-out of the MRI scanner. Each time a new MR scan has been acquired, all post-processing of the data needs to be completed within a strict timeframe so that surgery is not halted for too long. To accomplish this, a parallelized version of fiber tracking has to be used. The data set is first transferred to the parallel computation platform, where the fibers are tracked, after which the results are transferred back to the operating theater for presentation. This scenario will only be effective if the overhead caused by transferring the original data to the fiber tracking algorithm and the results back to the operating theater is much smaller than the performance gain obtained from parallelization.



2.2.3. The training and education scenario

In this scenario, fiber tracking is used to provide medical students, radiologists and surgeons information about brain activity, connections among brain regions, and brain modification as a consequence of accidents or neurological afflictions. It could also develop into a useful tool for training surgeons, by providing access to a database of interesting cases and follow-ups of past interventions, and even a virtual intervention tool. Also for this scenario a high performance and a high accuracy of the fiber tracking are desired. However, these requirements are less critical than in the previous scenarios.

Table 1. The importance of the main requirements for the different scenarios

Scenario                            Response time   Throughput   Accuracy   Critical
Screening for scientific research   +               ++           +++        No
Screening for disease detection     ++              +++          +++        No
Preoperative planning               ++              +++          +++        No
Assisted neurosurgery               +++             +++          +++        Yes
Training and education              ++              +            ++         No

2.3. The environment

This research is carried out as part of the “Virtual Laboratory for e-Science” (VL-e) project (footnote 1). The VL-e project addresses challenging problems, including the manipulation of large scientific datasets, computationally demanding data analysis, access to remote scientific instruments, collaboration, and data dissemination across multiple organizations. The methods, techniques, and tools developed within the VL-e project are targeted for use in many scientific and industrial applications. The project will develop the infrastructure needed to support these and other related e-Science applications, with the aim to scale up and validate the developed methodology. The VL-e philosophy is that any Problem Solving Environment (PSE) based on the Grid-enabled Virtual Laboratory will be able to perform complex and multidisciplinary experiments, while taking advantage of distributed resources and generic methodologies and tools. In the VL-e project, several PSEs will be developed in parallel, each applied to a different scientific area. One of these areas is Medical Diagnosis and Imaging, which is the focus of this research.

The sequential fiber tracking application was built with the Philips Research Imaging Development Environment (PRIDE), which allows for the creation and the execution of prototype tools and other experimental software on a Windows NT-based machine using the Interactive Data Language (IDL) (footnote 2). This language has built-in algorithms and routines, a drag-and-drop GUI builder, data visualization capabilities, and cross-platform portability.

Our experiments with the distributed fiber tracking were performed on the Distributed ASCI Supercomputer (DAS-2) (footnote 3). DAS-2 is a five-cluster wide-area computer system located at five Dutch universities. The system was designed and deployed by the Advanced School for Computing and Imaging and is used for research in parallel and distributed computing. It consists of 200 1.0 GHz Dual-Processor Pentium-III nodes split into five clusters, one with 72 nodes, the other four with 32 nodes each. SURFnet (footnote 4) (100 Mb/s) interconnects the clusters, while Myrinet (footnote 5) (2 Gb/s) is used for local communication. The operating system on the DAS-2 is RedHat Linux version 7.3. Programs are started using the SGE batch queuing system, which allocates the requested number of nodes for the duration of a program run. We submitted the jobs to the DAS-2 using the Globus toolkit, which is installed on all DAS-2 clusters. The client workstation is connected to the GAP by a 10 Mb/s network, while the GAP is connected to the DAS-2 system by a 100 Mb/s network.

Finally, as part of the cooperation within the VL-e project we have deployed the Grid-enabled Fiber Tracking application and run experiments at the Amsterdam Medical Center in Amsterdam, while keeping the Grid Access Point running at Philips Research Laboratories in Eindhoven, and allocating Grid resources from the DAS-2 system. These experiments, carried out in a realistic hospital environment and with standard network solutions, have also shown that the fiber tracking application can obtain significant performance benefits from the use of Grid technologies.

Footnote 1: http://www.vl-e.nl/
Footnote 2: http://www.rsinc.com/idl/
Footnote 3: http://www.cs.vu.nl/das2

2.4. Architecture and design details

The specific architecture resulting from applying the generic GAMA architecture to the fiber tracking application is depicted in Figure 2. The purpose of this experiment is to validate the GAMA architecture and to assess the performance and the scalability of the Grid-enabled application. We started from a sequential version implementing full volume fiber tracking, i.e. fibers are tracked from every voxel in the data domain. The Grid-enabled application should gain performance by distributing its computational part, the fiber tracking algorithm, across Grid computational resources.

[Figure 2 diagram labels: hospital (scanner, client); grid access (GAP); grid cluster (node 0: distributor; nodes 1 to n: workers running fiber tracking); data exchanged: dataset, params, regions, results.]

Figure 2. GAMA applied to fiber tracking

We first developed a distributed solution to the fiber tracking algorithm and assessed its performance in a Grid environment for different numbers of processors.

Footnote 4: http://www.surfnet.nl/
Footnote 5: http://www.myri.com/



Our previous experiments [3] had shown that tracking long fibers takes noticeably longer than tracking short fibers, or checking areas with no fibers. It is also the case that fibers are grouped in large bundles. Since in our solution the results are only sent back to the GAP when all the tasks computing valid starting points have completed, the longest task determines the execution time of the module. This implies that simply splitting the computational domain into a number of sub-domains equal to the number of processors is not an efficient solution: Processors receiving parts of the domain with many long fibers perform a large amount of work, while processors receiving parts of the domain with no fibers spend most of the time waiting. As an alternative, we have split the domain on one of the axes in slices of width equal to the size of the voxel and implemented a workpool–based solution. One of the nodes executing the algorithm (the distributor) is only responsible for sending the parameters and the data to the other nodes (workers), distributing the workload, i.e. the slices with starting points, and collecting and sending the results at the end of the computation (see Figure 3). Each worker node takes one slice at a time from the distributor, tracks the fibers from the starting points in that slice, and computes and stores the valid starting points. This repeats until the worker node receives the termination message from the distributor, indicating that all the slices have been distributed. The worker node can then return the valid starting points identified and finish. The distributor assembles all partial results received from the worker nodes, sends them to the GAP, and terminates. Next, we extended the initial fiber tracking application with the distributed fiber tracking module. The original code of the application has been modified to use the new, gridified module instead of the local algorithm. 
The role of the GAP is to receive the requests and the data from the client side of the application, to start the distributed module on Grid resources, to send the requests and the data to the distributor on the Grid, to collect the results and to send them back to the client. The GAP can simultaneously serve multiple client applications. The communication between the client side of the application and the GAP, and between the GAP and the distributor is packet based. The total size of the datasets used in our experiments amounts to 22.4 MB; the sizes of the other packets are comparatively very small.
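The workpool scheme described above can be illustrated with a minimal sketch. It assumes that Python threads stand in for Grid nodes and that a shared queue stands in for the message exchange between distributor and workers; `track_fibers_in_slice`, the toy slices and the "even value means valid" rule are hypothetical placeholders, not the application's tracking code.

```python
import queue
import threading

TERMINATE = None  # sentinel standing in for the distributor's termination message


def track_fibers_in_slice(slice_index, slice_points):
    """Hypothetical placeholder for fiber tracking: even-valued points count as valid."""
    return [(slice_index, p) for p in slice_points if p % 2 == 0]


def worker(tasks, results):
    """Take one slice at a time until the termination message arrives."""
    valid = []
    while True:
        item = tasks.get()
        if item is TERMINATE:
            break
        slice_index, slice_points = item
        valid.extend(track_fibers_in_slice(slice_index, slice_points))
    results.put(valid)  # return all valid starting points at the end


def distribute(slices, n_workers):
    """Distributor: hand out slices, send termination messages, merge partial results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for index, points in enumerate(slices):
        tasks.put((index, points))
    for _ in threads:
        tasks.put(TERMINATE)  # one termination message per worker
    for t in threads:
        t.join()
    merged = []
    for _ in threads:
        merged.extend(results.get())
    return sorted(merged)


slices = [[1, 2, 3, 4], [5, 6], [], [8, 10, 11]]
valid_points = distribute(slices, n_workers=3)
```

Because each worker fetches a new slice only when it has finished the previous one, a worker that receives slices with long fibers simply processes fewer slices, which is the dynamic load balancing effect the text relies on.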

Figure 3. The flow of starting points and results for the distributed fiber tracking algorithm



3. Results

In this section we assess the performance of our gridified fiber tracking application. In Figure 4, the fiber tracking application is used to detect fibers in the proximity of a brain tumor (the pre-operative planning scenario). Figure 5 depicts a screenshot of the fiber tracking application.

Figure 4. Fiber tracking through a single ROI for a patient with brain tumor

Figure 5. Screenshot of the Grid-enabled fiber tracking application



We first evaluate the performance of the distributed algorithm implementing the computational part of the fiber tracking application. In that case, only the communication overhead among the worker nodes is considered. The scalability results for the distributed fiber tracking algorithm for up to 128 worker nodes (plus one distributor) are depicted in Figure 6. The speedup of our solution is almost linear and the communication overhead for transferring data and results between the distributor and the worker nodes is very small. In all these cases we have also studied the load balancing among the worker nodes. Due to the dynamic load balancing we implemented, the durations of the tasks were similar for up to 128 nodes. Therefore, we concluded that the size of the work items (slices with starting points) we chose is small enough for a good load balancing in this case. Decreasing the work size would yield higher communication overhead between the distributor and the worker nodes, increasing it may result in some of the workers spending time waiting for the others to finish, both cases having a negative influence on performance. For this problem, the fraction of the algorithm that tracks a fiber from a starting point is inherently sequential and sets a hard limit on the speedup. With increasing numbers of worker nodes, the work-items size may finally reduce the speedup, even for dynamic load balancing. However, taking into account the performance of the application on 128 nodes, we conclude that it is not useful to increase the number of nodes performing the computation and that the chosen work-items size is well suited for our case study.

Figure 6. The speedup of the distributed fiber tracking application

Table 2 shows the performance of the end-to-end solution, which includes, besides the communication overhead of the distributed solution, the overhead of transferring the parameters and results between the client and the GAP, and between the GAP and the distributor node. The results show that, despite the extra overhead, the Grid-based fiber tracking outperforms the local, sequential solution when it uses at least 3 worker nodes.



Table 2. The response time of the distributed algorithm at the client side

Version       Resources   No. of nodes   Response time [s]
Sequential    Local       1              464.50
Distributed   Grid        1              1266.48
Distributed   Grid        2              636.85
Distributed   Grid        4              319.69
Distributed   Grid        8              164.31
Distributed   Grid        16             81.13
Distributed   Grid        32             42.16
Distributed   Grid        64             22.51
Distributed   Grid        128            11.79
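The scaling behaviour in Table 2 can be checked with a few lines of arithmetic. The sketch below uses only the response times from the table; the function name and the choice of the one-node Grid run as the speedup baseline are ours, made for illustration.

```python
# Response times in seconds, copied from Table 2.
sequential_time = 464.50
distributed_times = {1: 1266.48, 2: 636.85, 4: 319.69, 8: 164.31,
                     16: 81.13, 32: 42.16, 64: 22.51, 128: 11.79}


def speedup_table(times, baseline):
    """Return (nodes, speedup, efficiency) rows relative to the given baseline time."""
    rows = []
    for nodes, t in sorted(times.items()):
        speedup = baseline / t
        rows.append((nodes, speedup, speedup / nodes))
    return rows


rows = speedup_table(distributed_times, distributed_times[1])
for nodes, speedup, efficiency in rows:
    print(f"{nodes:4d} nodes: speedup {speedup:6.1f}, efficiency {efficiency:.2f}")

# Gain of the 128-node Grid run over the local sequential application.
gain_over_sequential = sequential_time / distributed_times[128]

# Smallest measured node count at which the Grid run beats the sequential one.
break_even_nodes = min(n for n, t in distributed_times.items() if t < sequential_time)
```

Relative to the one-node Grid run, the 128-node run is roughly 107 times faster; relative to the local sequential application the gain is about 39 times, and among the measured configurations the Grid run first beats the sequential one at 4 nodes.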

Table 3. Comparison of the relevant steps in the sequential and the Grid-based fiber tracking

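Columns: Application, Computation, Transfer dataset (22.4 MB), Transfer params & results, Communication, Visualization.

Sequential: computation 464.5 s; transfer dataset 0; transfer params & results 0; communication 0; visualization 24.5 s.

Grid-based, 128 nodes: 14.5 s, 8.8 s, < 1 s and < 2 s for the computation, transfer and communication steps (the extracted layout does not preserve which value belongs to which column); visualization 24.5 s.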

The rendering of the fibers from the valid starting points (visualization) is not part of the computational algorithm, and was preserved at the client side. Therefore, the duration of this step is identical for the local and for the Grid-enabled solution. Table 3 compares the duration of the main steps for the sequential fiber tracking and for the distributed fiber tracking on 128 nodes. These estimates show that the overhead introduced with transferring the data to the Grid, and the communication overhead of the distributed solution are rather small, even for standard (low bandwidth) network solutions. Visualization, which in the sequential fiber tracking was almost unnoticeable for the end-user in comparison with the computational algorithm, became the longest step in the gridified application.

4. Conclusions and Future Work

In this paper we have described our service-oriented architecture aiming to enable compute-intensive medical applications to make use of the Grid. We have applied this architecture to a real clinical application, allowing it to access Grid resources for its computational part, while preserving its Windows-based interface. This way, the use of Grid technologies and resources is transparent to the end-user of the application, and the current way of operation at the hospital side is maintained. Our end-to-end solution exhibits significant performance improvements when run on standard compute systems and making use of standard network connections. We have also deployed the Grid-enabled application in a hospital environment, at the Amsterdam Medical Center, with similarly good results.



Although not apparent for the fiber tracking application, we expect that the communication overhead from transferring images to the computing back-end would significantly impact the performance, and reduce the speedup obtained from parallelization, for applications processing very large data sets. Therefore, GAMA would benefit from an architecture where data is stored “closer” (in terms of time required to communicate) to the computing back-end. Such a situation could occur when hospitals store medical data on remote storage resources, possibly part of a Grid. Having the data close to the computational resources would decrease the communication overhead that now occurs in GAMA. The fiber tracking test case described in the previous section illustrates an example of computational decomposition, one of the three decomposition paradigms that we have identified. Our current and future work includes applying the GAMA architecture to applications fitting the other two decomposition paradigms, extending the GAP to serve multiple types of applications, and implementing a distributed GAP to deal with the bandwidth bottleneck when running simultaneously a large number of applications.

5. Acknowledgements

Part of this work was carried out in the context of the Virtual Laboratory for e-Science project. This project is supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W) and is part of the ICT innovation program of the Ministry of Economic Affairs (EZ). Philips Medical Systems provided the data sets used in our experiments and the sequential fiber tracking application that constitutes the basis for our Grid-enabled case study application.

References

[1] S. Mori and P.C. van Zijl. Fiber tracking: principles and strategies - a technical review. NMR Biomed, 15(7-8):468–480, 2000.
[2] D. Xu, S. Mori, M. Solaiyappan, P.C. van Zijl, and C. Davatzikos. A framework for callosal fiber distribution analysis. Neuroimage, 17(3):1131–1143, 2002.
[3] A.I.D. Bucur, R. Kootstra and R. Belleman. A grid architecture for medical applications. HealthGrid, 2005.
[4] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Intl. J. Supercomputer Applications, 11(2):115–128, 1997.
[5] C. Preibisch, U. Pilatus, R. Bunke, F. Hoogenraad, F. Zanella, H. Lanfermann. Functional MRI using sensitivity-encoded echo planar imaging (SENSE-EPI). Neuroimage, 19(2):412–421, 2003.
[6] F.G.C. Hoogenraad, R.F.J. Holthuizen, R. Brijder. High angular resolution diffusion weighted MRI. International Patent, WO2005076030, 2005.



Early Diagnosis of Alzheimer’s Disease Using a Grid Implementation of Statistical Parametric Mapping Analysis

S. BAGNASCO a,1, F. BELTRAME b, B. CANESI b, I. CASTIGLIONI c, P. CERELLO a, S.C. CHERAN a,d, M.C. GILARDI c, E. LOPEZ TORRES e, E. MOLINARI b, A. SCHENONE b, L. TORTEROLO b

a Istituto Nazionale di Fisica Nucleare, Sezione di Torino, Torino, Italy
b BIOLAB, Dipartimento DIST, Università di Genova, Italy
c IBFM CNR, Università di Milano Bicocca, Istituto H San Raffaele, Milano, Italy
d Dipartimento di Informatica, Università di Torino, Italy
e CEADEN, Habana, Cuba

Abstract. A quantitative statistical analysis of perfusional medical images may provide powerful support for the early diagnosis of Alzheimer’s Disease (AD). A Statistical Parametric Mapping algorithm (SPM), based on the comparison of the candidate with normal cases, has been validated by the neurological research community to quantify hypometabolic patterns in brain PET/SPECT studies. Since suitable “normal patient” PET/SPECT images are rare and usually sparse and scattered across hospitals and research institutions, the Data Grid distributed analysis paradigm (“move code rather than input data”) is well suited for implementing a remote statistical analysis use case, described in the present paper. Different Grid environments (LCG, AliEn) and their services have been used to implement the above-described use case and tackle the challenging problems related to the SPM-based early AD diagnosis.

Keywords. Alzheimer’s disease, statistical analysis, distributed databases, grid computing

Introduction

Alzheimer’s Disease (AD) is the leading cause of dementia, accounting for more than half of all dementias in elderly people. Clinically, AD is characterized by a progressive loss of cognitive abilities, memory loss typically being the earliest sign of the disease. The qualitative analysis of medical images rarely provides useful indications for the diagnosis of AD. On the other hand, a statistical comparison of PET/SPECT images from suspect AD patients with PET/SPECT images from a database of normal cases is a powerful tool for an early diagnosis of AD. With this goal, the use of Statistical Parametric Mapping Analysis (SPM) for the quantification of hypometabolic patterns in brain PET/SPECT studies of patients in early stages of AD has been proposed in the literature [1].

In Section 1 the scenario and the software tools of the clinical application are described. Section 2 describes the general features of the Grid architecture. Section 3 concerns middleware issues and a detailed description of the two different implementations. In Section 4 some preliminary conclusions are presented.

1 Corresponding author: Stefano Bagnasco, Istituto Nazionale di Fisica Nucleare, Via Pietro Giuria 1, 10125 Torino, Italy. E-mail: [email protected]

1. The Statistical Analysis Use Case

The SPM software library was originally developed, and is made freely available, by the Functional Imaging Lab (FIL) at the Wellcome Department of Imaging Neuroscience (University College London) for activation studies in functional MRI [2]. Since then, the use of SPM has been extended and, through a specifically defined analysis protocol, SPM routines are presently the standard within the neurological research community for voxel-based analysis of PET/SPECT studies for the early diagnosis of AD. In order to achieve correct results, the SPM software library provides a number of functionalities related to image processing and statistical analysis: normalization, coregistration, smoothing, parameter estimation, statistical mapping.



Figure 1. Results of an SPM analysis on a PET study of glucose metabolism in a patient with dementia. Hypometabolic pattern in the frontal cortex: design matrix (top right), statistically significant clusters on a glass brain in three orthogonal planes (top left) and on a 3D brain rendering (bottom)

The statistical parametric mapping algorithm (the most important functionality for our goal) performs a statistical analysis in order to compare, on a voxel-by-voxel basis, the perfusion values in the test images against the corresponding values in normal images. A number of parameters such as the age of the patient and the average cerebral flow are taken into account. The whole software analysis sequence has been scientifically validated and even a small alteration would imply the necessity of a new evaluation by the Scientific/Clinical Community. The results of an SPM statistical analysis, shown in Figure 1, include hypometabolic maps and the related views of the brain. As an understanding of the functions used in the image analysis is very important to provide the correct values of parameters and to understand the results, only selected users should access SPM analysis in order to avoid errors in diagnosis. On the other hand, remote access to SPM analysis could provide doctors from peripheral hospitals with an invaluable tool to increase the “comparison database” and therefore improve the AD diagnosis.

On these bases, as a result of a previous research project [3], remote access to SPM is being made available through the Italian Portal of Neuroinformatics. The portal contains a section entirely dedicated to the statistical analysis of PET/SPECT images, accessible by authorized users only. Doctors or researchers accessing the portal may thus be supported in running analysis tasks on suspect AD patient studies. Directly from the portal, a user can upload the suspect AD image and select the normal cases for statistical calculation (Figure 2). The SPM application is available to authorized users without downloading any software tool. In order to use it, no particular hardware resource or specific computer knowledge is needed.
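To make the idea of a voxel-by-voxel comparison against normal images concrete, the following toy sketch computes a plain z-score per voxel. This is a deliberately simplified stand-in: SPM itself fits a general linear model with covariates such as age, which is not reproduced here. The function names, the threshold and the data values are illustrative assumptions.

```python
from statistics import mean, stdev


def voxel_z_scores(test_image, normal_images):
    """Z-score of each test voxel against the same voxel in the normal images.

    Assumes at least two normal images and non-constant values per voxel.
    """
    scores = []
    for i, value in enumerate(test_image):
        reference = [image[i] for image in normal_images]
        scores.append((value - mean(reference)) / stdev(reference))
    return scores


def low_voxels(scores, threshold=-2.0):
    """Indices of voxels whose value lies markedly below the normal range."""
    return [i for i, z in enumerate(scores) if z < threshold]


# Toy data: three 'voxels' per image, four normal images (illustrative values only).
normals = [[10.0, 20.0, 30.0], [12.0, 22.0, 31.0], [11.0, 19.0, 29.0], [9.0, 21.0, 30.0]]
test = [10.5, 12.0, 30.0]
z = voxel_z_scores(test, normals)
flagged = low_voxels(z)
```

Even in this reduced form, the sketch shows why the number of normal images matters: the per-voxel mean and spread are estimated from the normal set, so a small set gives unstable reference values.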



Figure 2. Functional data flow of access to SPM application through portal

The SPM Graphic User Interface (GUI) manages the decisional and computational data input and the graphical output. In order to make the application available to users on the net, the first step is the replacement of the SPM GUI with a web portal interface. ZOPE, an open source application server for building content management systems, intranets, portals, and custom applications [4], implemented in the Python programming language, has been used for the construction of the Portal. The information for the statistical analysis is therefore collected through a configuration file created by the web GUI (Figure 3).

Figure 3. Connection between system and portal.

The main issue with this configuration is the large set of options for the execution of the SPM algorithm. In order to help users to conduct a correct statistical analysis, a lot of parameters have been set to default values and only a few parameters are set by the user. A script that drives the input data collection for SPM from normal images has been implemented.
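The combination of fixed defaults and a few user-set parameters, written to a configuration file, might look like the sketch below. All option names, default values and the list of user-editable options are hypothetical; the paper does not list the actual SPM options or the file format used by the portal.

```python
import configparser
import io

# Hypothetical parameter names and defaults; the real SPM option set is much larger.
DEFAULTS = {"smoothing_fwhm_mm": "12", "p_threshold": "0.001", "normalise": "yes"}
USER_EDITABLE = {"p_threshold", "patient_age"}


def build_config(user_values):
    """Merge user input over the defaults, rejecting options the user may not set."""
    unknown = set(user_values) - USER_EDITABLE
    if unknown:
        raise ValueError(f"options not editable by the user: {sorted(unknown)}")
    config = configparser.ConfigParser()
    config["analysis"] = {**DEFAULTS, **{k: str(v) for k, v in user_values.items()}}
    return config


def to_text(config):
    """Serialise the configuration as it would be written to the file."""
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


config = build_config({"patient_age": 71, "p_threshold": 0.01})
text = to_text(config)
```

Restricting the editable options in one place keeps the validated analysis sequence unchanged while still letting the user supply case-specific values.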



2. The Grid Implementation

In order to evaluate the potential advantage of porting such an implementation to a Grid environment, it is worth noting that during the statistical parametric mapping a large set of images of normal patients is required to be used for comparison. This is because the accuracy of hypoperfusion maps is strictly related to the number of normal studies compared to the test image. On the other hand, due to ethical issues and to the high costs of neuroimaging technologies, PET and SPECT studies on normal subjects are very rare. The NEST-DD project, funded by the European Commission, collected a database of about 100 images in order to make available the first large dataset for these studies. Moreover the images of normal subjects are covered by privacy and security issues and for this reason they cannot be freely moved on the net or published by the centre that made the analysis. As a consequence, only doctors working at very large institutions, locally owning large databases of normal images, can usually carry out SPM-based analyses.

Starting from these considerations, the aim of our project has been to enable doctors from small peripheral hospitals to use large sets of normal PET/SPECT images provided by medical research institutes distributed on the net, by remotely extracting the information needed for the statistical analysis from the normal images and collecting it without moving the original image files.

Figure 4. A Grid implementation of the SPM portal services.

Furthermore, the execution time of the analysis must be compatible with an interactive clinical application in a busy medical environment. The time required for the analysis can be reduced, since:
• some aspects of the calculation could be parallelized and distributed on the computational resources associated with the remote databases of normal images;
• the time required for data transfers over the network would be reduced, since the code amounts to just a few KB, compared to images of up to 100 MB.
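A back-of-the-envelope estimate illustrates the second point. The bandwidth, the executable size and the number of images below are assumed values chosen for illustration; only the 100 MB image size comes from the text. The estimate ignores latency, the need to send the code to each repository, and the return of the extracted information.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_s):
    """Idealised transfer time: payload size divided by bandwidth, no latency."""
    return size_bytes * 8 / bandwidth_bits_per_s


BANDWIDTH = 10e6          # assumed 10 Mb/s link (illustrative, not from the paper)
CODE_SIZE = 5 * 1000      # assumed 5 KB executable ("a few KB" in the text)
IMAGE_SIZE = 100 * 10**6  # 100 MB image, the upper bound quoted in the text
N_IMAGES = 100            # assumed number of normal images used in one analysis

move_code = transfer_seconds(CODE_SIZE, BANDWIDTH)
move_data = N_IMAGES * transfer_seconds(IMAGE_SIZE, BANDWIDTH)
```

Under these assumptions, sending the executable takes a few milliseconds, whereas moving all the images would take more than two hours, which is the motivation for moving code rather than input data.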



The use of Grid technologies addresses all of the above issues and allows easy access to distributed data as well as to distributed computational resources. In particular, through data-management grid services doctors can access normal PET/SPECT image databases without moving images between hospitals, thus complying with privacy regulations. Through computational grid services statistical information can be extracted from normal images, and the image matrix and other information needed for the statistical analysis can be transferred to the management node without moving images. Moreover, this process can be executed in parallel on every repository machine to improve computing performance. The basic architecture of the Grid implementation of the SPM portal service is shown in Figure 4. The steps needed to complete the analysis sequence are listed below:

1. Acquisition of the test image on the user node
2. Transfer of the test image to the management node
3. Query on DB catalogue of normal images
4. Transfer of a small software executable for information extraction to the repository nodes
5. Extraction from normal images of the information needed for the statistical analysis
6. Transfer of the extracted information to the management node
7. SPM statistical analysis on the management node
8. Transfer of SPM results to the user node

Thus, in terms of Grid elements, repository nodes are grid sites, comprising at least a Storage Element (SE) and a Computing Element (CE) service, while the management node runs a User Interface (UI) functionality, since it must access remote central services (the data and metadata catalogues and possibly some job submission system). With this configuration, there is no need to install grid-specific software on user nodes, since all services are accessed through the web portal on the management node.
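Steps 3 to 6 of the sequence above, as seen from the management node, can be sketched as follows. The repository contents, site names and the extracted quantity (a per-image mean) are hypothetical, and a local thread pool stands in for job submission to the repository sites; no Grid middleware call is shown.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical repositories: site name -> locally stored normal images (toy voxel lists).
REPOSITORIES = {
    "site_a": {"n001": [10.0, 20.0], "n002": [12.0, 22.0]},
    "site_b": {"n003": [11.0, 19.0]},
}


def query_catalogue(max_images):
    """Step 3: return (site, image_id) pairs for the selected normal images."""
    entries = [(site, image_id) for site, images in sorted(REPOSITORIES.items())
               for image_id in sorted(images)]
    return entries[:max_images]


def extract_on_site(site, image_ids):
    """Steps 4-5: runs at the repository; only derived values leave the site."""
    rows = []
    for image_id in image_ids:
        voxels = REPOSITORIES[site][image_id]
        rows.append({"id": image_id, "mean": sum(voxels) / len(voxels)})
    return rows


def collect_normal_information(max_images=10):
    """Steps 3-6 as seen from the management node."""
    by_site = {}
    for site, image_id in query_catalogue(max_images):
        by_site.setdefault(site, []).append(image_id)
    with ThreadPoolExecutor() as pool:  # one extraction job per repository
        futures = [pool.submit(extract_on_site, s, ids) for s, ids in by_site.items()]
        return [row for future in futures for row in future.result()]


extracted = collect_normal_information()
```

The point of the structure is that `extract_on_site` is the only function that touches image data, so the original images never have to leave the repository that stores them.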
The portal and the remote central services are connected through queries to the data and metadata catalogues, built with specific ZOPE functionalities.

2.1. Security and user authentication

The user authenticates to the ZOPE portal via simple username/password authentication. For grid interactions, however, further authentication is needed via the user’s X.509 certificate and a MyProxy delegated credential mechanism [5]. Briefly, the user registers a renewable proxy in a MyProxy server, then the portal gets a delegated proxy, being authorized by the user with a specific password. From then on, that proxy is used to authenticate to both systems (AliEn and LCG) and the security infrastructure is the one provided by the underlying middleware.



3. Middleware Issues and Implementation Choices

Since the beginning of Grid research and activities, a number of different software suites have appeared, sometimes in the form of low-level toolkits (like Globus, some components of which are becoming de facto standards), sometimes as individual services and, in some cases, as fully fledged end-to-end solutions. Given the European context of the project, and the background of many of the authors, two software suites were evaluated: LCG/gLite [6], [7] and AliEn [8], the latter being already used by other Grid applications developed by the MAGIC-5 collaboration. As correctly pointed out in the HealthGrid whitepaper [9], although the ultimate goal would be the creation of a single EU-wide HealthGrid comprising all eHealth resources, the development path will include a number of application- or community-specific, independent Grids. Currently available middleware still lacks many of the security and privacy enforcing features needed by biomedical (and even more by eHealth) applications. Starting with a single Grid application, and thus having to cope with only a relatively small number of sites and users, reduces the number of security and privacy constraints. Another important constraint is imposed by the need to deploy Grid elements in hospitals, where the environment is often very different from the one usually found in research centers. Very few hospitals, if any, are willing to devote a large amount of resources (in terms of network, computing resources, and manpower) to the installation and maintenance of complex systems. Since this is a research project, the software to be installed is evolving quickly, and maintaining such a system may be nontrivial. Thus, ease of deployment and maintenance is one of the most important constraints in the choice of middleware.
In compute- and data-intensive applications on very large infrastructures (like HEP applications on the worldwide LHC Computing Grid), with several Virtual Organizations competing for resources, one of the main issues is the distribution of jobs across sites and the optimal usage of available resources. In our case, as in many other eHealth grid applications, the crucial point is reliable and efficient data and metadata management, here used to identify suitable normal images across hospitals. As a basic choice, in order to avoid too much duplication of functionality and code, as many functions as possible have been designed to be common to the two implementations. For example, if the job splitting (see below) is done by the portal code and not by the JobSplitter (AliEn) or by writing a DAG (LCG), the piece of code is identical, with a switch governing the type of JDL file to be produced.

3.1. The AliEn-based implementation

The following observations, along with the availability of an existing AliEn infrastructure for GPCALMA, the MAGIC-5 mammographic CAD project, suggested the choice of AliEn as one of the technologies for the prototype service currently being developed:

• A standard AliEn site can be installed in less than an hour on a single machine by an inexperienced user, thus allowing for very fast deployment of prototype sites.
• AliEn has an integrated data and metadata catalogue that has been tested and used by the ALICE collaboration for a few years, as well as by the MAGIC-5 mammography project GPCALMA [10].
• The AliEn data access implementation relies on the widely used xrootd [11] protocol.
• AliEn provides tight integration with ROOT [12], the data analysis C++ framework adopted by the MAGIC-5 collaboration for software development, and with PROOF [13], the Parallel ROOT Facility.
• AliEn is interfaced with the LCG/gLite infrastructure and middleware, so that our prototype could in the future, should the need arise, be integrated into a larger system based on a different technology.

However, on the assumption that middleware services and their interfaces will evolve, we designed our interface to be as small and as modular as possible. The VO server is hosted at INFN Torino on hardware shared with other MAGIC-5 applications (mammogram and lung CT analysis). It runs the users and configuration databases, along with the central data and metadata catalogue. Storage Elements for the prototype deployment are currently installed at BIOLAB Genova and INFN Torino.

3.1.1. AliEn-based Data Management

AliEn provides an integrated solution that offers data and metadata management services for a Virtual Organisation. Its performance, up to several million entries in the catalogues, is being continuously and thoroughly tested by the ALICE Collaboration [14]. The integrated data and metadata catalogue comprises two layers: a central catalogue and distributed Local File Catalogues. The central file catalogue holds the correspondence between a Logical File Name (LFN), a unique identifier (GUID) and the list of Storage Elements holding a replica of the file. The LFN syntax is a filesystem-like tree structure, which is mirrored in the DB structure by tables representing nodes (logical directories) that can hold pointers to leaves (file entries) or to other nodes. This approach also has the advantage of providing an easily browsable structure. In AliEn, metadata management is built into the central catalogue DB structure, with no need for an extra product. Metadata are implemented by further tables, linked to the relevant logical tree nodes. Privacy of sensitive data can be ensured (as is done, e.g., in the GPCALMA project) by separating the data into two different subtrees, with specific access privileges for different users. In the second layer, correspondences between GUIDs and Physical File Names (PFNs, in the form of Storage URLs) are stored using a distributed approach, in which the DB is kept local to the site hosting the relevant SE.
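The two-layer lookup described above (LFN to GUID and replica list in the central catalogue, GUID to PFN in the site-local catalogue) can be sketched as follows. This is a minimal illustration, not AliEn code: the dictionaries, the site names, the storage URLs and the `resolve` helper are all hypothetical.

```python
# Layer 1, central catalogue: LFN -> (GUID, Storage Elements holding a replica).
CENTRAL = {
    "/spm/normals/img001": ("guid-001", ["SE_Torino", "SE_Genova"]),
}

# Layer 2, one local file catalogue per site: GUID -> PFN (a storage URL).
# The URLs below are made-up examples.
LOCAL = {
    "SE_Torino": {"guid-001": "root://se.torino.example//data/img001"},
    "SE_Genova": {"guid-001": "root://se.genova.example//data/img001"},
}

def resolve(lfn, preferred_se=None):
    """Return (SE, PFN) for a logical file name, honouring a preferred SE."""
    guid, replicas = CENTRAL[lfn]
    se = preferred_se if preferred_se in replicas else replicas[0]
    # Only this second lookup touches the site-local catalogue.
    return se, LOCAL[se][guid]

print(resolve("/spm/normals/img001", preferred_se="SE_Genova"))
```

Because the central layer never stores physical paths, a site can reorganise its storage by updating only its own local catalogue.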
Keeping the GUID-to-PFN mapping local to each site reduces the load on the central services and allows local management of the physical storage; alternatively, when the remote site does not provide a DB service, the site catalogues can be centralised using additional tables in the central DB. Remote catalogue implementations can be based on a number of DB backends, including the LCG File Catalogue (LFC). File access can be implemented by simply pre-staging the file from the local SE to the WN (which, in this deployment, is always on the same network): this can be done automatically by the AliEn JobWrapper. Alternatively, and especially if the job is run via PROOF, the xrootd access protocol and POSIX-like APIs can be used to access files remotely without moving them from the SE. For our application, a dedicated service (with a MySQL backend for the central services) was deployed on the MAGIC-5 server; it can be accessed through regular AliEn clients (e.g. the AliEn shell extensions) or via application-specific GUIs, which can be integrated in access portals. Two tasks are to be performed on the Data Catalogue from the web portal running on the management node:

• query the metadata catalogue to find images relevant to the current statistical analysis;
• select which images to use, find out their Physical File Names for access and possibly download the images if the remote centre allows it (while this is not part of the described use case, it is useful to have the functionality available, e.g. for debugging).

Both functionalities have been implemented as Perl scripts, which can be used either as independent command-line tools or integrated with the web portal, thus hiding atomic native catalogue functions (even though the internal language of the portal engine is Python, exploiting the native AliEn Perl APIs justified the small extra effort for integration). Integration with the portal allows seamless GUI interaction with the functionalities available from the catalogue, the analysis software and the portal services.

3.1.2. PROOF-based Analysis

Once the data management service has provided the required information about the input images, the SPM analysis algorithm can be started.
It is possible to opt for a batch analysis, by submitting one job per remote image, as described for the LCG-based implementation (see Section 3.2). The individual jobs can be configured either by having the system generate a set of JDL files or by exploiting the AliEn "job splitting" feature. Alternatively, one can make use of PROOF [13] for a distributed interactive analysis. A very small PROOF cluster, with 3 nodes on 2 different domains, was configured in order to implement and test the functionality. Access takes place on the master node and goes through the ROOT shell. The output of a query to the data management services (a list in which each entry is the site and the physical file name of a selected image) is used to dynamically generate the analysis script from a template that implements the algorithm comparing the images. The script is executed in parallel on the different input files stored on the three sites, and the results are sent back to the master node.
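Generating the analysis script from a template, given the (site, physical file name) list returned by the catalogue query, might look like the sketch below. The template text, the variable names inside it and the example storage URLs are assumptions made for illustration; the actual template used by the authors is not given in the text.

```python
from string import Template

# Hypothetical template: only the input list and its length are filled in.
TEMPLATE = Template(
    "// auto-generated analysis script (hypothetical layout)\n"
    "const char* inputs[] = {\n$entries\n};\n"
    "const int n_inputs = $count;\n"
)

def build_script(selection):
    """Fill the template from a list of (site, physical file name) pairs."""
    entries = ",\n".join(f'  "{pfn}"  /* {site} */' for site, pfn in selection)
    return TEMPLATE.substitute(entries=entries, count=len(selection))

script = build_script([
    ("SE_Torino", "root://se.torino.example//data/img001"),
    ("SE_Genova", "root://se.genova.example//data/img002"),
])
print(script)
```

Keeping the algorithm in the template and only the file list in the generated part means the same analysis code runs unchanged whatever the catalogue query returns.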



A full integration with the Web portal is not yet available. Once testing is completed, the node that hosts the Web portal will also become the PROOF master node, and the interactive analysis will be triggered by a user request on a Web form.

3.2. The LCG-based implementation

The LCG project provides a series of sites and services spread all over the world, on which LCG and gLite middleware are installed and several Virtual Organisations are enabled. LCG middleware makes it possible to couple effectively a wide variety of machines, including supercomputers, storage systems and data sources, and to create a uniform interface for connecting heterogeneous data resources over a network and for accessing data and metadata. As a subset of this community, a test-bed named GILDA, entirely dedicated to disseminating the potential of grid computing, has been deployed by the EGEE project [15]. An LCG node has been installed at the BioLab laboratory, University of Genoa, and is now an official site of the GILDA test-bed for biomedical applications. The objectives of the LCG implementation of the SPM application described above are:

• to distribute PET/SPECT images on different storage resources available on the GRID and register them in a catalogue;
• to associate metadata with images, in order to search and select the images for comparison using their own attributes;
• to access images from the User Interface using logical file names (LFNs) without copying them to Worker Nodes.

3.2.1. LCG-based Data Management

The LFC (LCG File Catalog) was selected: it allows users and applications to locate files (or replicas) on the LCG Grid by maintaining mappings between logical and physical file names. As a next step we integrated AMGA (ARDA Metadata Grid Application [16]), a component that also fulfils the second requirement. At present, LCG does not provide a satisfactory metadata management system, and AMGA fills this gap.
The collected metadata are associated with files stored on the LCG Grid through a reference in the LFC catalogue system and are used to select images directly through the portal. An important feature provided by AMGA is the ability to restrict access to specified attributes to certain users only. This is very important because all medical data should be considered sensitive in order to preserve patient privacy. Furthermore, on a grid the distributed nature of the data makes security problems more acute. In this context the federation and proxying functionalities provided by AMGA are very important, because they make it possible to leave highly confidential data in their original places (hospitals) and to avoid copying data to other database backends.
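The idea of attribute-level access restriction can be illustrated with a small, generic sketch. It does not use the AMGA API; the roles, the attribute names and the sample entries are invented for the example. The design point it shows is that filtering happens on the restricted view, so a hidden attribute can neither be read nor be used as a search criterion.

```python
# Invented sample metadata entries; not real patient data.
ENTRIES = [
    {"lfn": "/spm/normals/img001", "age": 63, "sex": "F", "hospital": "A"},
    {"lfn": "/spm/normals/img002", "age": 71, "sex": "M", "hospital": "B"},
]

# Hypothetical roles and the attributes each one is allowed to read.
READABLE = {
    "clinician": {"lfn", "age", "sex", "hospital"},
    "analyst": {"lfn", "age", "sex"},
}

def query(role, predicate):
    """Return the entries matching `predicate`, restricted to readable attributes."""
    allowed = READABLE[role]
    # Build the restricted view first, then apply the predicate to it.
    visible = [{k: v for k, v in e.items() if k in allowed} for e in ENTRIES]
    return [e for e in visible if predicate(e)]

print(query("analyst", lambda e: e["age"] > 65))
```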



To meet the third requirement, LCG Data Management and File access tools have been used. In order to understand the architecture of the LCG implementation, the different APIs available for Data Management operations in LCG-2 are shown in Figure 5.

Figure 5. Available LCG APIs for Data Management.

lcg_util is a C Application Programming Interface (API) that provides the same functionality as the lcg-* commands (lcg-utils). This layer should cover most basic needs of user applications. It interacts transparently with the LFC catalogue and makes use of the correct protocol for file transfer. The Grid File Access Library (GFAL) [17] provides calls for catalogue interaction, storage management and file access, and can be very handy when an application requires access to some part of a big Grid file but does not want to copy the whole file locally. The library hides the interactions with the LCG-2 catalogues, the SEs and the SRMs, and presents a POSIX-like interface for I/O operations on the files. GFAL accepts GUIDs, LFNs, SURLs and TURLs as file names and, in the first two cases, tries to find the closest replica of the file. Depending on the type of storage on which the file's replica resides, GFAL will use one protocol or another to access it. GFAL can deal with GSIFTP, secure and insecure RFIO, or gsidcap in a way that is transparent to the user. In order to make use of the above-mentioned LCG tools, the application code has been modified and structured in the following way:

1. Registration and storage of data files (PET/SPECT images) on Storage Elements is performed using lcg_utils.
2. AMGA is used to insert metadata and to interact with the portal, so that the user can select the normal images for the statistical analysis.
3. A C program has been developed which makes use of the GFAL API in order to access distributed images through their Logical File Names and to extract the information necessary for the SPM analysis without copying the images locally.
4. Job submission: a JDL file is created to submit the executable (and not the images) to the GRID. Due to the sequential nature of the process, the job can be split into a number of smaller sub-jobs, which are executed in parallel and directly on the CEs closest to the SEs where the images reside. We are also evaluating the possibility of adopting a DAG solution for job submission and synchronization, but this still needs to be assessed.
5. Statistical analysis: the final SPM analysis steps are run on the results obtained by the remote jobs. Statistical analysis is performed locally, outside of the Grid environment.

Figure 6 represents the application GRID infrastructure of the LCG implementation.

Figure 6. Structure of the grid implementation.
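The job-splitting step (one sub-job per remote image, with only the executable shipped to the Grid) can be sketched as a small JDL generator. This is an illustration under assumptions: the executable name, the output file names and the example LFNs are hypothetical, and the attribute set is a plausible minimal one rather than the JDL actually used by the authors. `InputData` is included on the assumption that the workload management system uses it to match a CE close to the SE holding the file.

```python
def make_jdl(executable, lfn, index):
    """Return JDL text for one sub-job that processes one remote image."""
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{lfn}";',
        f'StdOutput = "extract_{index}.out";',
        f'StdError = "extract_{index}.err";',
        # Only the small executable travels with the job, never the image.
        f'InputSandbox = {{"{executable}"}};',
        f'OutputSandbox = {{"extract_{index}.out", "extract_{index}.err"}};',
        # Assumption: lets the broker pick a CE near the image's SE.
        f'InputData = {{"{lfn}"}};',
        'DataAccessProtocol = {"gsiftp"};',
    ]
    return "\n".join(lines) + "\n"

def split_job(executable, lfns):
    """One JDL per selected image, so that sub-jobs can run in parallel."""
    return [make_jdl(executable, lfn, i) for i, lfn in enumerate(lfns)]

jdls = split_job("extract_info", ["lfn:/grid/example/img001",
                                  "lfn:/grid/example/img002"])
print(jdls[0])
```

A switch on the middleware type could select a different `make_jdl` variant while leaving `split_job` untouched, which is the kind of code sharing between the two implementations that the text describes.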

4. Conclusions and outlook

A new approach to the implementation of SPM-based early diagnosis of Alzheimer's disease has been described. It leverages the functionalities provided by Grid computing and data services to gain access to a distributed database of normal images. Two implementations, based respectively on AliEn and LCG middleware, were developed, deployed and tested. Both provide the required functionalities; a detailed description of the features of the two approaches is given in the previous sections. Both implementations provide methods and tools for accessing remote distributed data while satisfying security requirements, for extracting information from those data without moving files over the network, for managing the related metadata, for building and maintaining catalogues, and for submitting jobs to the Grid together with all the needed parameters. One of the clearest advantages of AliEn in a medical environment is its ease of installation and maintenance. Conversely, the wider base of users and applications (and the fact that it is the outcome of a larger project) makes LCG a reliable middleware providing a large set of resources in a production-grade environment, with complete documentation and effective technical support. As a next step, access to the application through a Grid-based portal will be provided. A more detailed comparison of computational performance is also planned.

Acknowledgments

The authors gratefully acknowledge the support provided by the MAGIC-5 collaboration funded by INFN and the Grid.it project funded by MIUR. Thanks to the Gilda Team at INFN Catania for their invaluable support.

References

[1] K. Herholz et al., "Discrimination between Alzheimer Dementia and Controls by Automated Analysis of Multicenter FDG PET," NeuroImage 17 (2002) 302-316.
[2] K.J. Friston, "Statistical Parametric Mapping and Other Analysis of Functional Imaging Data," in Brain Mapping: The Methods, pages 363-385. Academic Press, 1996.
[3] S. Scaglione et al., "Neuroinformatics portal as knowledge repository and e-service for neuroapplication and data mining," proceedings of Medicon 2004, Napoli, August 2004.
[4] Zope Corporation, Inc. http://www.zope.org/
[5] J. Basney, M. Humphrey, and V. Welch, "The MyProxy Online Credential Repository," Software: Practice and Experience, 35 (2005) 801-816.
[6] A. Delgado Peris et al., "LCG-2 User Guide," EGEE EC Project, see http://egee.itep.ru/User_Guide.html
[7] http://public.eu-egee.org/; http://glite.web.cern.ch/glite/
[8] P. Saiz et al., "AliEn - Alice Environment on the Grid," Nucl. Instrum. Meth., A502 (2003) 437-440.
[9] The HealthGrid Association, HealthGrid Whitepaper, see http://whitepaper.healthgrid.org (2004).
[10] P. Cerello et al., "GPCALMA: a Grid-based Tool for Mammographic Screening," Methods Inf. Med. 44 (2005) 244-8.
[11] A. Hanushevsky, "The Next Generation Root File Server," proceedings of CHEP04, Interlaken, September 2004.
[12] R. Brun, F. Rademakers, "ROOT - An Object Oriented Data Analysis Framework," proceedings of the AIHENP'96 Workshop, Lausanne, Sep. 1996, Nucl. Inst. Meth. A389 (1997) 81-86. See also: http://root.cern.ch
[13] M. Ballintijn et al., "The PROOF Distributed Parallel Analysis Framework based on ROOT," proceedings of CHEP03, La Jolla, March 2003.
[14] http://aliceinfo.cern.ch/
[15] https://gilda.ct.infn.it
[16] N. Santos and B. Koblitz, "Metadata services on the grid," proceedings of ACAT'05, Zeuthen, Berlin, May 2005.
[17] See http://grid-deployment.web.cern.ch/grid-deployment/gis/GFAL/gfal.3.html



Using the Grid to Analyze the Pharmacokinetic Modelling after Contrast Administration in Dynamic MRI

Ignacio Blanquer (a), Vicente Hernández (a), Daniel Monleón (b), José Carbonell (a), David Moratal (a), Bernardo Celda (b), Montse Robles (a), Luis Martí-Bonmatí (c)

(a) Universidad Politécnica de Valencia, Valencia, Spain
(b) Departamento de Química Física, Universitat de València, Valencia, Spain
(c) Servicio de Radiología, Hospital Universitario Dr. Peset, Valencia, Spain

Abstract. Angiogenesis in hepatic lesions is an important marker of tumour aggressiveness and response to therapy. However, its quantitative analysis requires a deep knowledge of the hepatic perfusion. The development of pharmacokinetic models constitutes a very valuable tool, but it is computationally intensive. Moreover, abdominal image processing increases the computational requirements, since the movement of the patient makes the images in a time series not directly comparable, so that a pre-processing step is required. This work presents a Grid environment developed to deal with the computational demand of pharmacokinetic modelling. This article proposes and implements a four-layer software architecture that provides a simple interface to the user and deals transparently with the complexity of the Grid environment. The four layers implemented are: the Grid Layer (the closest to the Grid infrastructure), the Gate-to-Grid Layer (which transforms user requests into Grid operations), the Web Services Layer (which provides a simple, standard and ubiquitous interface to the user) and the Application Layer. An application has been developed on top of this architecture to manage the execution of multi-parametric groups of co-registration actions on a large set of medical images. The execution has been performed on the EGEE Grid infrastructure. The application is platform-independent and can be used from any computer without special requirements.

1. INTRODUCTION AND MOTIVATION

The liver is the largest organ of the abdomen, and a large number of lesions affect it. Both benign and malignant tumours arise within it. The liver is also the target organ for the metastases of most solid tumours. Angiogenesis is an important marker of tumour aggressiveness and response to therapy. Moreover, chronic inflammatory change affects a large proportion of the population. The blood supply to the liver is derived jointly from the hepatic arteries and the portal venous system. Dynamic Contrast Enhanced Magnetic Resonance Imaging (DCE-MRI) is extensively used for the detection of primary and metastatic hepatic tumours. However, the assessment of the early stages of malignancy and of other diseases such as cirrhosis requires the quantitative evaluation of the hepatic arterial supply. To achieve this goal, it is important to develop precise pharmacokinetic approaches to the analysis of the hepatic perfusion. The influence of breathing, the large number of pharmacokinetic parameters and the fast variations in contrast concentration in the first moments after contrast injection reduce the efficiency of traditional approaches. On the other hand, traditional radiological analysis requires the acquisition of images covering the whole liver, which greatly reduces the time resolution of the pharmacokinetic curves. The combination of all these adverse factors makes the analytical study of liver DCE-MRI data very challenging.

2. STATE OF THE ART

The current use of the Internet as the main infrastructure for the integration of information through web-based protocols has opened the door to new possibilities. Web Services (WSs) are one of the most consolidated technologies in web environments. They are based on the Web Services Description Language (WSDL), which defines the interface and constitutes a key part of the Universal Description Discovery and Integration (UDDI) [7]. WSs communicate through the Simple Object Access Protocol (SOAP) [8], a simple and decentralized mechanism for the exchange of typed information structured in XML (Extensible Markup Language). As defined in [9], a Grid provides an abstraction of resources for sharing and collaborating across different administrative domains. These resources can be hardware, data, software and frameworks. The key concept of the Grid is the Virtual Organization (VO) [10], defined as a temporary or permanent set of entities or groups that provide or use resources. The usage of Grid computing is currently expanding. In this process of development, many basic middlewares have arisen, such as the different versions of the Globus Toolkit [10] (GT2, GT3, GT4), Unicore [15] or InnerGrid [16]. At present, Grid technologies are converging towards Web Services technologies. The Open Grid Services Architecture (OGSA) [11] represents an evolution in this direction. OGSA seems to be an adequate environment for obtaining efficient and interoperable Grid solutions, although some issues (such as security) still need to be improved. Globus GT3 implemented OGSI (Open Grid Service Infrastructure), which was the first implementation of OGSA. OGSI was later deprecated and replaced in GT4 by an implementation of OGSA based on the Web Services Resource Framework (WSRF) [18]. WSRF is based entirely on WSs.
Although there are newer versions, Globus GT2 is a well-established basic batch Grid platform, which several projects have extended along a path different from the one taken by GT3 and GT4. The DATAGRID project [12] developed the EDG (European Data Grid), a middleware based on GT2, which improved the support for distributed storage, VO management, job planning and job submission. The EDG middleware has been improved and extended in the LCG (Large Hadron Collider Computing Grid) [2] and AliEn projects to fulfil the requirements of the High Energy Physics community. Another evolution of the EDG is gLite [14], a Grid middleware based on WSs and developed in the frame of the Enabling Grids for E-sciencE (EGEE) project [14]. gLite has extended the functionality and improved the performance of critical components, such as security, by integrating the Virtual Organisation Membership Service (VOMS) [13] for the management of VOs. VOMS provides information on the user's relationship with the Virtual Organization, defining groups, roles and capabilities. These middlewares have been used to deploy Grid infrastructures comprising thousands of resources, which increases the complexity of using the Grid. However, the maturity of these infrastructures in terms of user-friendliness is not yet sufficient. Configuration and maintenance of the services and resources, as well as fault tolerance, are hard even for experienced users. Programming Grid applications usually requires a nontrivial degree of knowledge of the intrinsic structure of the Grid. This article presents a software architecture that shields users from the management of Grid environments by providing a set of simple services. Although the proposed architecture is open to different problems, this article shows its use in the implementation of an application for the co-registration of medical images. This application is aimed at medical end-users and at researchers who need a large number of executions using different values for the parameters that control the process. Researchers can tune the algorithms by executing larger sets of runs, whereas medical users can obtain the results without requiring powerful computers. The application does not require a deep knowledge of Grid environments. It offers a high-level, user-friendly interface to upload data, submit jobs and download results without requiring knowledge of command syntax, the Job Description Language (JDL), data and resource administration or security issues.

2.1. Pharmacokinetic Modelling

The pharmacokinetic modelling of the images obtained after the quick administration of a bolus of extra-cellular gadolinium chelate contrast can have a deep impact on the diagnosis and evaluation of different pathological entities. Pharmacokinetic models are designed to forecast the evolution of an endogenous or exogenous component in the tissues. To follow the evolution of the contrast agent, a sequence of volumetric MRI images is obtained at different times following the injection of contrast. Each of these images comprises a series of image slices that cover the body part explored.

Figure 1: The MR pulse sequence includes 24 slices covering the whole liver (a). The first 9 images of the dynamic acquisition are depicted in (b).

The pharmacokinetic model considers that the liver receives contrast through the hepatic artery and the portal vein. Each input flow is determined by a rate parameter: kai for the arterial flow and kpi for the portal vein. The total amount of contrast delivered to the liver at time t depends on the concentration of contrast in the hepatic artery (Ca(t)) and in the portal vein (Cp(t)) and on the values of those constants. The result is the concentration in the liver (Cl(t)). The liver also outputs contrast at a rate defined by klo. Figure 2 shows a schema of this process and the equations that drive it.



Figure 2: Pharmacokinetic model
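Since the equations of Figure 2 are not reproduced in the text, the sketch below assumes the simplest form consistent with the description above, a dual-input one-compartment model: dCl/dt = kai·Ca(t) + kpi·Cp(t) − klo·Cl(t). This form is an assumption, not the authors' stated equation, and the forward-Euler integrator and all numerical values are chosen only for illustration.

```python
def simulate_liver(ca, cp, kai, kpi, klo, dt, cl0=0.0):
    """Integrate dCl/dt = kai*Ca + kpi*Cp - klo*Cl with forward Euler.

    ca, cp: sampled arterial and portal concentrations (same length).
    Returns the liver concentration at each time step, starting from cl0.
    """
    cl = [cl0]
    for a, p in zip(ca, cp):
        dcl = kai * a + kpi * p - klo * cl[-1]
        cl.append(cl[-1] + dt * dcl)
    return cl

# Illustrative values only: constant inputs, arbitrary rate constants.
n = 2000
curve = simulate_liver([1.0] * n, [2.0] * n, kai=0.1, kpi=0.3, klo=0.5, dt=0.1)
print(curve[1], curve[-1])
```

With constant inputs the curve settles at (kai·Ca + kpi·Cp)/klo, which gives a quick sanity check on any implementation. Estimating kai, kpi and klo from measured concentrations is the inverse problem and has to be solved for every voxel, which is why the computation is demanding.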

The known values in the model are the concentrations of contrast (obtained from the images), and the values to be obtained are the flow rates (kai, kpi, klo). The study of pharmacokinetic models for the analysis of hepatic tumours is an outstanding example of the above. However, since the whole acquisition process takes a few minutes, images are obtained in different breath-hold periods. The resulting movement of the patient produces artefacts that make the images not directly comparable. This is even more important in the area of the abdomen, which is strongly affected by breathing and by the motility of the organs. A prerequisite for computing the parameters that govern the model is therefore to reduce the deformation of the organs in the acquired images. This can be done by co-registering all the volumetric images with respect to the first one.

2.2. Co-registration

The co-registration of images consists of aligning the voxels of two or more images in the same geometrical space, using the transformations necessary to make the floating images as similar as possible to the reference image. In general terms, the registration process can be rigid or deformable. Rigid registration applies only affine transformations (displacements, rotations, scaling) to the floating images. Deformable registration also allows elastic deformations of the floating images. Rigid registration introduces fewer artefacts, but it can only be used when dealing with body parts in which the level of internal deformation is low (e.g. the head). Deformable registration can introduce unrealistic artefacts, but it is the only one that can compensate for the deformation of elastic organs (e.g. in the abdomen). Image registration can be applied in 2D (individually to each slice) or in 3D. Registration in 3D is necessary when the deformation occurs along all three axes. The co-registration process implemented in this case is based on the ITK software library [1].
This process comprises a first stage, a rigid 3D registration of the Gaussian-filtered volume images (Mutual Information Metric and Gradient Descent Optimizer), followed by a 3D deformable registration (Mutual Information Metric, Gradient Descent Optimizer and BSpline Transform).
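Both registration stages are driven by a mutual information metric, which the optimizer tries to maximise over the transform parameters. The sketch below computes the textbook discrete mutual information between two equally sized intensity lists from their joint histogram. It is meant only to show what the metric measures; ITK's own metric implementations are more elaborate, and the toy "images" are invented.

```python
from collections import Counter
from math import log

def mutual_information(fixed, moving):
    """Mutual information (in nats) of two equally sized intensity lists."""
    n = len(fixed)
    joint = Counter(zip(fixed, moving))   # joint histogram
    pf = Counter(fixed)                   # marginal histogram, fixed image
    pm = Counter(moving)                  # marginal histogram, moving image
    # sum over observed pairs of p(f,m) * log( p(f,m) / (p(f) * p(m)) )
    return sum((c / n) * log(c * n / (pf[f] * pm[m]))
               for (f, m), c in joint.items())

reference = [0, 0, 1, 1]
aligned = [0, 0, 1, 1]      # intensities fully predictable from the reference
unrelated = [0, 1, 0, 1]    # statistically independent of the reference
print(mutual_information(reference, aligned))
print(mutual_information(reference, unrelated))
```

The metric is highest when the intensities of one image predict those of the other, which is why it is suited to image pairs whose intensities are related but not identical, as happens when contrast uptake changes between acquisitions.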



2.3. Post processing

Although the co-registration of images is a computationally complex process that must be performed before the images are analysed, it is not the only task that needs high performance computing. Extracting the parameters that define the model and computing the transfer rates for each voxel in the space also require large computing resources. The platform implemented has been designed to cope with this subsequent post-processing in the same way.

3. ARCHITECTURE

The basic Grid middleware used in this architecture is LCG, developed in the LHC Computing Grid Project, which has good support for high-throughput executions. A four-layered architecture has been developed to abstract the operation of this middleware. The registration application has been implemented on top of this architecture. Medical data are prone to abuse and need careful treatment in terms of security and privacy. This is even more important when the data have to flow between different sites. It is crucial both to preserve the privacy of the patients and to ensure that the people accessing the information are authorised to do so. How this environment guarantees security is described in other publications [1].

3.1. Layers

As mentioned in previous sections, the development of this framework has been structured into four layers, thus providing a higher level of independence and abstraction from the specificities of the Grid and the resources. The following sections describe the technology and the implementation of each layer. Figure 3 shows the layers of the proposed architecture.

Figure 3: The proposed architecture. Labels in the figure: Application (EGEE Registration Launcher); Web-Services Middleware; WS Container; FTP Server; Gate-to-Grid User Interface; EGEE Grid Infrastructure.

3.1.1. Grid Layer

The system developed in this work makes use of the computational and storage resources deployed in the EGEE infrastructure across a large number of computing centres distributed among different countries. EGEE currently uses LCG, although there are plans for migrating to gLite. This layer offers the "single computer" vision of the Grid through the storage catalogues and the workload management services, which tackle the problem of selecting the most suitable resource.



The Job Description Language (JDL) is the way in which jobs are described in LCG. A JDL file is a text file specifying the executable, the program parameters, the files involved in the processing and other additional requirements. A description of the four layers of the architecture is provided in the following subsections.

3.1.2. Gate-to-Grid Layer

The Gate-to-Grid Layer constitutes the meeting point between the Grid and the Web environment. In this layer there are WSs that provide interaction with the Grid as if the user were directly logged in to the UI. The WSs are deployed in a Web container in the UI, which provides this mediation. The Grid is used through the UI by means of a set of scripts and programs which have been developed to ease the task of launching executions and managing groups of jobs. The steps required to execute a new set of jobs in the Grid are the following:

1. A unique directory is created for each parametric execution. This directory has separate folders to store the received images to be co-registered, the JDL files generated and the output files retrieved from the jobs. It also includes several files with information about the jobs, such as job identifiers and the parameters of the registration process.
2. Files to be registered are copied to a specific location in this directory.
3. For each combination of parameters and pair of volumes to be registered, a JDL file filled in with the appropriate values is generated.
4. Files needed by the registration process are copied to the SE and registered in the RC and the RLS.
5. The jobs are submitted to the Grid through an RB that selects the best available CE according to a predefined criterion.
6. Finally, when the job is done and its output retrieved, folders and temporary files are removed from the UI. The files registered in the SE that are no longer needed are also deleted.

The different Grid Services are offered through the scripts and programs mentioned above.
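Step 3 of the procedure above, one JDL per combination of registration parameters and pair of volumes, can be sketched as follows. The executable name, the parameter names, the command-line flags and the volume file names are hypothetical; the sketch only shows how the parametric sweep multiplies the number of jobs, with every floating volume paired with the first (reference) volume as described in Section 2.1.

```python
from itertools import product

def plan_jobs(volumes, param_grid):
    """Build one job description per (floating volume, parameter combination)."""
    reference, floating = volumes[0], volumes[1:]
    names = sorted(param_grid)  # fixed order so the output is reproducible
    jobs = []
    for vol in floating:
        for values in product(*(param_grid[name] for name in names)):
            params = dict(zip(names, values))
            args = " ".join(f"--{k} {v}" for k, v in params.items())
            jobs.append({
                "reference": reference,
                "floating": vol,
                "params": params,
                # Hypothetical minimal JDL; a real one needs more attributes.
                "jdl": (f'Executable = "register";\n'
                        f'Arguments = "{reference} {vol} {args}";\n'),
            })
    return jobs

jobs = plan_jobs(["t0.img", "t1.img", "t2.img"],
                 {"iterations": [100, 200], "step": [0.1]})
print(len(jobs))
print(jobs[0]["jdl"])
```

The number of jobs grows as (number of volumes − 1) × (number of parameter combinations), which is what makes a Grid attractive for tuning runs.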
These programs work with the UI commands in order to ease job and data management. They are remotely accessible through the WSs deployed in the UI. The input files are copied from the computer where the user is located to the UI computer through FTP (File Transfer Protocol). The most important WSs offered in the UI are:

InitSession. This service is in charge of creating the proxy from the Grid user certificates. The proxy is then used in the Grid environment as a credential of that user, providing a single sign-on for the access to all the resources.

GetNewPathExecution. As described before, the jobs launched in a parametric execution (and not yet cleared) have their own group folder in the UI. This folder has to be unique for each group of jobs. This service obtains a unique name for each job group and creates the directory tree to manage that job execution. This directory stores the images, logs, JDLs and other information files.

Submit. The submit call starts an action that registers the files from the UI in the SE and creates the JDLs according to the given registration parameters and the files stored in the specified directory of the UI. It finally submits the jobs to the Grid using the generated JDL files.

GetInformationJobs. This service gets information about the jobs belonging to the same execution group. The information retrieved by this call is an XML document with the job identifiers and the associated parameters.

CancelJob. This call cancels a single job (part of an execution group). The cancellation of a job implies the removal of its registered files from the Grid.

CancelGroup. This service cancels all the jobs launched to the Grid from a group of parametric executions. As in the case of CancelJob, the cancellation of the jobs implies the removal of their associated files from the SEs. Moreover, in this case the temporary directory created on the UI is also removed when all the jobs are cancelled.

GetJobStatus. This service reports the status of a job in the Grid, given the job identifier. The normal sequence of states of a job is: submitted, waiting, ready, scheduled, running, done and cleared. Other possible states are aborted and cancelled.

PrepareResults. This service is used to prepare the results of an execution before downloading them. When a job finishes, the resulting image, the standard output and the standard error files can be downloaded. For this purpose the PrepareResults service retrieves the results from the Grid and stores them in the UI.

The executable must exist in the UI system and it has to be statically compiled so that it can be executed without library dependency problems on any machine of the Grid. The registration implemented in this project is based on the Insight Segmentation and Registration Toolkit (ITK) [4] software library. ITK is an open source software library for image registration and segmentation.

3.1.3. Middleware Web Services Layer

The Middleware Web Services Layer provides an abstraction over the use of the WSs. This abstraction has two purposes: on the one hand, to create a unique interface independent from the application layer and, on the other hand, to provide methods and simple data structures that ease the development of final applications. The development of a separate software library for the access to the WSs will ease future extensions for other applications that share similar requirements. Moreover, it will enable optimizations to be introduced in this layer without necessarily affecting the applications developed on top of it. In addition, this layer offers a set of calls based on the Globus FTP APIs to perform the data transfer with the Gate-to-Grid Layer.

More precisely, the abstraction of the WSs consists, in the first place, in hiding the creation and management of the stubs needed for the communication with the published WSs. In the second place, this layer manages the data obtained from the WSs by means of simple structures closer to the application. For each of the WSs available in the Gate-to-Grid Layer, there is a method in this layer that takes the information in XML returned by the WS and returns it as basic types or structured objects which can be managed directly by the application layer.
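The conversion from XML to structured objects can be sketched in Python as follows. The schema of the XML document returned by GetInformationJobs is not given in this paper, so the element and attribute names used here (`jobs`, `job`, `id`, `param`, `name`) are assumptions made for illustration only. The job states are the ones listed for GetJobStatus.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    """Job states as enumerated in the description of GetJobStatus."""
    SUBMITTED = "submitted"
    WAITING = "waiting"
    READY = "ready"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    DONE = "done"
    CLEARED = "cleared"
    ABORTED = "aborted"
    CANCELLED = "cancelled"

    @property
    def is_final(self):
        # No further transition is expected from these states.
        return self in (JobState.CLEARED, JobState.ABORTED, JobState.CANCELLED)

    @property
    def results_available(self):
        # Results can be prepared and downloaded once the job is done.
        return self is JobState.DONE


@dataclass
class JobInfo:
    """Structured object handed to the application layer."""
    job_id: str
    parameters: dict = field(default_factory=dict)


def parse_job_group(xml_text):
    """Turn a (hypothetical) GetInformationJobs XML document into JobInfo objects."""
    root = ET.fromstring(xml_text)
    jobs = []
    for node in root.findall("job"):
        params = {p.get("name"): p.text for p in node.findall("param")}
        jobs.append(JobInfo(job_id=node.get("id"), parameters=params))
    return jobs
```

The application layer then only manipulates `JobInfo` and `JobState` values and never sees the XML or the WS stubs.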



3.1.4. Application Layer

This layer offers the graphical user interface used for user interaction. It makes use of the functions, objects and components offered by the middleware WS layer to perform any operation on the Grid. The developed tool has the following features available:

• Parameter profile management. The management of the parameters allows creating, modifying and removing configurations of parameters for the launching of multi-parametric registrations. These registration parameters are defined as a range of values, expressed by three values: initial value, increment and final value. The profiles can be loaded from a set of templates or directly filled in before submitting the co-registrations according to these parameters.

• Transfer of the volumetric images that are going to be registered. The first step is to upload the images that are going to be registered. The application provides the option to upload the reference image and the other images that will be registered. These files are automatically transferred to the UI to be managed by the Gate-to-Grid layer.

• Submission of the parametric jobs to the Grid. For each combination of the input parameters, the application submits a job belonging to the same job group. The user can assign a name to the job group to ease the identification of jobs in the monitoring window.

• Job monitoring. The application offers the option to keep track of the submitted jobs for each group.

• Obtaining the results. When a Grid execution has reached the state done, the user can retrieve the results generated by the job to the local machine. The results include the registered image, the standard output and the standard error generated by the program launched to the Grid. The user can also download the results from a group of jobs automatically.
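The parameter profiles described in the first item, and the one-job-per-combination submission described in the third, can be sketched in Python as follows. The parameter names in the usage note are hypothetical, and the handling of floating point increments is a choice made for this sketch rather than a description of the tool.

```python
from itertools import product


def expand_range(initial, increment, final):
    """Expand an (initial value, increment, final value) triple into a list."""
    if increment <= 0:
        raise ValueError("increment must be positive")
    if final < initial:
        raise ValueError("final value must not be below the initial value")
    # The small epsilon protects against floating point round-off in the division.
    count = int((final - initial) / increment + 1e-9) + 1
    return [round(initial + i * increment, 10) for i in range(count)]


def profile_combinations(profile):
    """Return one parameter dictionary per job of a multi-parametric execution.

    `profile` maps a parameter name to its (initial, increment, final) triple.
    """
    names = sorted(profile)
    axes = [expand_range(*profile[name]) for name in names]
    return [dict(zip(names, values)) for values in product(*axes)]
```

For instance, a profile such as `{"iterations": (100, 100, 300), "step": (0.5, 0.5, 1.0)}` expands to three values for the first parameter and two for the second, that is, six jobs in the same job group.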

4. RESULTS

The first result of this work has been the LCG Registration Launcher tool, which has been developed on top of the architecture described in this article. Figure 4 shows two screenshots of the application, one showing the panel for uploading reference and floating volumes, and the other one showing the panel for the monitoring of the launched jobs.



Figure 4: Screenshots of the LCG registration launcher application.

The results obtained can be considered in terms of performance and scientific results. The results presented in this section are related to the images from a clinical trial with 20 patients obtained at the Hospital Dr. Peset for this work.

Considering the performance, the time required to perform a registration of a volumetric image on a PIII at 866 MHz with 512 MB of RAM is approximately 1 hour and 27 minutes. Considering that the complete study involved 20 patients, the total sequential cost would be 2331h 22m. Using a 20-processor computing farm, the complete process took 132h 50m. The computational cost using the resources of the EGEE grid was 17h 35m. More than 200 computers were available, but since the system is shared with other users, several cases had to wait on local queues. Moreover, several jobs failed and needed to be rescheduled. If the same resources were used in a batch processing approach, running the jobs manually on the computing farm, the computing time would be about 8% shorter. The overhead of Grids is due to the use of secure protocols, remote and distributed storage resources and the scheduling overhead, which is in the order of minutes because the monitoring policies are implemented by polling.

Regarding the co-registration results obtained, Figure 5 shows a tiled composition of two images before the co-registration (a) and after the process (b). The figure clearly shows the improvement in the alignment of the voxels of the image. Clear differences are observed on the top of the abdomen and on the area of the ribs.

Figure 5: Images before coregistration (left) and after coregistration (right)
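The timing figures reported in this section can be related to each other with a few lines of Python. This is only arithmetic on the published numbers. Reading the ratios as speed-ups and dividing the total by the single-registration time are interpretations added here, not measurements from the paper.

```python
def to_minutes(hours, minutes):
    """Convert an 'Hh Mm' duration to minutes."""
    return hours * 60 + minutes


single_registration = to_minutes(1, 27)   # one registration on the reference PC
sequential_total = to_minutes(2331, 22)   # estimated sequential cost of the study
farm_total = to_minutes(132, 50)          # 20-processor computing farm
grid_total = to_minutes(17, 35)           # EGEE grid

# Number of registrations implied by the sequential estimate.
implied_registrations = sequential_total / single_registration

# Ratios of the sequential estimate to the measured times.
farm_speedup = sequential_total / farm_total
grid_speedup = sequential_total / grid_total

# Fraction of the ideal 20-fold speed-up reached on the farm.
farm_efficiency = farm_speedup / 20

print(f"implied registrations: {implied_registrations:.0f}")
print(f"farm speed-up: {farm_speedup:.1f} (efficiency {farm_efficiency:.0%})")
print(f"grid speed-up: {grid_speedup:.1f}")
```

Under these assumptions the farm comes out roughly 17.6 times faster than the sequential estimate and the grid roughly 132.6 times faster.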



Finally, considering the final output of the whole perfusion analysis process, figure 6 shows a parametric image obtained as a function of the parameters that drive the model (voxel concentration versus arterial concentration). This image has been obtained by solving the overdetermined system of equations of the pharmacokinetic model described in figure 2, using as input the concentrations of the contrast in the different slices of the co-registered images. The intensity of each pixel is defined by the value of kLo in the corresponding voxel.

Figure 6: Final Result: Parametrical Image.

5. CONCLUSIONS AND FURTHER WORK

The final application developed in this work offers an easy-to-use, high-level interface that allows Grid-unaware users to use the LCG2-based EGEE Grid infrastructure for image co-registration. With the tool described in this work, the user obtains high computational performance for the co-registration of radiological volumes and the evaluation of the parameters involved. The Grid is an enabling technology that provides clinical practice with processes that, because of their computational requirements, were not feasible with a conventional approach. It also offers a high throughput platform for medical research.

The proposed architecture is adaptable to different platforms and enables the execution of different applications by changing the user interface. This work is a starting point for the realization of a middleware focused on the abstraction of the Grid to ease the development of interfaces for the submission of complex jobs to the Grid.

The developed application is part of a larger project. As introduced in section 1, the co-registration application is a first step of the pharmacokinetic model identification. The next step will be the extraction of the pharmacokinetic model from a set of acquisitions. For this task, it was necessary to have the co-registration tool that has been developed in this work. The middleware WS layer will be extended to support new functionalities related to the extraction of the pharmacokinetic model. Finally, the application currently supports the Analyze format [6], although the extension to other formats such as DICOM [5] is among the priorities being considered.



6. REFERENCES

[1] L. Ibañez, W. Schroeder, L. Ng, J. Cates, “The ITK Software Guide”, second edition, 2005, http://www.itk.org, Jan 2006.
[2] I. Blanquer, V. Hernández, D. Segrelles, “A Framework Based on Web Services and Grid Technologies for Medical Image Registration”, Lecture Notes in Computer Science, Biological and Medical Data Analysis: 6th International Symposium (ISBMDA), ISSN: 0302-9743, vol. 3745, pp. 22-33, 2005.
[3] “LHC Computing Grid”, http://lcg.web.cern.ch/LCG, Jan 2006.
[4] Ibanez, Schroeder, Ng and Cates, “The ITK Software Guide”, edited by Kitware Inc., ISBN 1-93093410-6.
[5] National Electrical Manufacturers Association, “Digital Imaging and Communications in Medicine (DICOM)”, 1300 N. 17th Street, Rosslyn, Virginia 22209 USA.
[6] Mayo Clinic, “Analyze 7.5 File Format”.
[7] Scott Short, “Creación de Servicios Web XML para la Plataforma .NET”, McGraw-Hill, ISBN 8448137027, 2002.
[8] “Universal Description, Discovery and Integration (UDDI)”, http://www.uddi.org, Jan 2006.
[9] “Simple Object Access Protocol (SOAP)”, http://www.w3c.org, Jan 2006.
[10] Expert Group Report, “Next Generation Grids”, edited by the European Commission, 2004, http://www.cordis.lu/ist/grids/index.htm, Jan 2006.
[11] I. Foster and C. Kesselman, “The GRID: Blueprint for a New Computing Infrastructure”, edited by Morgan Kaufmann Publishers, Inc., 1998.
[12] I. Foster, C. Kesselman, J. Nick and S. Tuecke, “The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration”, The Globus Project, 2002, http://ww.globus.org/research/papers/ogsa.pdf, Jan 2006.
[13] “The DATAGRID Project”, http://www.eu-datagrid.org, Jan 2005.
[14] Virtual Organization, http://hep-project-grid-scg.web.cern.ch/hep-project-grid-scg/voms.html, Jan 2006.
[15] “gLite. Lightweight Middleware for Grid Computing”, http://glite.web.cern.ch/glite, Jan 2006.
[16] Dietmar Erwin, “Unicore Plus Final Report”, 2003.
[17] “InnerGrid Users' manual”, edited by GridSystems, 2003.
[18] “Enabling Grids for E-sciencE”, http://www.eu-egee.org, Jan 2006.
[19] “The Globus Alliance”, http://www.globus.org, Jan 2006.



Medical image registration algorithms assessment: Bronze Standard application enactment on grids using the MOTEUR workflow engine

Tristan Glatard (a), Johan Montagnat (b), and Xavier Pennec (c)
(a) CNRS, I3S laboratory
(b) CNRS, I3S laboratory
(c) INRIA Sophia Antipolis

Abstract. Medical image registration is a pre-processing step needed for many medical image analysis procedures. A very large number of registration algorithms are available today, but their performance is often not known and very difficult to assess due to the lack of a gold standard. The Bronze Standard algorithm is a very data and compute intensive statistical approach for quantifying the accuracy of registration algorithms. In this paper, we describe the Bronze Standard application and we discuss the need for grids to tackle such computations on medical image databases. We demonstrate MOTEUR, a service-based workflow engine optimized for dealing with data intensive applications. MOTEUR eases the enactment of the Bronze Standard and similar applications on the EGEE production grid infrastructure. It is a generic workflow engine, based on current standards and freely available, that can be used to instrument legacy application code at low cost.

1. The Bronze Standard application

Computerized medical image analysis is now a well established area that provides assistance for diagnosis, modeling, and pathology follow-up. With the growing inspection capabilities of imagers and the growth of medical data production, the need for large amounts of data storage and computing power increases. Grids have been identified as a tool suitable for dealing with medical data. Successful examples of grid application deployment for image database analysis, optimization of medical image algorithms, simulation, etc., have already been reported [7].

1.1. Medical image registration

Medical image registration algorithms play a key role in a very large number of medical image analysis procedures. Together with image segmentation algorithms, they are fundamental processing steps often needed prior to any subsequent analysis. Image registration consists in searching for a 3D transformation between two images, so that the first one can be superimposed on the second one in a common 3D frame. The transformation may be rigid (the composition of a translation and a rotation) to express a 3D change of frame, or non-rigid to express local deformations of space. A rigid registration is useful for aligning similar data (such as images of the same patient acquired at different times) into a single frame. A non-rigid registration is useful for computing the deformation map between different data (such as data acquired from two different patients). In addition, the registration is said to be mono-modal when both images have been acquired using the same imaging modality (thus sharing some common signal characteristics) or multi-modal when the modalities differ (signal differences then have to be compensated for).

The computational load of these algorithms varies greatly depending on the type of registration computed, the size of the images to process, and the algorithms themselves. In general, non-rigid, multi-modal algorithms are more costly than rigid, mono-modal algorithms. On typical 3D images and using up-to-date PCs, the computation time varies from a few minutes in the simplest cases to tens of hours for the most compute intensive registrations.

1.2. Registration algorithms assessment

Given the very common use of registration algorithms and the different contexts of their application, a large number of new algorithms are developed by the research community. Approximately a hundred new research papers are published on that subject each year. A difficult problem, as for many other medical image analysis procedures, is the assessment of the robustness, accuracy and precision of these algorithms [4]. Indeed, there is no well established gold standard against which to compare the algorithm results. Different approaches have been proposed to solve this issue.
It is possible to synthesize images by simulating the acquisition physics and to test the algorithm on the synthetic images produced [1]. However, realistic images are difficult to produce and are hardly perfect enough for a fine assessment of the algorithms. Phantoms (manufactured objects with properties close to human tissues for the imaging modality studied) can also be used to acquire test images. However, it is also very difficult to manufacture sufficiently realistic phantoms.

1.3. The Bronze Standard method

An alternative for assessing registration algorithms is a statistical approach called the Bronze Standard [9]. The goal is basically to compute the registration of as many image pairs as possible with as many registration algorithms as possible, so that we obtain a largely overdetermined system relating the geometry of all the images. This makes the application very compute and data intensive. Suppose that we have n images of the same organ of one patient and m registration algorithms. We have in fact only n−1 free transformations to estimate that relate all these images, say $\bar{T}_{i,i+1}$. The transformation between images i and j is obtained by composition: $\bar{T}_{i,j} = \bar{T}_{i,i+1} \circ \bar{T}_{i+1,i+2} \circ \dots \circ \bar{T}_{j-1,j}$ if i < j (or the inverse of $\bar{T}_{j,i}$ if i > j). The free transformation parameters are computed by minimizing the prediction error on the observed registrations:


$$\min_{\bar{T}_{1,2},\,\bar{T}_{2,3},\,\dots,\,\bar{T}_{n-1,n}} \; \sum_{i,j \in [1,n],\; k \in [1,m]} d\left(T^{k}_{i,j},\, \bar{T}_{i,j}\right)^{2} \qquad (1)$$

where $T^{k}_{i,j}$ is the transformation computed between images i and j by the k-th registration algorithm, and d is a distance function between transformations chosen as a robust variant of the left invariant distance on rigid transformations [11].

The estimation $\bar{T}_{i,i+1}$ of the perfect registration $T_{i,i+1}$ is called bronze standard because the result converges toward $T_{i,i+1}$ as the number of methods m and the number of images n become larger. Indeed, considering a given registration method, the variability due to the noise in the data decreases as the number of images n increases, and the registration computed converges toward the perfect registration up to the intrinsic bias (if there is any) introduced by the method. Now, using different registration procedures based on different methods, the intrinsic bias of each method also becomes a random variable, which is hopefully centered around zero and averaged out during the minimization procedure. The different biases of the methods are now integrated into the transformation variability. To fully reach this goal, it is important to use as many independent registration methods as possible.

In this process, we do not only estimate the optimal transformations, but also the rotational and translational variance of the “transformation measurements”, which are propagated through the criterion to give an estimate of the variance of the optimal transformations. These variances should be considered as a fixed effect (i.e. these parameters are common to all patients for a given image registration problem, contrary to the transformations) so that they can be computed more faithfully by increasing the number of patients. An important variant of the Bronze Standard is to relax the assumption of the same variances for all algorithms, and to unbias their estimation.
This can be realized by using only m − 1 out of the m methods to determine the bronze standard registration, and using the obtained reference to determine the accuracy of the last method. In this paper, we consider m = 4 different registration algorithms in our implementation of the bronze standard method: (1) Baladin and (2) Yasmina are intensity-based. The former uses a block matching strategy while the latter optimizes a similarity measure on the complete images using the Powell algorithm. (3) CrestMatch is a prediction-verification method and (4) PFRegister is based on the ICP algorithm. Both CrestMatch and PFRegister register features (crest lines) extracted from the input images. These algorithms are further described in [9]. Figure 1 illustrates the application workflow. Each box in figure 1 represents an algorithm and arrows show computation dependencies.
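The composition rule and the criterion of equation (1) in Section 1.3 can be illustrated with a small Python sketch. To stay short it makes several simplifying assumptions that are not part of the Bronze Standard method itself: transformations are 2D rigid (one angle and a 2D translation) instead of 3D, the distance is a plain, non-robust left invariant distance with hypothetical scale factors `sigma_r` and `sigma_t`, and the minimization itself is left out.

```python
import math
from typing import NamedTuple


class Rigid2D(NamedTuple):
    """2D rigid transformation x -> R(theta) x + (tx, ty)."""
    theta: float
    tx: float
    ty: float


def compose(a, b):
    """Return a o b, i.e. the transformation x -> a(b(x))."""
    c, s = math.cos(a.theta), math.sin(a.theta)
    return Rigid2D(a.theta + b.theta,
                   c * b.tx - s * b.ty + a.tx,
                   s * b.tx + c * b.ty + a.ty)


def inverse(t):
    """Return the inverse transformation y -> R(theta)^T (y - t)."""
    c, s = math.cos(t.theta), math.sin(t.theta)
    return Rigid2D(-t.theta, -(c * t.tx + s * t.ty), -(-s * t.tx + c * t.ty))


def chained(free, i, j):
    """Predicted transformation between images i and j (1-based indices).

    free[k - 1] holds the free transformation between images k and k + 1.
    """
    if i == j:
        return Rigid2D(0.0, 0.0, 0.0)
    if i > j:
        return inverse(chained(free, j, i))
    result = free[i - 1]
    for k in range(i + 1, j):
        result = compose(result, free[k - 1])
    return result


def distance_sq(a, b, sigma_r=1.0, sigma_t=1.0):
    """Squared left invariant distance (simplified, non-robust stand-in for d)."""
    rel = compose(inverse(a), b)
    angle = math.remainder(rel.theta, math.tau)  # wrap to [-pi, pi]
    return (angle / sigma_r) ** 2 + (rel.tx ** 2 + rel.ty ** 2) / sigma_t ** 2


def criterion(free, observations):
    """Sum of squared distances between observed and predicted transformations.

    `observations` is a list of (i, j, transformation) triples, one per
    image pair and registration algorithm.
    """
    return sum(distance_sq(obs, chained(free, i, j)) for i, j, obs in observations)
```

An optimizer would search for the n−1 entries of `free` that minimize `criterion`. The leave-one-out variant described above would simply drop the observations of the method under evaluation before minimizing.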

2. Enacting the application workflow on the EGEE production grid

Even though registration computations are usually tractable on simple PCs, the large amount of input data and the number of registration algorithms needed to compute the bronze standard make this method very compute intensive. A grid infrastructure can handle the load of the computations involved and help in managing the medical image database to process.



2.1. EGEE infrastructure

In order to evaluate the relevance of our prototype and to compare real executions to theoretically expected results, we made experiments on the EGEE production grid infrastructure¹. This platform is a pool of thousands of computers (standard PCs) and storage resources accessible through the LCG2 middleware². The resources are assembled in computing centers, each of them running its internal batch scheduler. Jobs are submitted from a user interface to a central Resource Broker which distributes them to the available resources. On such a grid infrastructure, the application parallelism can be exploited to optimize the execution time. Several instances of each service will be concurrently submitted to the grid and executed on different processors.

¹ Enabling Grids for E-sciencE, http://www.eu-egee.org
² LCG2 middleware, http://lcg.web.cern.ch/LCG/activities/middleware.html

2.2. Application workflow

The Bronze Standard application is composed as a workflow of algorithms represented in figure 1. The two input image sources on top correspond to the image sets on which the evaluation is to be processed. The upper box corresponds to an initialization needed for the registration algorithms. Then come the registration algorithms themselves and the format conversion and result collection services. Finally, the bottom (gray) service is responsible for the evaluation of the accuracy of the registration algorithms, leading to the output values of the workflow. It computes means from the results of all the registration services considered but one, and evaluates the accuracy of the specified registration method. This service has to be synchronized: it must be enacted only once all data have been processed in the workflow. The six services with a triple contour are compute intensive initialization and registration algorithms, while the other boxes represent more lightweight computation steps such as data format transformations.

2.3. Medical workflows

Similarly to the Bronze Standard application presented above, medical image analysis procedures are often not based on a single image processing algorithm but rather assembled from a set of basic tools dedicated to processing the data, modeling it, extracting quantitative information, and analyzing results. Provided that interoperable algorithms are packaged in software components with a standardized interface enabling data exchange, it is possible to build complex workflows to represent such data analysis procedures. High level tools for expressing and handling the computation flow are therefore expected to ease the development of computerized medical experiments. When dealing with medical experiments, the user often needs to process data sets made of e.g. hundreds of individual images. The workflow management is therefore data driven, and the scheduler responsible for sharing the load of computations should take into account the input data sets as well as the workflow graph topology.
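The two notions used above, a workflow graph whose topology constrains the enactment order and a synchronized service that fires only once all data have been processed, can be sketched in Python. The service names and dependencies below are hypothetical placeholders (the real graph is the one of figure 1), and the sketch relies on the standard library `graphlib` module, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each service maps to the set of services it depends on.
workflow = {
    "initialization": set(),
    "registration_A": {"initialization"},
    "registration_B": {"initialization"},
    "format_conversion": {"registration_A", "registration_B"},
    "evaluation": {"format_conversion"},
}


def enactment_order(graph):
    """Return one service ordering compatible with the graph dependencies."""
    return list(TopologicalSorter(graph).static_order())


class SynchronizedService:
    """Service enacted only once, after all expected results have arrived."""

    def __init__(self, expected_count, function):
        self.expected_count = expected_count
        self.function = function
        self.results = []

    def push(self, result):
        """Store one result; run the function when the last one arrives."""
        self.results.append(result)
        if len(self.results) == self.expected_count:
            return self.function(self.results)
        return None
```

A data driven scheduler would combine both pieces: services without mutual dependencies (here the two hypothetical registration services) may run concurrently for every input image, while the final evaluation waits behind a barrier such as `SynchronizedService`.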



Figure 1. MOTEUR interface representation

3. MOTEUR workflow engine

We implemented an hoMe-made OpTimisEd scUfl enactoR (MOTEUR) prototype to manage application workflows. MOTEUR is written in Java, in order to be platform independent. It is available under the CeCILL Public License (a GPL-compatible open source license) at http://www.i3s.unice.fr/~glatard. The workflow description language adopted is the Simple Concept Unified Flow Language (Scufl) used by the Taverna workbench [10]. Figure 1 shows the MOTEUR web interface representing a workflow that is being executed. Each service is represented by a colored box and data links are represented by curves. The services are color coded depending on their current status: gray services have never been executed; green services are running; blue services have finished processing all the input data available; and yellow services are not currently running but are waiting for input data to become available.

MOTEUR is interfaced to the job submission interfaces of both the EGEE infrastructure and the Grid5000³ experimental grid. In addition, the execution of lightweight jobs can be orchestrated on local resources. MOTEUR is able to submit different computing tasks to different infrastructures during a single workflow execution.

³ Grid5000, http://www.grid5000.org

3.1. Service-based approach

To handle user processing requests, two main strategies have been proposed and implemented in grid middlewares:



1. In the task based strategy, also referred to as global computing, users define computing tasks to be executed. Any executable code may be requested by specifying the executable code file, the input data files, and the command line parameters to invoke the execution. The task based strategy, implemented in the GLOBUS [3], LCG2 or gLite⁴ middleware for instance, has already been used for decades in batch computing. It makes the use of non grid-specific code very simple, provided that the user knows the exact syntax to invoke each computing task.
2. The service based strategy, also referred to as meta computing, consists in wrapping application codes into standard interfaces. Such services are seen by the middleware as black boxes for which only the invocation interface is known. The services paradigm has been widely adopted by middleware developers for the high level of flexibility that it offers. However, this approach is less common for application code as it requires all codes to be instrumented with the common service interface.

The service-based approach is naturally very well suited for chaining the execution of different algorithms assembled to build an application. Indeed, the interface to each application component is clearly defined and the middleware can invoke each of them through a single protocol. In addition, the service-based approach offers great flexibility for managing applications requiring the processing of complete image databases, such as the Bronze Standard described above. The input data are treated as input parameters, and the service appears to the end user as a black box hiding the code invocation. When a service deals with two or more input data sets, the semantics of the service with regard to the data composition needs to be specified.
MOTEUR implements two data composition patterns:

• The one-to-one composition: the i-th input of the first data set $\{A_i\}_{i \in [1,m]}$ is processed with the i-th input of the second data set $\{B_i\}_{i \in [1,n]}$, thus producing min(m, n) output data.
• The all-to-all composition: all inputs of $\{A_i\}_{i \in [1,m]}$ are processed with all inputs of $\{B_i\}_{i \in [1,n]}$, thus producing m × n output data.

The use of these two composition strategies, embedded in the Scufl language, significantly enlarges the expressiveness of the workflow language. It is a powerful tool for expressing complex data-intensive processing applications in a very compact format. MOTEUR implements an interface to both Web Services [13] and GridRPC [8] application services. We developed an XML-based language to describe input data sets. This language aims at providing a file format to save and store the input data set in order to be able to re-execute workflows on the same data set.

⁴ gLite middleware, http://www.glite.org

3.2. Enabling legacy codes

In the service based approach, all application codes need to be wrapped into a standard service envelope. This increases the code complexity on the application developer side and prevents the use of legacy code, which cannot necessarily be modified and recompiled for various reasons. In order to overcome this limitation, we have developed a legacy code application wrapping service similar to GEMLCA [5]. The idea is to propose a standard web service capable of submitting any legacy executable to the target grid infrastructure. This generic application service dynamically composes the executable invocation command line before submission. For this purpose, it needs a description of the executable command line parameters. We have defined a simple XML-based parameter description format. For each legacy code to gridify, the user only needs to produce the corresponding XML document. The generic service takes as input both the executable and the description document. The generic application service is installed on the grid user interface and does not require any deployment on the grid computing resources. It submits jobs to the grid through the standard workload management system.

3.3. Optimizing the execution of data intensive applications

Some workflow managers, such as the CONDOR DAGMan⁵, have adopted the task-based approach, coupling processings and data to be processed. This static and complete description of the graph of tasks to be executed eases the optimization of the workflow execution, as it provides all the information necessary for mapping the workflow and data to available resources (see for instance the Pegasus system [2]). However, it deals poorly with large data sets since a new task needs to be explicitly written for each input data to be processed. In service-based workflow managers such as MOTEUR, Kepler [6], Taverna [10] or Triana [12], each processor invokes external services whose data is dynamically transmitted as a parameter. However, the services invocation is an extra layer between the workflow manager and the execution grid infrastructure.
The workflow manager has no direct access to the grid resources and therefore cannot directly optimize the scheduling of job submissions. Performance is critical for data-intensive applications, and MOTEUR implements several optimization strategies to speed up workflow execution by exploiting the massively parallel resources available on the grid infrastructure.

Workflow parallelism. The workflow has an inherent degree of parallelism, as several independent services may be invoked in parallel, asynchronously, by the workflow engine.

Data parallelism. The computations described in the workflow can be performed independently for each input data segment. When dealing with large input data sets, processing all these data in parallel on different grid resources is a considerable potential optimization. Although quite obvious, data parallelism is not straightforward to implement. Indeed, parallel execution over different data loses the computation order (one data segment can overtake another in the workflow) and may cause causality problems if the ordering is not re-established. MOTEUR’s strategy to avoid this problem is to associate with each processed data segment a complete history tree of the former processings that unambiguously describes the data provenance. To deal with the all-to-all composition strategy, MOTEUR also keeps in memory all data segments sent to the input of each service. Thus, when a delayed data segment arrives, it can be composed with all formerly identified input data by repeated invocations of the service.

Services parallelism. The computations of different services over different input data sets can overlap in time. Parallel computing of such tasks enables a pipelining optimization similar to the one exploited inside CPUs. Theoretically, this service parallelism should not bring an extra level of parallelism when data parallelism is exploited. If all data could be processed in parallel in constant time, there would be no overlap of successive services. In practice though, execution times on a loaded production infrastructure are highly variable and unpredictable. The desynchronization of the computations creates the need for the service parallelism optimization.

Jobs grouping. Finally, sequential jobs may be grouped and executed together to lower the number of service invocations and minimize the grid overhead resulting from job submission, scheduling and data transfers. Jobs grouping is not feasible in general on a service-based infrastructure, as services are completely independent and can only be invoked separately by the workflow engine. However, the internal logic of all services implemented through the generic wrapping service is known. The workflow engine is thus capable of translating the calls to two consecutive generic services into a call to a single service submitting a compound job with two consecutive executable command line invocations.

To our knowledge, MOTEUR is the first service-based workflow manager implementing all these levels of parallelism.

5 CONDOR DAGMan, http://www.cs.wisc.edu/condor/dagman
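The command line composition performed by the generic wrapping service, and the jobs grouping that it makes possible, can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the paper does not show the actual XML description format, so the element and attribute names (`executable`, `param`, `option`), the executable names and the helper functions are hypothetical, and the compound job is modelled simply as two command lines chained in one shell command.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical parameter descriptions. The element and attribute names are
# assumptions; the paper does not show the real description format.
EXTRACT = """
<executable name="extract_features">
  <param name="image" option="-in"/>
  <param name="features" option="-out"/>
</executable>
"""

REGISTER = """
<executable name="register">
  <param name="reference" option="-ref"/>
  <param name="floating" option="-flo"/>
  <param name="transform" option="-out"/>
</executable>
"""


def compose_command(description_xml, values):
    """Build one command line from an XML description and parameter values."""
    root = ET.fromstring(description_xml)
    tokens = [root.get("name")]
    for param in root.findall("param"):
        name = param.get("name")
        if name not in values:
            raise KeyError(f"missing value for parameter {name!r}")
        tokens += [param.get("option"), values[name]]
    # Quote every token so that file names containing spaces survive the shell.
    return " ".join(shlex.quote(token) for token in tokens)


def group_jobs(*commands):
    """Chain sequential command lines into a single compound job.

    '&&' stops the chain as soon as one step fails, so a later step never
    runs on the missing output of an earlier one.
    """
    return " && ".join(commands)


if __name__ == "__main__":
    step1 = compose_command(
        EXTRACT, {"image": "floating.hdr", "features": "floating.feat"}
    )
    step2 = compose_command(
        REGISTER,
        {
            "reference": "reference.feat",
            "floating": "floating.feat",
            "transform": "result.trf",
        },
    )
    # One submission instead of two: the grid overhead is paid only once.
    print(group_jobs(step1, step2))
```

Because the wrapper knows how each command line is built, two consecutive invocations can be merged before submission. An opaque third-party web service offers no such handle, which is why grouping is restricted to services implemented through the generic wrapper.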

4. Results and conclusions

MOTEUR is evaluated on the Bronze Standard application in a realistic experimental setting. We executed our workflow on input data sets of various sizes. Input image pairs are taken from a database of injected T1 brain MRIs from the cancer treatment center “Centre Antoine Lacassagne” in Nice, France, courtesy of Dr Pierre-Yves Bondiau. All images are 256×256×60 voxels coded on 16 bits, leading to a size of 7.8 MB per image. Each input image pair was registered with the 4 algorithms, leading to 6 grid job submissions (triple contour services in figure 1). The 4 rigid registration algorithms used reached a sub-voxel accuracy of 0.15 degrees in rotation and 0.4 mm in translation for the registration of these images.

4.1. MOTEUR performances

The first experiment, reported in figure 2, compares the performance of MOTEUR with that of the Taverna workflow manager [10]. Taverna is a service-based workflow manager targeting bioinformatics applications that is being developed in the UK eScience MyGrid project. Taverna has become a reference workflow manager in the eScience community. The figure displays the execution times obtained with Taverna and MOTEUR with respect to the number of input data sets. It shows that MOTEUR introduces an average speed-up of 2.03. Even more interesting, this speed-up grows with the number of input data sets to process. The performance gain is due to the full exploitation of data and services parallelism: Taverna does not provide service parallelism, and its data parallelism is limited to a fixed number of parallel invocations.

[Figure 2, from T. Glatard et al. / Medical Image Registration Algorithms Assessment: plot of execution time (s) against the number of input image pairs for Taverna-EGEE and MOTEUR-EGEE]

Figure 2. Execution times of MOTEUR vs Taverna on the EGEE production infrastructure

The second experiment, reported in figure 3, quantifies the performance gain introduced by the different levels of optimization implemented in MOTEUR. We executed the Bronze Standard workflow on 3 different input data sets composed of 12, 66 and 126 image pairs, corresponding to images from 1, 7 and 25 patients respectively. In total, the workflow execution resulted in 6 times as many job submissions as image pairs (72, 396 and 756 jobs respectively). We computed the Bronze Standard with different optimization configurations in order to identify the specific gain provided by each optimization. The reference curve (plain curve, labeled NOP) corresponds to a naive execution where only workflow parallelism is activated. The Job Grouping optimization (JG curve) reduces the job submission overhead as expected. The time gain is almost constant, independent of the number of input data. Unsurprisingly, the most drastic optimization for this data intensive application is Data Parallelism (DP curve). The speed-up grows with the number of images to be processed (the slope of the DP curve is lower than that of the reference curve). Theoretically, the DP curve should be horizontal (no overhead introduced by the increasing number of data) given that the number of grid processing units exceeds the number of jobs submitted. However, the EGEE grid is exploited in production mode (24/7 workload) by a large multi-user community. Therefore, the Service Parallelism optimization (SP+DP curve) further improves performance. Finally, combining all these optimizations (SP+DP+JG curve) provides the best result. The final speed-up is higher than 9.1 when considering the largest scale experiment.

[Figure 3: plot of execution time (hours) against the number of input image pairs for the NOP, JG, DP, SP + DP and SP + DP + JG configurations]

Figure 3. Comparison of the execution times obtained for different optimization configurations

4.2. Conclusions

Data intensive applications are common in the medical image analysis community and there is an increasing need for computing infrastructures capable of efficiently processing large image databases. The Bronze Standard application is a concrete example of registration algorithm assessment with an important impact on medical image analysis procedures. The application is assembled from a set of legacy code components, wrapped into a generic web service and enacted on the EGEE grid through the MOTEUR workflow enactor. We demonstrated MOTEUR’s capabilities and performance. This workflow engine conforms to the Scufl workflow description language. It implements interfaces to Web and GridRPC services. MOTEUR has been interfaced to the EGEE production grid infrastructure and the Grid5000 experimental infrastructure. The workflow execution is optimized using different parallelization strategies enabling the exploitation of the grid parallel resources. MOTEUR is freely available for download under a GPL-like license.

Acknowledgments

This work is partly funded by the French research program “ACI-Masse de données” (http://acimd.labri.fr/), AGIR project (http://www.aci-agir.org/). We are grateful to the EGEE European IST project (http://www.eu-egee.org) for providing the infrastructure used in the experiments presented.

References

[1] H. Benoit-Cattin, F. Bellet, J. Montagnat, and C. Odet. Magnetic Resonance Imaging (MRI) Simulation on a Grid Computing Architecture. In Biogrid’03, proceedings of the IEEE CCGrid03, Tokyo, Japan, May 2003.
[2] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, et al. Mapping abstract complex workflows onto grid environments. Journal of Grid Computing, 1(1):9–23, 2003.
[3] Ian Foster. Globus Toolkit Version 4: Software for Service-Oriented Systems. In International Conference on Network and Parallel Computing (IFIP), volume 3779, pages 2–13. Springer-Verlag LNCS, 2005.
[4] P. Jannin, J.M. Fitzpatrick, D.J. Hawkes, X. Pennec, R. Shahidi, and M.W. Vannier. Validation of medical image processing in image-guided therapy. IEEE Trans. on Medical Imaging, 21(12):1445–1449, December 2002.
[5] Péter Kacsuk, Ariel Goyeneche, Thierry Delaitre, Tamás Kiss, Zoltán Farkas, and Tamás Boczko. High-Level Grid Application Environment to Use Legacy Codes as OGSA Grid Services. In Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing (GRID ’04), pages 428–435, Washington, DC, USA, 2004. IEEE Computer Society.
[6] Bertram Ludäscher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A. Lee, Jing Tao, and Yang Zhao. Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice & Experience, 2005.
[7] J. Montagnat, F. Bellet, H. Benoit-Cattin, V. Breton, L. Brunie, H. Duque, Y. Legré, I.E. Magnin, L. Maigne, S. Miguet, J.-M. Pierson, L. Seitz, and T. Tweed. Medical images simulation, storage, and processing on the European DataGrid testbed. Journal of Grid Computing, 2(4):387–400, December 2004.
[8] Hidemoto Nakada, Satoshi Matsuoka, K. Seymour, J. Dongarra, C. Lee, and Henri Casanova. A GridRPC Model and API for End-User Applications. Technical report, Global Grid Forum (GGF), July 2005.
[9] Stéphane Nicolau, Xavier Pennec, Luc Soler, and Nicholas Ayache. Evaluation of a New 3D/2D Registration Criterion for Liver Radio-Frequencies Guided by Augmented Reality. In International Symposium on Surgery Simulation and Soft Tissue Modeling (IS4TM’03), volume 2673 of LNCS, pages 270–283, Juan-les-Pins, 2003. INRIA Sophia Antipolis, Springer-Verlag.
[10] Tom Oinn, Matthew Addis, Justin Ferris, Darren Marvin, Martin Senger, Mark Greenwood, Tim Carver, Kevin Glover, Matthew R. Pocock, Anil Wipat, and Peter Li. Taverna: A tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054, 2004.
[11] X. Pennec, R.G. Guttman, and J.-P. Thirion. Feature-Based Registration of Medical Images: Estimation and Validation of the Pose Accuracy. In Medical Image Computing and Computer-Assisted Intervention (MICCAI’98), volume 1496 of LNCS, pages 1107–1114, Cambridge, USA, October 1998. Springer.
[12] Ian Taylor, Ian Wang, Matthew Shields, and Shalil Majithia. Distributed computing with Triana on the Grid. Concurrency and Computation: Practice & Experience, 17(1–18), 2005.
[13] World Wide Web Consortium (W3C). Web Services Description Language (WSDL) 1.1, March 2001.


Part II Ethical, Legal and Privacy Issues on HealthGrids




The Ban on Processing Medical Data in European Law: Consent and Alternative Solutions to Legitimate Processing of Medical Data in HealthGrid

Jean Herveg
Maître de conférences aux FUNDP – Faculté de Droit – D.E.S. D.G.T.I.C.
Centre de Recherches Informatique et Droit
Avocat au barreau de Bruxelles

Abstract. Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data bans the processing of medical data owing to their highly sensitive nature. Fortunately, the Directive provides that this ban does not apply in seven cases. The paper first explains the reasons for this ban. It then describes the conditions under which medical data may be processed under European law. In particular, it investigates the strengths and weaknesses of the data subject’s consent as a basis of legitimacy for the processing of medical data. It also considers the six other alternatives for legitimating the processing of medical data.

Keywords. Processing of Medical Data – Legitimacy – European Law – HealthGRID

INTRODUCTION

1. Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data [1] bans the processing of personal data concerning health (medical data) [2]. Naturally, this prohibition applies equally to the processing of medical data in HealthGrid. This petitio principii could have led to serious problems, notably for HealthGrid, if the Directive had not provided that this ban does not apply in several cases [3]. Before considering these exceptions, it seems relevant to recall the reason for this ban, particularly since it apparently conflicts with the free movement of personal data [4].

1. THE BAN ON PROCESSING MEDICAL DATA

2. The regulation of the processing of personal data is based upon two main ideas. The first idea is that economic, social, cultural and individual activities, whether public or private, require to varying extents the processing of information relating to natural persons. The second idea, intimately bound to the first, is that natural persons must be protected against any infringement of their fundamental rights and freedoms that might arise from the processing of information relating to them. In other words, the processing of personal data is frequently needed for multiple good reasons. But, at the same time, the processing of personal data exposes natural persons to grave risks of discrimination or of infringement of their fundamental rights and freedoms. For this reason, the processing of personal data must comply with several rules expressing the balance between all the interests at stake.

In this context Directive 95/46/EC aims to ensure the protection of the fundamental rights and freedoms of natural persons (data subjects), and in particular their right to privacy with respect to the processing of personal data [5]. This protection requires regulating the processing of personal data in order to prevent any infringement of the fundamental rights and freedoms of the data subject. To be effective and coherent, this regulation has to be built on the analysis of the risks capable of affecting the fundamental rights and freedoms of the data subject. Only when these risks are identified is it possible to determine the conditions under which personal data can be processed in full respect of the fundamental rights and freedoms of data subjects. This risk assessment is particularly important since recent developments in Information and Communication Technologies have multiplied the possibilities to process personal data and therefore increased the risks of infringement of the fundamental rights and freedoms of the data subject. The use of a new technology such as HealthGrid should naturally prompt an assessment of the new risks attached to its implementation, especially in healthcare regarding the protection of medical data.

3. The general principle is that the risk of infringement of the rights and freedoms of the data subject does not depend on the information content but on the purpose of the processing of personal data. In other words, the potential or actual danger for the fundamental rights and freedoms of the data subject has to be assessed with regard to the purpose of the processing of personal data. But the principle is slightly, though not entirely, different for sensitive data [6]. It is commonly accepted that the content of these data alone already exposes the data subject to the risk of infringement of his or her fundamental rights and freedoms, whatever the purpose of the data processing. Put differently, any use of sensitive data is liable to create grave risks of discrimination for the data subject. Therefore sensitive data require special protection taking into account their content and the purpose of their processing. To this end the Directive states that “data which are capable by their nature of infringing fundamental freedoms or privacy should not be processed (…)” [7]. The ban on processing medical data is the special protection provided by the Directive to ensure respect for the fundamental rights and freedoms of the data subject regarding the processing of his or her medical data. Hence the ban on processing medical data should not be seen as opposed to the free movement of personal data. The ban is more a limit than an exception to the free movement of personal data. In fact the free movement of personal data can only be conceived in full respect of the fundamental rights and freedoms of the data subject, and this respect includes the ban on processing medical data.

2. EXCEPTIONS TO THE BAN ON PROCESSING MEDICAL DATA

4. Nevertheless the Directive grants permission to process medical data in seven hypotheses. In these, the legitimacy of the processing of medical data (the balance between the interests at stake [8]) is formally presumed (cf. infra the necessity to really assess its legitimacy). This is because, in principle, the situations described in these hypotheses should justify the processing of medical data, without prejudice to the other conditions ensuring the lawfulness of the data processing. These exceptions to the ban on processing medical data must be interpreted restrictively. The processing of medical data is strictly forbidden beyond these exceptions. The first hypothesis granting permission to process medical data is the consent of the data subject. The data subject’s consent is frequently presented as the natural basis for the legitimacy of the processing of medical data.

2.1. The consent of the data subject

5. According to the Directive the ban on processing medical data does not apply where the data subject has given his or her explicit consent to the processing of his or her medical data [9]. In this case the Directive entrusts the data subject with the power to authorize the processing of his or her medical data [7]. This empowerment of the data subject is without any doubt a very strong expression of his or her informational self-determination, that is, the power of the data subject over his or her personal data [10]. But this empowerment could also be surprising. Is the data subject always capable of deciding in a reasonable way about the processing of his or her medical data? Isn’t it too dangerous to give the data subject such power when most of the time he or she is the “weakest” party, or at least the “demanding” person, in the processing of his or her medical data? For example, how could a patient oppose the processing of his or her medical data for a scientific purpose (e.g. for a clinical trial) before surgery or any other investigation? How can the validity of the data subject’s consent be ensured and a complete masquerade avoided? This empowerment of the data subject should not be seen as unlimited or uncontrolled. In fact, when given this power, the data subject has to evaluate the interest(s) that could justify the processing of his or her medical data. To this end the data subject has to weigh the interests at stake correctly and act accordingly. Otherwise the consent will not be able to legitimate the processing of his or her medical data (see infra about the real control of the legitimacy of the processing of medical data and the determination of the interests at stake). The Directive confirms this analysis.



6. Under the Directive, the data subject’s consent means “any freely given specific and informed indication of his wishes by which the data subject signifies his agreement to personal data relating to him being processed” [11]. First, the consent has to be indubitable and indisputable. Then the consent of the data subject must have been freely given. In this regard the consent has to be free of any defect, constraint or pressure. Accordingly, a direct benefit (such as the benefit for his or her health) or an indirect benefit (such as participation in the progress of medical science) for the patient should not automatically affect the validity of the data subject’s consent. Would financial remuneration of the data subject (beyond the coverage of any expenses he or she incurs) invalidate his or her consent? Again, the answer to this question should not be absolute. It should depend on the circumstances of each case and on how the applicable law deals with the protection of the data subject. Moreover, the consent of the data subject has to be specific and informed. The requirement of specificity means that the data subject must know exactly what he or she consents to. This necessarily implies prior and adequate information of the data subject concerning the processing of his or her medical data. Without this prior and adequate information the consent of the data subject cannot be specific, and such consent could in no case ground the processing of his or her medical data. The next question is therefore the level of detail of the information to be provided to the data subject. Articles 10 and 11 of the Directive determine the minimum content of this information. The latter must permit the complete enforcement of all the aspects of the data processing, such as the data quality, the data subject’s rights, the security and confidentiality measures, the notification to the supervisory authority, etc. However, there is no doubt that the information has to be all the more accurate and complete when data as sensitive as medical data are processed. In any case the data subject may not give an unspecified or uninformed consent to the processing of his or her medical data. Further processing of medical data is prohibited when incompatible with the initial purpose for which the data have been collected. The consent must be given prior to the time of the data collection. It need not be given at the same time as the collection; it only has to be obtained prior to the processing.

7. The consent of the data subject must be explicit to allow the processing of his or her medical data [12]. A contrario, the requirement of an explicit consent should exclude any implicit consent, whatever this last notion could mean. Accordingly, beyond the indisputable character of the data subject’s consent, its explicit character presumes that it has been expressed. Several Member States have decided to transpose this requirement by asking for written consent from the data subject. However, explicit consent could be deduced from some other behaviour of the data subject, especially in view of the circumstances of the case. Indeed some positive actions could express the explicit consent of the data subject to the processing of his or her medical data, such as participation in a foundation fighting against the disease affecting the data subject, or a request to be treated in a special medical unit widely known to be a research unit.

8. In all these circumstances the consent of the data subject induces a presumption of legitimacy of the processing of his or her medical data. It is assumed that the data subject has correctly assessed the interests at stake and acted accordingly. If the data subject has not correctly assessed the interests at stake and if these interests are not respected, his or her consent will not legitimate the processing of his or her medical data. The latter will not be legitimate on this ground. In other words, the consent of the data subject does not exonerate the data controller from pursuing a legitimate purpose (which implies a balance between the interests at stake), and the consent of the data subject may not cover an illegitimate interest or a lack of interest in the data processing.

9. The Directive provides that Member States may exclude the possibility for the sole consent of the data subject to lift the prohibition on processing medical data [13].

10. In any case the data subject may always revoke his or her consent to the processing of his or her medical data. What are the consequences of this revocation? Does it mean that, in the future, new operations upon the data subject’s medical data will no longer be possible (without any effect on the existing data processing), or do we have to consider that the operations carried out upon the medical data on the ground of the initial consent of the data subject may not be pursued? Since the data subject has revoked his or her initial consent, there is no longer a legitimate basis for the processing of the medical data. The operations may not be pursued. That does not mean that the past operations carried out upon the medical data of the data subject are now unlawful. It simply means that they cannot be pursued except on the ground of another basis of legitimacy.

11. Finally, the Directive gives no formal indication of the nature of the consent given by the data subject or of the possible contractual relationship between the data controller and the data subject. In our view the solution to these questions depends on how the applicable law deals with the relationship between the data controller and the data subject and with the relationship between the data subject and his or her personal data. In any case the possible contract should obey the special rules imposed through the transposition of the Directive into the applicable law, such as the characteristics of the data subject’s consent, the data quality, the data subject’s rights, the security and confidentiality measures, the notification to the supervisory authority, etc. The applicable law also determines the capacity to consent of underage or disabled persons. In view of the above, it is not certain that the consent of the data subject represents the best solution to ground the legitimacy of the processing of medical data in HealthGrid. Fortunately the Directive provides alternative solutions to legitimate the processing of medical data.



2.2. Carrying out obligations and specific rights of the data controller in the field of employment law

12. The ban on processing medical data does not apply where the “processing is necessary for the purposes of carrying out the obligations and specific rights of the controller in the field of employment law in so far as it is authorized by national law providing for adequate safeguards” [14]. Accordingly, the purpose of the data processing is only to allow the data controller to fulfil his obligations and rights in matters of employment law, the latter having to be specific. This hypothesis seems to cover medical inspection. The processing of medical data then has to be necessary, and not merely useful, for this purpose. Therefore the data controller has to prove the necessity of processing medical data to carry out his obligations and specific rights in the field of employment law. Finally, this kind of processing has to be authorized by the applicable law providing for adequate safeguards, which are not further specified.

2.3. Vital interests

13. The third hypothesis allowing the processing of medical data is where “processing is necessary to protect the vital interests of the data subject or of another person where the data subject is physically or legally incapable of giving his consent” [15]. The notion of “vital interest” means expressly and exclusively the situation of an imminent danger to the life of a natural person. This covers the protection of the vital interests of the data subject but also of any other natural person. However, in this last situation the Directive adds that the data subject must be physically or legally incapable of consenting to the processing of his or her medical data. It cannot be deduced from this provision that a data subject who is physically or legally capable of consenting could, without any consequence, refuse to authorize the processing of his or her medical data when the vital interests of another person are at stake. This behaviour should be assessed under the applicable law.

2.4. Non profit organisation

14. The processing of medical data could be legitimate where the “processing is carried out in the course of its legitimate activities with appropriate guarantees by a foundation, association or any other non-profit-seeking body with a political, philosophical, religious or trade-union aim and on condition that the processing relates solely to the members of the body or to persons who have regular contact with it in connection with its purposes and that the data are not disclosed to a third party without the consent of the data subjects” [16]. Accordingly, the organization must have a non-profit purpose, and this purpose has to relate to the exercise of fundamental rights and freedoms [7].



2.5. Data manifestly made public and establishment, exercise or defence of legal claims

15. The ban on processing medical data does not apply where “the processing relates to data which are manifestly made public by the data subject or is necessary for the establishment, exercise or defence of legal claims” [17]. It should be recalled that, even if the data have manifestly been made public by the data subject, the processing of his or her sensitive personal data nevertheless falls within the scope of the Directive. Hence the data controller must comply with all the other conditions ensuring the lawfulness of the data processing.

2.6. Healthcare purpose

16. The ban on processing medical data does not apply “where processing of the data is required for the purposes of preventive medicine, medical diagnosis, the provision of care or treatment or the management of health-care services, and where those data are processed by a health professional subject under national law or rules established by national competent bodies to the obligation of professional secrecy or by another person also subject to an equivalent obligation of secrecy” [18]. The healthcare purpose should be interpreted broadly [19], including the management of healthcare services. The latter should include secondary purposes necessary to provide healthcare, such as medical secretariats, computer departments, etc. By contrast, this hypothesis does not include Social Security purposes or Public Health purposes (cf. infra 2.7). Medical data must be processed by a health professional, but this last notion has not been further defined. The health professional has to be subject, under national law or rules established by national competent bodies, to professional secrecy. When not carried out by a health professional, the processing may be carried out by another person if he or she is subject to an equivalent obligation of secrecy, notably due to his or her status or by way of a contractual stipulation or term. It is quite remarkable that the patient’s consent is not required to legitimate the processing of medical data. Might there be confusion with the consent to the provision of healthcare?

2.7. Reasons of substantial public interest

17. The Directive grants Member States permission to lay down additional exemptions for reasons of substantial public interest [20]. Hence the Member State has to prove in each case the real existence of the substantial public interest(s) invoked. The Directive essentially had in mind substantial public interests relating to Public Health and Social Security “especially in order to ensure the quality and cost-effectiveness of the procedures used for settling claims for benefits and services in the health insurance system (…)” [21]. It also had in mind scientific research and public statistics [21].



The cases where medical data may be processed must be laid down by national law or by decision of the supervisory authority. But Member States may only allow the processing of medical data if these exceptions are subject to suitable safeguards to protect the fundamental rights and freedoms of the data subjects, especially their right to respect for private life [21]. The Directive does not specify these safeguards. Member States must notify the European Commission of the exemptions to the ban on processing medical data adopted on this basis [22]. Member States must determine the conditions under which a national identification number or any other identifier of general application may be processed [23].

3. REAL ASSESSMENT OF THE LEGITIMACY OF THE PROCESSING OF MEDICAL DATA

18. The legitimacy of the processing of medical data is not established merely because the processing formally fits one of these exceptions to the ban on processing medical data, even with the consent of the data subject. Indeed these exceptions are only hypotheses where the legitimacy of the data processing is formally assumed. The legitimacy of the processing of medical data, that is, the balance of the interests in presence, still has to be really assessed. First the interests in presence have to be identified. Are they only the interests of the data controller and of the data subject, or should we also consider the interests of concerned third parties and of society as a whole? In our view these last two categories of interests should be taken into account when evaluating the legitimacy of the processing of medical data. Then the explicit and valid consent of the data subject creates a presumption, until proof to the contrary, that an acceptable balance exists between the interests in presence in the processing of his or her medical data. However, in this case, it is quite difficult to assume that the data subject has adequately taken into account interests other than his or her own. In any case the processing of medical data will not be legitimate if the balance between the interests in presence is not respected, even with the regular consent of the data subject.

19. But the legitimacy of the processing of medical data is definitely and very usefully strengthened by the additional consent of the data subject. That is the reason why we must firmly approve and recommend the ethical practice of obtaining the consent of the data subject when processing medical data. This practice is frequent in the conduct of clinical trials and in telematic networks in healthcare.

20. Finally, it has to be stressed that the data controller may not legitimate the processing of medical data on other bases.
That necessarily excludes the use of the hypotheses of formal legitimacy enumerated in article 7 of the Directive for non-sensitive personal data. For example, the data controller may not legitimate the processing of medical data by the balance of the interests in presence without respecting the hypotheses enumerated in article 8.

CONCLUSIONS

21. The protection of medical data requires fixing the rules applicable to the processing of medical data and hence determining its conditions. Given their highly sensitive nature, medical data require special protection that takes into account their content and the purpose of their processing. Therefore Directive 95/46/EC prohibits the processing of medical data. However, the Directive provides that this ban does not apply in several cases. In these cases the legitimacy of the processing of medical data is formally assumed, without prejudice to the other conditions ensuring the lawfulness of the data processing. These exceptions to the ban on processing medical data have to be interpreted restrictively. The explicit and valid consent of the data subject constitutes the very first source of legitimacy for the processing of his or her medical data even if, at the same time, it is the weakest basis on which to legitimate that processing, owing to the strict conditions for its validity and to the possibility for the data subject to revoke his or her consent at any time and without justification (but with reasonable notice in some cases?). Nevertheless, even if the data controller may legitimate the processing of medical data, and even with the consent of the data subject, the legitimacy of the data processing must be really assessed in each case by the balance of the interests in presence. These include the interests of the data subject, of the data controller, of concerned third parties and of society. In any case the consent of the data subject does not cover the lack of legitimacy or the illegitimacy of the processing of his or her medical data. The consent of the data subject only creates a presumption of legitimacy of the processing of medical data until proof of the contrary.
Finally, we strongly and warmly approve and recommend the ethical practice of requiring the consent of the data subject when processing medical data, even if the processing might rely on another basis of legitimacy.

Endnotes

[1] On the Directive: Y. Poullet, M.-H. Boulanger, C. de Terwangne, Th. Leonard, S. Louveaux and D. Moreau, La protection des données à caractère personnel en droit communautaire, Journal des Tribunaux de droit européen, Bruxelles, Ed. Larcier, 1997, p. 121 (in three parts).
[2] Directive 95/46/EC, art. 8.1. The notion of medical data includes all information relative to any aspect, physical or psychological, of the present, past or future health condition, good or bad, of a living or dead natural person. On the definition of medical data: Explanatory report of Convention n° 108, recital 45; Rec. (97) 5 of the Council of Europe relative to the protection of medical data, art. I of the annex; C.J.C.E., 6 Nov. 2003, Bodil Lindqvist, case C-101/01, obs. C. de TERWANGNE, « Affaire Lindqvist ou quand la Cour de justice des Communautés européennes prend position en matière de protection des données personnelles », R.D.T.I., 2004, pp. 67-99; Groupe européen d’éthique des sciences et des nouvelles technologies, avis n° 13 du 30 juillet 1999 sur les aspects éthiques de l’utilisation des données personnelles de santé dans la société de l’information.
[3] Directive 95/46/EC, art. 8.2.
[4] On the free movement of personal data: Directive 95/46/EC, art. 1.2, and recitals 3, 4, 5, 6, 7, 8, and 9.
[5] Directive 95/46/EC, art. 1.1.
[6] Usually, sensitive data are personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership and personal data concerning health or sex life.
[7] Directive 95/46/EC, recital 33.
[8] Cf. infra for the identification of these interests.
[9] Directive 95/46/EC, art. 8.2, a). The national law may provide that the data subject’s consent may not lift the prohibition.
[10] On the notion of informational self-determination: Fr. RIGAUX, La protection de la vie privée et des autres biens de la personnalité, Bruxelles, Paris, Bruylant, L.G.D.J., 1990, pp. 588-589, n° 532: « (…) La juridiction constitutionnelle a déduit du droit de la personnalité l’un de ses attributs, à savoir : « le pouvoir reconnu à l’individu et résultant de la notion d’auto-détermination, de décider en premier lieu lui-même quand et dans quelle mesure des faits relatifs à sa propre existence sont divulgués (…) Cet attribut du droit de la personnalité est appelé « droit à la maîtrise des données personnelles » (…) Il n’est toutefois pas sans limite. (…) »; Council of Europe, Resolution 1165 (1998), 26 June 1998, Droit au respect de la vie privée (24th Session), point 5.
[11] Directive 95/46/EC, art. 2, h).
[12] Directive 95/46/EC, art. 8.2, a) and recital 33.
[13] Directive 95/46/EC, art. 8.2, a).
[14] Directive 95/46/EC, art. 8.2, b).
[15] Directive 95/46/EC, art. 8.2, c).
[16] Directive 95/46/EC, art. 8.2, d).
[17] Directive 95/46/EC, art. 8.2, e).
[18] Directive 95/46/EC, art. 8.3.
[19] However the Directive seems to include only certain purposes relative to healthcare (cf. recital 33).
[20] Directive 95/46/EC, art. 8.4.
[21] Directive 95/46/EC, recital 34.
[22] Directive 95/46/EC, art. 8.6.
[23] Directive 95/46/EC, art. 8.7.



Development of Grid Frameworks for Clinical Trials and Epidemiological Studies

Richard SINNOTT, Anthony STELL, Oluwafemi AJAYI
National e-Science Centre, University of Glasgow, United Kingdom

Abstract. E-Health initiatives such as electronic clinical trials and epidemiological studies require access to and usage of a range of both clinical and other data sets. Such data sets are typically only available over many heterogeneous domains where a plethora of often legacy-based or in-house/bespoke IT solutions exist. Considerable efforts and investments are being made across the UK to upgrade the IT infrastructures across the National Health Service (NHS), such as the National Program for IT in the NHS (NPFIT) [1]. However, independent and largely non-interoperable IT solutions currently exist across hospitals, trusts, disease registries and GP practices; this includes security as well as more general compute and data infrastructures. Grid technology allows issues of distribution and heterogeneity to be overcome; however, the clinical trials domain places special demands on security and data which the Grid community has not hitherto satisfactorily addressed. These challenges are often common across many studies and trials; hence the development of a re-usable framework for the creation and subsequent management of such infrastructures is highly desirable. In this paper we present the challenges in developing such a framework and outline initial scenarios and prototypes developed within the MRC funded Virtual Organisations for Trials and Epidemiological Studies (VOTES) project [2].

1. Introduction

Clinical trials allow for the large-scale assessment of the moderate effects of treatment on various diseases and conditions. Typically the various stages of a trial involve identifying willing participants, evaluating their eligibility for the study, obtaining their consent, beginning the course of treatment and undertaking follow-up study both during and potentially long after the treatment has completed. Statistical analysis of the impact of the trials, e.g. on the efficacy of the drugs being tested, can then be undertaken. The large-scale processes involved can be broadly broken down into three areas: patient recruitment, data management, and study administration and co-ordination. Until recently clinical trials and epidemiological studies were human-intensive and paper-based. Examples include the West of Scotland Coronary Prevention Scheme (WOSCOPS) study [3] conducted at the University of Glasgow, where over 20,000 letters were sent out to eventually recruit 6595 middle-aged men (age 45-64) with a mean cholesterol of 7.0 +/- 0.6mmol. On a much larger scale the UK BioBank effort [4] will be sending many millions of letters to potential trial participants in the hope of recruiting 500,000 members of the population between 40 and 69 years of age. Not only are these solutions expensive, they are also highly inefficient and human-intensive, often with members of the population being contacted who do not meet the appropriate constraints for the given trial, e.g. their cholesterol is too high or too low,

R. Sinnott et al. / Development of Grid Frameworks

or they are on other drug treatments etc. E-health initiatives are now moving towards electronic based clinical trials which in principle offer solutions to improve how trials are set up and subsequently managed. However, establishing an electronic trial is not without its own challenges. Each individual trial will face the same kinds of challenges for recruitment, data management and study co-ordination, hence a framework supporting a multitude of trials would be extremely beneficial and is something currently being explored within the MRC funded VOTES project [2]. To establish an e-Infrastructure for clinical trials requires addressing heterogeneity and distribution of systems and data sets, and differences in general practices, e.g. how data is backed up (or not) at given sites. One of the key challenges from an IT perspective is security. The “weakest link” adage applies to security and a single site that does not take appropriate security considerations, both in terms of the technologies they have used, how they are using them and their general practices, can in principle jeopardise the security of all collaborating sites [5]. The risk of data disclosure is an ever present security risk that cannot be ignored. Ensuring that Caldicott guardians and other independent senior health professionals with strategic roles for the management of the data protection or confidentiality associated with patient data sets are involved in the decisions that influence the development of such infrastructures is crucial to their success; from their development, their acceptance, and perhaps more importantly their ethical usage. It could be argued that the immediate hurdle in establishing an electronic clinical trial is how to recruit people. 
Key sources of data in Scotland include national census data sets such as those of the General Register Office for Scotland [6], which include information such as the registration of births, marriages and deaths, as well as being the main source of family history records. Access to such information, whilst useful, does not include direct health-related information, which will likely affect the suitability of patients for a trial. Primary and secondary health care data sets are other immediate choices; however, access to and usage of these data sets will likely require ethical approval. Patients should have the opportunity to consent to their data being accessed and used. However, in running a clinical trial, statistical information is often sufficient, so information on specific patients need not be disclosed. Even here, however, questions on ethics are raised. At the very least, doctors and their patients need to be included in any data access decisions. Yet the case for establishing and running electronic clinical trials is a compelling one, with data often being stored in some digital format, albeit across a multitude of databases behind firewalls. One of the key challenges is to allow secure access to these data sets to the right people for the right purpose. High levels of security should not come at the cost of usability. A good example of this is the remote control car key: a much improved and technologically more complex security solution, yet one that is easier to use. Similarly, end users of e-Infrastructures should be largely unaware of the fine-grained security solutions that restrict and control their access to and usage of the facilities. Usability of the infrastructures is of the utmost importance to their success and take-up [7].
In this paper we describe our attempts to establish and support a Grid framework at the National e-Science Centre (NeSC) in Glasgow as part of the initial phase of the VOTES project. As this work is in its early stages, the solution presented is necessarily grounded in this specific use-case but is conducted with a view to scaling up and generalising as the project proceeds. Through this framework we expect to support the efficient establishment and subsequent conduct of clinical trials and studies. In the rest of this paper we present the technical and non-technical challenges facing the design and development of this framework, along with an outline of the early proof-of-concept prototypes currently supported. We also outline the future work of the project and the challenges still to be addressed to realise the vision of an e-Infrastructure for a range of clinical trials and studies.

2. Existing Infrastructures and Data Sets across Scotland

The VOTES project [2] is a collaborative effort between e-Science, clinical and ethical research centres across the UK, including the universities of Oxford, Glasgow, Imperial, Nottingham and Leicester. The primary focus of VOTES is to build an infrastructure to support a multitude of clinical virtual organisations. Virtual organisations (VOs) are a common concept in the Grid community and provide a conceptual framework through which the rules associated with the participants, their roles and the resources to be shared can be agreed and subsequently enforced across the Grid. VOs in the clinical trials domain are characterised by a much greater degree of emphasis on security, data access and data ownership. We term these Clinical Virtual Organisations (CVOs) since they place requirements not typical of the High Performance Computing-oriented VOs common in the wider Grid community. Rather than developing bespoke CVOs for each individual clinical trial, it is our intention to develop a framework supporting a multitude of CVOs. Each of these CVOs will be derived from the framework and adapted depending on the needs of the trial or study being conducted.
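To make the idea of a CVO derived from a shared framework more concrete, the following is a minimal sketch only. It assumes a CVO can be modelled as a set of members, their roles and agreed (role, action, resource) rules; the class names, role names and resource names are hypothetical illustrations and are not taken from the VOTES implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: holders of `role` may perform `action` on `resource`."""
    role: str
    action: str
    resource: str


@dataclass
class ClinicalVO:
    """Hypothetical CVO: a named trial with members, their roles and agreed rules."""
    trial: str
    members: dict = field(default_factory=dict)  # user name -> set of role names
    rules: set = field(default_factory=set)      # agreed Rule objects

    def derive(self, trial, extra_rules=()):
        """Create a trial-specific CVO that starts from this CVO's rules."""
        return ClinicalVO(trial=trial, rules=set(self.rules) | set(extra_rules))

    def is_permitted(self, user, action, resource):
        """Deny by default; allow only if one of the user's roles has a matching rule."""
        roles = self.members.get(user, set())
        return any(Rule(role, action, resource) in self.rules for role in roles)


# Framework-level template shared by every trial (all names are illustrative).
template = ClinicalVO(
    trial="template",
    rules={Rule("investigator", "read", "aggregate_statistics")},
)

# A trial-specific CVO adapted from the template.
trial_vo = template.derive(
    "example-trial",
    extra_rules=[Rule("trial_coordinator", "read", "recruitment_status")],
)
trial_vo.members["alice"] = {"trial_coordinator"}
trial_vo.members["bob"] = {"investigator"}

print(trial_vo.is_permitted("alice", "read", "recruitment_status"))
print(trial_vo.is_permitted("bob", "read", "recruitment_status"))
```

In this sketch the template carries the rules common to all trials, and each derived CVO adds only what its own study needs, which mirrors the intention of avoiding bespoke CVOs per trial.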
Common phases of many clinical trials and epidemiological studies, and the primary focus for core components that will exist in the VOTES Grid framework, are:
• Patient recruitment: enabling semi-automated large-scale recruitment methods for investigators conducting large-scale clinical studies in a variety of settings;
• Data collection: incorporating data entry, including intermittent connectivity to other resources such as trial-specific databases, code lists for adverse events and non-study drugs, randomization programs and support for internationalisation of case report forms;
• Study administration: supporting the administration of the study, including logging details of essential documents, enabling rapid dissemination of study documentation and co-ordinating transport of study treatment and collection of study samples.

The first step in developing a Grid framework for clinical trials is to identify the potential sources of data and services that allow access to such data. Close liaison with data providers, data owners and existing services is essential. Within the Scottish element of VOTES we are working closely with the NHS in Scotland, who have identified the following data sets and software which provide initial coverage of the sets of data needed for clinical trials and epidemiological studies (see note 1):
• The General Practice Administration System for Scotland (GPASS) [8] is the core IT application used by over 85% of clinicians and general practitioners involved in primary care across Scotland;
• Scottish Morbidity Records (SMR) [9] includes records relating to all patients discharged from non-psychiatric and non-obstetric wards in Scottish hospitals (including datasets on death, cancer, hospital admissions, etc.);
• Scottish Care Information Store (SCI Store) [10] is a batch storage system which allows hospitals to add a variety of information to be shared across the community; pathology, radiology and biochemistry lab results are just some of the data supported by SCI Store. Regular updates to SCI Store are provided by the commercial supplier using a web services interface. Currently there are 15 different SCI Stores across Scotland (with 3 across the Strathclyde region alone). Each of these SCI Store versions has its own data models (and schemas) based upon the regional hospital systems it is supporting. The schemas and the software itself are still undergoing development;
• NHS data dictionary [11] is a one-stop shop for health and social care data definitions and standards. It contains a summary of concepts for SMR datasets, including online manuals for the datasets, and information on the clinical datasets in use in healthcare and social care along with the data standards upon which they are based.

Note 1: This does not imply that this data is readily available directly, but that these are the sources of data and software with which we should eventually be interfacing.

The Scottish component of the Grid framework under development within VOTES is being targeted at these resources. Components which allow secure and ethical access to GPASS, for example, will provide a highly generic reusable solution applicable to over 85% of all practices across Scotland. Contemporaneously, solutions accessing NHS resources are also being developed by the other partners. Broadly, the challenges involved include the need for a common definition of clinical standards, the need to maintain security whilst still taking advantage of the flexibility of Grid solutions, and the need for scalability, authorization and anonymisation. The following sections address these challenges in more detail.

3. Data Federation and Distributed Security Challenges

As CVOs necessarily span heterogeneous domains, a pre-requisite to the construction of distributed queries and the aggregation or joining of the data returned is the development and use of a standard method of classification or, more generally, a common vocabulary. This includes the naming of the data sets themselves, the people involved and their roles (privileges) in the access to and usage of these data sets, amongst other things. Ideally these data and roles should be standardised so that comparisons can be drawn and queries joined together, for example across a range of clinical data sets. There are numerous developments in standards for the description of data sets used in the clinical trials domain. However, this can be an involved process depending on standards groups developing and acting on strategies put together through major initiatives such as Health-Level 7 (HL7) [12], SNOMED-CT [13] and OpenEHR (Open Electronic Health Records) [14]. There are often a wide range of legacy data sets and naming conventions which impact upon standardisation processes and their acceptance. The International Statistical Classification of Diseases and Related Health Problems version 10 (ICD-10) [15] is used for the recording of diseases and health-related problems and is supported by the World Health Organisation. In Scotland, ICD-10 is used within the NHS along with ICD version 9 and Read codes in the SMR data
sets for example. ICD-10 was introduced in 1993, but the ICD classifications themselves have evolved since the 17th century [16]. An explicit example of the problems facing large scale (international) clinical trials is the term “neoplasia” which means “new growth for benign/malignant tumours” in Northern Europe but “cancer” in Southern Europe. Hence, the type of treatment provided depends heavily on the location of the patient. Global Grid frameworks that incorporate appropriate meta-data identifying the different local data classifications can provide capabilities to address such discrepancies. The standardisation process itself may influence how readily any given standard is adopted. For example, standards developed to specific deadlines during the standardisation-making process, and standards bodies producing regular updates with solutions readily available for implementation are more likely to gain acceptance. This is also the case within the Grid community. Linking standardised data descriptions between domains so that entities and relationships within one organisational hierarchy can be mapped or understood within the context of another domain is fundamental to the development of the Grid applications proposed in VOTES. Once it has been established how meaningful comparisons can be made between the schemata of differing domains, this knowledge can be applied to a generic clinical trial that could run queries across heterogeneous domains, bringing back generic results, richer in scope and information than if single local sites had been independently queried. Information stored in clinical trials is by its nature, highly sensitive – drug treatments, conditions and diseases that patients have must be kept in the strictest confidence and the exact details should only be known about by a few privileged roles in the trial. 
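Returning to the mapping between local classifications and a common vocabulary, the sketch below illustrates one way such meta-data could be used to aggregate counts across sites. It is an illustration under stated assumptions: the site names, local codes and concept labels are placeholders invented for the example (they are not real ICD or Read codes), and the functions are hypothetical rather than part of any VOTES component.

```python
from collections import Counter

# Hypothetical meta-data: how each site's local codes map to shared concepts.
# The codes are placeholders, not real ICD or Read codes.
SITE_MAPPINGS = {
    "site_a": {"A-001": "concept:hypertension", "A-002": "concept:diabetes"},
    "site_b": {"B-17": "concept:hypertension", "B-42": "concept:diabetes"},
}

# Hypothetical per-site query results, expressed in each site's local codes.
SITE_RECORDS = {
    "site_a": ["A-001", "A-001", "A-002"],
    "site_b": ["B-17", "B-42", "B-99"],  # "B-99" has no agreed mapping
}


def to_common(site, local_code):
    """Translate a local code to the shared concept, or None if unmapped."""
    return SITE_MAPPINGS[site].get(local_code)


def federated_counts(records_by_site):
    """Aggregate counts per shared concept and report codes that could not be mapped."""
    counts = Counter()
    unmapped = []
    for site, codes in records_by_site.items():
        for code in codes:
            concept = to_common(site, code)
            if concept is None:
                unmapped.append((site, code))
            else:
                counts[concept] += 1
    return counts, unmapped


counts, unmapped = federated_counts(SITE_RECORDS)
print(dict(counts))
print(unmapped)
```

Reporting unmapped codes separately, rather than silently dropping them, keeps discrepancies between local classifications visible to whoever maintains the mappings.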
This is one of the most fundamental challenges in this work: to realise the opportunities and benefits that can be brought to this field by Grid technology while maintaining the high security standards that must be strictly adhered to. Within the Grid community VO security issues are generally grouped into the categories of:
• Authentication: the discovery of a user’s identity. This is achieved in most Grid applications by the use of the well-established Public Key Infrastructure (PKI) technology [17].
• Authorization: the discovery of that user’s privileges based on their identity. This is less well-established in the Grid community. Various software solutions are available for the establishment of user privilege assertions, including PERMIS [18] (which implements the Global Grid Forum Authz API [19]), Community Authorization Service (CAS) [20], Virtual Organization Membership Service (VOMS) [21] and Akenti [22], with no single model having been adopted over the others.
• Accounting: logging the activity of users so that they can be held accountable for their actions within a system. This is also less well-established, with many implementations coming from “home-grown” solutions within different projects. Though important in an overall security strategy, this area is usually addressed once the solid platform of authentication and authorization has been established.

Authentication in the Grid is achieved using PKI technology. This involves using a combination of public certificates and public and private keys to verify that a user is who they say they are. This is a well-established way of establishing user identity; however, it has limitations as a standalone security solution in terms of general usability, security granularity and overall scalability [23,24]. A more scalable, user-oriented solution which is being explored within the VOTES project is the Internet2 Shibboleth technology [25]. Shibboleth allows the delegation of
authentication to the local sites involved. Through agreed federations where security attributes for fine-grained authorisation are pre-agreed, users are able to access and use remote Grid resources through local (home) authentication [26,27]. Typically they will log in with their own usernames/passwords at their home institution, and the security attributes (which might include their roles in particular clinical trials, for example) are then released and used by the target site to determine whether access to the resources being requested should be granted. As well as supporting seamless single sign-on to Grid infrastructures, this model moves the whole process of identity establishment and authentication to the home site. It also minimises the potential dangers of users writing down their PKI passwords and transparently restricts what they are able to do on the remote Grid resources. In the clinical trial domain, it is paramount that site autonomy is supported. If the home site at which a user authenticates does not release all the necessary attributes as agreed within the federation, then the user will not be allowed access to and usage of the remote resource. We note that the Shibboleth model is inherently more static than the true dynamic vision of the Grid, where data and resources are found and used “on-the-fly”. This static model is, however, consistent with the clinical domain, where it is highly unlikely that new people, new data sets or new services are continually and dynamically added to or removed from the clinical environment. The issue in Grid security that is much less well-established than authentication is that of privilege management: what a user can actually do once their identity has been verified. The main issue is that of the heterogeneous nature of the domains across which the data is being federated.
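The attribute release and access decision described above can be modelled in a few lines. This is a plain Python sketch of the decision logic only; it does not use or imitate the Shibboleth software or its APIs, and the resource name, attribute names and functions are hypothetical.

```python
# Attributes that a (hypothetical) federation has agreed each resource requires.
REQUIRED_ATTRIBUTES = {
    "trial1-recruitment-service": {"role": "trial_coordinator", "trial": "trial1"},
}


def home_site_release(user_record, release_policy):
    """Home site: after local login, release only the attributes its policy allows."""
    return {k: v for k, v in user_record.items() if k in release_policy}


def target_site_decision(resource, released):
    """Target site: grant access only if every agreed attribute was released
    with the expected value; a missing attribute means access is refused."""
    required = REQUIRED_ATTRIBUTES.get(resource)
    if required is None:
        return False  # unknown resource: deny by default
    return all(released.get(k) == v for k, v in required.items())


user = {"name": "A. Coordinator", "role": "trial_coordinator", "trial": "trial1"}

full = home_site_release(user, {"role", "trial"})
partial = home_site_release(user, {"role"})  # home site withholds "trial"

print(target_site_decision("trial1-recruitment-service", full))
print(target_site_decision("trial1-recruitment-service", partial))
```

The second call is refused because the home site did not release every attribute agreed within the federation, which is the site-autonomy behaviour the text describes.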
Security policies will naturally differ between local sites, which leads to several challenges when defining and implementing policies that take account of both local and remote security concerns. These include:
• Applying a generic policy that takes account of each local policy, or linking local policies together using a standard interface.
• Dynamically enforcing these policies so that, for example, restrictions applied by a site not providing pertinent information for a particular query will not impact on the sites that are involved.
• Building a trust chain that allows local sites to authenticate to the VO and therefore, by proxy, be authenticated to limited resources at other sites without compromising protected resources at those other sites.
• Preventing inference (statistical disclosure), which arises when data is aggregated from numerous sources.
• Maintaining data ownership and enforcing ownership policies regardless of where the data might be moved to, stored or used.

In addition to authentication and authorization, another aspect of security that is essential in this domain is that of “anonymisation”. This process involves allowing less-privileged users to gather statistical data for the purposes of studies or trials without revealing the associated identifying data, which is only available to users with greater privileges. The NHS in Scotland currently achieves this by encrypting a unique number associated with all patients across Scotland: the Community Health Index (CHI) number. Once an anonymised patient has been matched for a clinical trial, this encrypted value can in principle be sent to the Practitioners Service group (http://www.psd.scot.nhs.uk/) of the NHS who will, as one of the many services that they provide, decrypt it and contact the patients directly (assuming ethical permission
has been granted for so doing) to ask if they wish to join the clinical trial. Several challenges must be overcome to support this, including ensuring that only privileged users are able to access and use data sets that include this encrypted CHI number. A further challenge is that there are currently many independent solutions across the NHS for managing infrastructures. Thus, for example, there is no standardised way in which encryption is undertaken. Hence it is often difficult or impossible to ask Practitioners Services Division (PSD) to de-anonymise an encrypted CHI number if it is generated by arbitrary NHS trusts. Pragmatic solutions overcoming the nuances of NHS systems are thus necessary. Throughout the VOTES project, continuous ethical and legal overview of the solutions being put forward and the data sets being accessed is being maintained. This includes the perceived benefits of the research for the public, and is undertaken by independent ethical oversight committees. To support this, superior security roles for oversight committee members, which allow access to all data sets and reports for given clinical trials, will be made available.

4. Initial VOTES Scenarios, Architecture and Implementation

In designing a reusable Grid framework for clinical trials, immediate restrictions are imposed on the possible architectural solutions. Thus it is unlikely that direct access to and usage of “live” NHS data sets and resources will be achieved, where direct here implies that the Grid infrastructure can issue queries to a remote NHS-controlled resource containing un-anonymised patient information, i.e. to a resource behind the NHS firewall. Nevertheless, it is possible to design solutions capturing sufficient information needed for a clinical trial without over-riding existing security solutions or assuming ethical permissions where none have been granted.
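As an illustration of the anonymisation step discussed in Section 3, the sketch below replaces a patient identifier with a protected value before a record leaves a site. It deliberately swaps in a different technique from the one the text describes: instead of reversible encryption, whose details are not given, it uses a keyed hash (HMAC) together with a lookup table held by a trusted custodian. The class, field names, key and identifier are all made up for the example; the identifier merely stands in for a CHI number.

```python
import hashlib
import hmac


class Custodian:
    """Hypothetical trusted party holding the secret key and re-identification table."""

    def __init__(self, key: bytes):
        self._key = key
        self._table = {}  # pseudonym -> original identifier

    def pseudonymise(self, identifier: str) -> str:
        """Derive a keyed pseudonym and remember how to reverse it."""
        digest = hmac.new(self._key, identifier.encode("utf-8"), hashlib.sha256)
        pseudonym = digest.hexdigest()
        self._table[pseudonym] = identifier
        return pseudonym

    def reidentify(self, pseudonym: str) -> str:
        """Only the custodian can map a pseudonym back to the identifier."""
        return self._table[pseudonym]


def export_record(record: dict, custodian: Custodian) -> dict:
    """Build the record that leaves the site: identifier replaced, name dropped."""
    return {
        "pid": custodian.pseudonymise(record["patient_id"]),
        "age": record["age"],
        "cholesterol": record["cholesterol"],
    }


custodian = Custodian(key=b"demo-key-not-for-real-use")
source = {"patient_id": "0000000000", "name": "Example Patient",
          "age": 52, "cholesterol": 7.1}
exported = export_record(source, custodian)

# A second site using its own key produces a value the first custodian cannot resolve.
other_site = Custodian(key=b"another-site-key")
foreign_pid = other_site.pseudonymise("0000000000")
```

The last two lines reflect the problem noted above: when each organisation protects identifiers in its own way, a central service cannot re-identify values it did not generate.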
Possible solutions being explored here include a push model (where anonymised NHS data sets are exported) to the academic Grid community (or to an NHS server in a demilitarised zone of the NHS). Another model is to allow the GPs and clinicians to drive the recruitment process, provided they consider that this is in the best interests of the patients. The exploration of these solutions may provide a basis for follow-up projects in this field. The following scenario presents a representative sequence of interactions demonstrating how primary care identification and recruitment of patients can be ethically achieved with patient and doctor consent. The scenario in Figure 1 is based on discussions with Scottish clinicians, NHS IT personnel and GPASS developers and is currently being prototyped in VOTES.

[Figure 1 diagram: Trials coordinator, Trials Portal, Personalised Services, GP with browser, Trial #1, Trial #2, Trial #3, Transfer Grid Node, OGSA-DAI, GPs Private Data Sets, Secure Data Repository; interactions numbered 0 to 9.]

Figure 1: Example use of patient recruitment Grid application



0. A trials coordinator logs into a portal hosting various CVOs associated with a variety of clinical trials.² At this point, a personalised environment is established based upon their specific role in the CVO (in this case, that of the trials coordinator) and the location from which they are accessing the portal. Thus they should only see the Grid services pertinent to the trials applicable to them, and hence the data sets associated with those services.
1. The trial coordinator wishes to recruit patients for a particular trial. These patient details are only available in GPs’ local (and secure) databases; extensions to this scenario dealing with access to and usage of hospital databases are also possible. Emails are sent to the GPs/hospitals with information describing the particular trial to be conducted, the general criteria applicable to matching patients and other information, e.g. financial information about partaking in the trial. The email contains a link to a Grid service (trial #1). The GPs themselves are described in policies associated with the tentative set up of a CVO for patient identification and recruitment.
2. We assume that the GP is interested in entering into the trial, i.e. they know that they have matching patients and they follow the attached link. Depending upon whether a PKI has been rolled out to this GP and a suitable certificate (e.g. using the X.509 standard) is already in the browser, or a username and password combination is used instead, the GP securely accesses the Grid service. In this scenario we assume trusted certificates are being used.
3. After extracting more information about the trial from the portal, the GP decides to download a signed XML pro-forma pre-designed for this specific trial. This is a mostly complete document describing the main information relevant to this trial as documented in the trial protocol, where the empty fields need to be filled through a query to the GP’s database.
4. The signature of the signed pro-forma document is checked to ensure its authenticity and that it has not been corrupted. If both checks succeed, the document is used as the basis for an XML query against the GP’s database (GPASS supports such an interface). This query might in turn result in further information being extracted from other resources.
5. At this point, letters describing the trial to matching patients can be automatically produced. These are used to obtain patient consent before continuing further with the trial.
6. The matching patients may then consent to entering into the trial. Note that these letters of consent may be sent directly to the trial coordinator instead of the GP as depicted here.
7. The forms are automatically completed based on the results of the queries to the GP database, digitally signed and returned to the Grid service for that particular trial (trial #1).
8. The returned signed XML document is authenticated and checks are made that the sender (the GP) is authorised to upload this document, e.g. by checking that they were one of the GPs contacted initially. The document is validated to ensure its correctness, e.g. by ensuring it satisfies the associated schema and the relevant data fields are meaningfully completed (and match the desired constraints associated with participation in the trial). At this point, the responding GP is formally added to the CVO. Further follow up information may subsequently be sought, e.g. monitoring information related to the matching patients.
9. The completed XML document and the associated meta-data describing the history of how this information was established, by whom, when, for which trial etc. are uploaded and securely added to the CVO repository for this particular trial.

² Of course there are scenarios which predate this one, e.g. how the CVO is established in the first instance and the policies by which the VO will be organised, managed and enforced.

It is important to note in this scenario that patient consent is given (step 6) before patient data is returned to the clinical trials team. Another important aspect here is that the GP can decide whether this might be in the patients’ interest. The patient may ultimately say no and hence is always involved in the process. We note that software solutions already exist for several parts of this scenario, e.g. automatic production of letters inviting patients to join the trial. Similar scenarios covering user-resource interactions are being developed and implemented within VOTES supporting secondary care patient recruitment as well as general data collection and study management.

In this scenario we include a secure repository accessible via the Open Grid Services Architecture Data Access and Integration (OGSA-DAI) middleware [28]. This repository forms part of what we term the “Transfer Grid” as indicated in Figure 2. The Transfer Grid provides the core of the Grid infrastructure that will underpin future CVOs, i.e. it is the platform upon which the Grid solutions developed for security, data access and management, and data movement between repositories hosted at the partner and collaborating institutions can be supported. Since the Transfer Grid exists in the academic domain and not behind the NHS firewall, a variety of solutions for accessing and using the clinical trial data sets can be explored. The Grid applications pertinent to the clinical trials domain are constructed over this layer, providing the deliverable trial services.
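The checks applied to a returned pro-forma in step 8 of the scenario can be sketched as follows. This is a minimal illustration under stated assumptions, not the VOTES implementation: an HMAC-SHA256 tag stands in for the XML digital signature, and the field names, GP identifiers and the functions `sign` and `validate_return` are all hypothetical.

```python
import hashlib
import hmac
import xml.etree.ElementTree as ET

# Hypothetical values for illustration only; none of these names come from VOTES.
CONTACTED_GPS = {"gp-017", "gp-042"}  # GPs emailed in step 1
REQUIRED_FIELDS = ("trial_id", "gp_id", "matching_patients")


def sign(document: bytes, key: bytes) -> str:
    """Stand-in for a digital signature: an HMAC-SHA256 tag over the document."""
    return hmac.new(key, document, hashlib.sha256).hexdigest()


def validate_return(document: bytes, tag: str, key: bytes) -> list:
    """Return a list of problems; an empty list means the form is accepted."""
    # Authenticity and integrity: the tag must match the received bytes.
    if not hmac.compare_digest(sign(document, key), tag):
        return ["signature check failed"]
    # The document must at least be well-formed XML.
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return ["document is not well-formed XML"]
    problems = []
    values = {}
    # Every required field must be present and non-empty.
    for name in REQUIRED_FIELDS:
        text = (root.findtext(name) or "").strip()
        if not text:
            problems.append(f"missing field: {name}")
        values[name] = text
    # The sender must be one of the GPs contacted initially.
    if values["gp_id"] and values["gp_id"] not in CONTACTED_GPS:
        problems.append("sender was not contacted for this trial")
    # A simple constraint check on one data field.
    if values["matching_patients"] and not values["matching_patients"].isdigit():
        problems.append("matching_patients must be a non-negative integer")
    return problems
```

A production service would replace the HMAC stand-in with certificate-based XML signature verification and the field loop with validation against the trial's schema.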
This infrastructure will be expanded to include external peer sites of two classes:

• Routine repositories such as those held by general practices, hospitals, disease-specific registries, device registries or the Office for National Statistics (ONS).
• Study repositories such as research systems developed for a particular trial or observational study.

These external peers will supply their own security policies, and may be intermittently connected to the Transfer Grid. As such, interfacing with routine repositories will be a highly involved and politically sensitive process. This motivates the need for the initial solution to be scalable.

[Figure 2 diagram labels: Clinical Virtual Organisation Framework, used to realise CVO-1 (e.g. for data collection) and CVO-2 (e.g. for recruitment); Transfer Grid; LeiNott; GLA; OX; IMP; Disease registries; Hospital databases; GPs; Clinical trial data sets]

Figure 2: CVO Framework, Transfer Grid and Key Sources of Data



4.1. Current Software Architecture

The basic architecture of this Grid framework, which supports federated queries in a user oriented but secure manner, is depicted in Figure 3. This infrastructure corresponds to one node of the Transfer Grid outlined above and is hosted on a trial test bed at the National e-Science Centre (NeSC) at the University of Glasgow.

[Figure 3 diagram labels: Portal; Grid Server; Data Server; Globus Container; OGSA-DAI Service; Oxford; Glasgow; Driving DB; SCI Store 1 (SQL Server); SCI Store 2 (SQL Server); Consent DB (Oracle 10g); RCB Test Trials DB (SQL Server)]

Figure 3: Software architecture schematic. The “Oxford” box indicates how other institutions will be added to the current design; the current implementation only incorporates the test databases running in Glasgow.

A GridSphere [29] portal front-end communicates with a Globus Toolkit [30] (v4.0) grid service, which in turn provides access to an OGSA-DAI [28] data service. This runs queries against the “driving database” using standard Simple Object Access Protocol (SOAP) message-passing, and in turn runs queries against the subsidiary databases in the pool for which it is responsible, using direct Java Database Connectivity (JDBC) connections.

The technology used in this implementation places strong emphasis on the use of grid services, essentially web services with the additional notion of permanent state. Within the Grid community this paradigm has largely been seen as the most effective solution for implementing transient and dynamic virtual organisations. An example of this is the Web Services Resource Framework (WS-RF) [31] as implemented in version 4.0 of the Globus Toolkit. Access control is integrated within this framework by means of the Security Assertion Markup Language (SAML), which allows a standard exchange of security assertions and attributes. A popular implementation of this standard has been the OpenSAML project [32], which currently follows the latest release of SAML, v1.1, and is developing an implementation of v2.0 [33].

The user accesses this infrastructure through a GridSphere portal at [2]. With the appropriate privileges, users can currently bring back data from the database back-ends implemented in multiple test repositories of SCI Store and GPASS. Unprivileged users can retrieve limited data sets, with the identifying patient data anonymised and other restrictions applied. Through this application, the end user is able to seamlessly access a set of resources pertinent to clinical trials in a dynamic, secure and pervasive fashion. Depending on the user’s privileges, the results returned have varying degrees of verbosity, thereby allowing limited statistical analysis without compromising the privacy restrictions necessarily applied to such sensitive data.
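The privilege-dependent federated query described above can be sketched as follows. This is an illustration only: in-memory SQLite databases stand in for the SQL Server and Oracle test repositories, and the table schema, source names, the `restricted` parameter and the sample values are invented for the example.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory stand-in for one clinical test database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE patients (chi TEXT, name TEXT, cholesterol REAL)")
    db.executemany("INSERT INTO patients VALUES (?, ?, ?)", rows)
    return db


def federated_query(sources, privileged, min_cholesterol, restricted=frozenset()):
    """Run the same sub-query on every permitted source and merge the results.

    `sources` maps a source name to a database connection. Sources named in
    `restricted` are skipped for unprivileged users, whose results also have
    the identifying columns blanked out.
    """
    merged = []
    for source_name in sorted(sources):
        if not privileged and source_name in restricted:
            continue  # fewer databases are queried without privileges
        cursor = sources[source_name].execute(
            "SELECT chi, name, cholesterol FROM patients "
            "WHERE cholesterol >= ? ORDER BY chi",
            (min_cholesterol,),
        )
        for chi, name, cholesterol in cursor:
            if not privileged:
                chi, name = None, None  # keep only statistically relevant data
            merged.append((source_name, chi, name, cholesterol))
    return merged
```

The same call therefore yields identifying rows from every source for a privileged user, and blanked rows from a reduced set of sources otherwise.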



In the current version of the system to explore the problem space and gain familiarity with the clinical data sets used across Scotland, several “canned queries” representing valid clinical trial queries can be run which seamlessly access and use distributed back-end test databases as depicted in Figure 4.

Figure 4: Screen-shot of VOTES portal welcome screen (left) showing several “canned queries” with the type of result returned based on whether the user is privileged or not (right).

Users with insufficient privileges may still be able to run queries but may not be able to see all of the associated identifying data sets (see Figure 5). It is important to note that all of this is completely transparent to the end users of the system.

Figure 5: Results from an unprivileged user running a canned query. Identifying data is blanked out whilst statistically relevant data is available. Also the number of databases across which the query has been run is reduced because of lack of privileges.

Another key aspect of this infrastructure is how patient consent is handled. Currently the system supports a variety of models, which allow exploration of the potential solution space for patient consent across Scotland. For example, solutions have been prototyped which allow patients to consent to their data being used for a specific clinical trial, for a particular disease area, or generally. In addition, the system allows patients to opt out, i.e. their data sets may not be used for any purpose. Numerous variations on this are also being explored, e.g. the patients’ data may only be used provided they are contacted in advance. To support this, a consent database has been established and is used when the results of the federated queries are joined to decide whether the data should be displayed, displayed but anonymised, or not displayed at all.

The NeSC at Glasgow has extensive experience of a range of fine grained authorisation infrastructures across a range of application domains [34-36]. Whilst we expect to move the existing prototype to a more robust authorisation solution, for rapid prototyping purposes (to explore the problem space and get user feedback as early as possible) we have developed an authorisation infrastructure based on an access matrix, as shown in Figure 6.
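The consent lookup described above, which decides between displaying, anonymising and hiding a record, can be sketched as follows. The mapping from consent scope to outcome is an assumption made for illustration (the text does not specify it), as are the names `Consent`, `Display` and `decide`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Display(Enum):
    FULL = "displayed"
    ANONYMISED = "displayed but anonymised"
    HIDDEN = "not displayed"


@dataclass(frozen=True)
class Consent:
    scope: str                    # "trial", "disease", "general" or "opt_out"
    target: Optional[str] = None  # trial id or disease area, where relevant


def decide(consent: Optional[Consent], trial_id: str, disease_area: str) -> Display:
    """Illustrative policy only: the real rules are a matter for policy makers."""
    if consent is None:
        return Display.ANONYMISED  # assumption: no record means anonymised use
    if consent.scope == "opt_out":
        return Display.HIDDEN
    if consent.scope == "general":
        return Display.FULL
    if consent.scope == "trial" and consent.target == trial_id:
        return Display.FULL
    if consent.scope == "disease" and consent.target == disease_area:
        return Display.FULL
    # Consent exists but does not cover this trial or disease area.
    return Display.ANONYMISED
```

Keeping the policy in one small function makes it straightforward to swap in alternative consent models, such as opt-in by default, and compare their effect on the same query results.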

U2(R1 ∆ h2) = 0
U3(R3 ∆ h1) = 1
U4(R2 ∆ R3 ∆ h4) = 0
U1(R1 ∆ h3) = 1

where ∆ is a combination function, 0 and 1 are bit-wise privileges, Rx and hx are resources and Ux is a subject.

Figure 6: Access Matrix Model
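One possible reading of Figure 6 can be sketched as follows. The interpretation of ∆ as a bit-wise AND over the privileges a subject holds on each combined resource is an assumption, as are the privilege bits and the matrix entries, which were chosen only so that the four results in the figure are reproduced.

```python
READ = 0b01      # hypothetical privilege bit: may see the data
IDENTIFY = 0b10  # hypothetical privilege bit: may see identifying fields

# (subject, resource) -> privilege bits; missing entries mean no privilege.
MATRIX = {
    ("U1", "R1"): READ | IDENTIFY, ("U1", "h3"): READ,
    ("U2", "R1"): READ,            ("U2", "h2"): 0,
    ("U3", "R3"): READ,            ("U3", "h1"): READ | IDENTIFY,
    ("U4", "R2"): READ,            ("U4", "R3"): READ, ("U4", "h4"): 0,
}


def combine(subject, resources):
    """Assumed combination function: AND of the privileges on every resource."""
    granted = READ | IDENTIFY
    for resource in resources:
        granted &= MATRIX.get((subject, resource), 0)
    return granted


def access(subject, resources, needed=READ):
    """Return 1 if the subject holds `needed` on all resources, otherwise 0."""
    return 1 if combine(subject, resources) & needed == needed else 0
```

With these entries, `access("U1", ["R1", "h3"])` evaluates to 1 and `access("U2", ["R1", "h2"])` to 0, matching the figure.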

The authorisation mechanism implements an access matrix model [37] that specifies bit-wise privileges of users and their associations to data objects in the CVO. The access matrix is designed to enforce discretionary and role based access control policies and has been constructed to scale in parallel with the growth of the infrastructure as a whole. This approach will be compared with Role Based Access Control solutions such as PERMIS, where user views of data sets will be mapped to CVO roles.

The federated data system [38] is currently composed of four autonomous test sites, each providing a clinical data source using either SQL Server [39] or Oracle [40]. The data sources exposed by these sites are configured as data resources on an OGSA-DAI data service. The OGSA-DAI data service implements a head node model to drive the data federation. The head node is selected based on rules or request requirements and is responsible for decomposing queries, distributing sub-queries, and gathering and joining query results.

In the current implementation, data federation security is achieved at both local and remote level. The local level security, managed by each test site, filters and validates requests based on local policies at Database Management System (DBMS) level. The remote level security is achieved by the exchange of access tokens between the designated Source of Authority (SOA) of each site. These access tokens are used to establish remote database connections between the sites in the federation. In principle, local sites authorise their users based on delegated remote policies. This is along the lines of the CAS model [20].

5. Conclusions and Future work

The VOTES prototype software is very much a work in progress. Yet the experiences in developing this prototype are helping to gain a better understanding of the clinical domain problem space and are shaping the planned Grid framework. The vision of a Grid framework eventually supporting a myriad of clinical trials and epidemiological studies is a compelling one, but can only be achieved once experience has been gained in accessing and using a wide variety of clinical data sets.

In achieving this, it is immediately apparent that there are a number of political and ethical issues that must be addressed when dealing with data-sharing between domains, and these are inherently more difficult to deal with than the technological challenges. Whilst the NHS in Scotland and the UK more widely are taking steps to standardise the data sets that they have, these are still far from being fully implemented (and accepted) by clinical practitioners. For instance, the unique index reference number, the Community Health Index (CHI), has only been implemented across some regions of Scotland and therefore leaves certain areas with incomplete references. Those records that do not have the CHI number are referenced using a different Patient Identification (PID) number that will be idiosyncratic to the region in question.

There is also a need to build up a trust relationship with the end-user institutions that we are working with to provide this clinical infrastructure. This necessarily takes time and will be furthered by engaging in an exchange program where employees from NeSC work with and understand the processes in the NHS IT departments and vice-versa.

The current Grid infrastructure described here has allowed the investigation of automatically implementing combinations of patient consent policies. Ideally such a consent register would be maintained nationally; this does not exist yet, but is planned as part of the electronic patient record under discussion across the NHS in Scotland. Demonstrations of working solutions showing the trade-offs in consent or assent with opt in versus opt out possibilities allow policy makers to see at first hand what impact their ultimate decisions might have.
We believe that it is easier to convince policy makers when they see actual working solutions rather than theoretical discussions of what might be achieved once the infrastructures are in place. The applications in this project are being developed with a view to being rolled out to the NHS Scotland in the first instance, moving from test data to “live” data with fully audited and standards-compliant security, upon establishment of reliability and production value. The eventual vision is that this infrastructure will one day be available on a global scale allowing health information to be exchanged across heterogeneous domains in a seamless, robust and secure manner. In this regard, we are currently exploring international collaborative possibilities with the caBIG project in the US [41] and closer to home in genetics and healthcare projects across Scotland [42].

6. References

[1] National Program for IT in the NHS (NPFIT), http://www.connectingforhealth.nhs.uk
[2] Virtual Organisations for Trials and Epidemiological Studies (VOTES), http://www.nesc.ac.uk/hub/projects/votes/
[3] West Of Scotland Coronary Prevention Scheme (WOSCOPS), http://www.gla.ac.uk/departments/pathologicalbiochemistry/lipids/woscops.html
[4] UK BioBank project, http://www.ukbiobank.ac.uk
[5] R.O. Sinnott, Grid Security: Middleware, Practices and Outlook, prepared for the Joint Information Services Council (JISC), www.nesc.ac.uk/hub/projects/GridSecurityReport
[6] General Register Office for Scotland, http://www.gro-scotland.gov.uk/
[7] R.O. Sinnott, Development of Usable Grid Services for the Biomedical Community, Workshop on Designing for Usability in e-Science, Edinburgh, January 2006, http://www.nesc.ac.uk/action/esi/contribution.cfm?Title=613
[8] General Practitioners Administration System for Scotland (GPASS), http://www.show.scot.nhs.uk/gpass/
[9] Scottish Morbidity Records (SMR), http://www.show.scot.nhs.uk/indicators/SMR/Main.htm
[10] Scottish Care Information (SCI) Store, http://www.show.scot.nhs.uk/sci/products/store/SCIStore_Product_Description.htm
[11] NHS Data Dictionary, www.isdscotland.org
[12] Health-Level 7 (HL7), http://www.hl7.org/
[13] SNOMED-CT, http://www.snomed.org/snomedct/
[14] OpenEHR, http://www.openehr.org/
[15] International Statistical Classification of Disease and Related Health Problems (ICD-10), http://www.connectingforhealth.nhs.uk/clinicalcoding/classifications/icd_10
[16] ICD background, http://www.connectingforhealth.nhs.uk/clinicalcoding/faqs/
[17] R. Housley, T. Polk, Planning for PKI: Best Practices Guide for Deploying Public Key Infrastructures, Wiley Computer Publishing, 2001.
[18] PERMIS, http://sec.isi.salford.ac.uk/permis/
[19] R.O. Sinnott, D.W. Chadwick, Experiences of Using the GGF SAML AuthZ Interface, Proceedings of UK e-Science All Hands Meeting, September 2004, Nottingham, England.
[20] CAS, http://www.globus.org/toolkit/docs/4.0/security/cas/
[21] VOMS, http://hep-project-grid-scg.web.cern.ch/hep-project-grid-scg/voms.html
[22] Akenti, http://dsd.lbl.gov/Akenti/
[23] R.O. Sinnott, A.J. Stell, D.W. Chadwick, O. Otenko, Experiences of Applying Advanced Grid Authorisation Infrastructures, Proceedings of European Grid Conference (EGC), LNCS 3470, pages 265-275, Volume editors: P.M.A. Sloot, A.G. Hoekstra, T. Priol, A. Reinefeld, M. Bubak, June 2005, Amsterdam, Holland.
[24] A.J. Stell, R.O. Sinnott, J. Watt, Comparison of Advanced Authorisation Infrastructures for Grid Computing, Proceedings of International Conference on High Performance Computing Systems and Applications, May 2005, Guelph, Canada.
[25] Shibboleth Project, http://shibboleth.internet2.edu/
[26] R.O. Sinnott, J. Watt, O. Ajayi, J. Jiang, J. Koetsier, A Shibboleth-Protected Privilege Management Infrastructure for e-Science Education, submitted to CLAG+Grid Edu Conference, May 2006, Singapore.
[27] R.O. Sinnott, J. Watt, O. Ajayi, J. Jiang, Shibboleth-based Access to and Usage of Grid Resources, submitted to International Conference on Emerging Trends in Information and Communication Security, Freiburg, Germany, June 2006.
[28] OGSA-DAI, http://www.ogsadai.org.uk
[29] GridSphere, http://www.gridsphere.org
[30] Globus Toolkit, http://www.globus.org/toolkit
[31] Web Services Resource Framework (WS-RF), http://www.globus.org/wsrf
[32] OpenSAML Project, http://www.opensaml.org
[33] OpenSAML Development Wiki, https://authdev.it.ohio-state.edu/twiki/bin/view/Shibboleth/OpenSAML
[34] R.O. Sinnott, M.M. Bayer, J. Koetsier, A.J. Stell, Grid Infrastructures for Secure Access to and Use of Bioinformatics Data: Experiences from the BRIDGES Project, submitted to 1st International Workshop on Bioinformatics and Security (BIOS’06), Vienna, Austria, April 2006.
[35] R.O. Sinnott, M. Bayer, D. Berry, M. Atkinson, M. Ferrier, D. Gilbert, E. Hunt, N. Hanlon, Grid Services Supporting the Usage of Secure Federated, Distributed Biomedical Data, Proceedings of UK e-Science All Hands Meeting, September 2004, Nottingham, England.
[36] R.O. Sinnott, A.J. Stell, J. Watt, Experiences in Teaching Grid Computing to Advanced Level Students, Proceedings of CLAG+Grid Edu Conference, May 2005, Cardiff, Wales.
[37] R.S. Sandhu and P. Samarati, “Access control: Principles and practice,” IEEE Communications Magazine, vol. 32, no. 9, pp. 40-48, 1994.
[38] A.P. Sheth and J.A. Larson, “Federated database systems for managing distributed, heterogeneous, and autonomous databases,” ACM Comput. Surv., vol. 22, no. 3, pp. 183-236, 1990.
[39] SQL Server, http://www.microsoft.com/sql
[40] Oracle, http://www.oracle.com
[41] National Cancer Institute, cancer Biomedical Informatics Grid, https://cabig.nci.nih.gov/
[42] Generation Scotland Scottish Family Health Study, http://www.innogen.ac.uk/Research/The-ScottishFamily-Health-Study



Privacy Protection in HealthGrid: Distributing Encryption Management Over the VO

Erik TORRES a, b, ¹, Carlos DE ALFONSO b, Ignacio BLANQUER b, Vicente HERNÁNDEZ b
a Centro Nacional de Bioinformática, Cuba
b Universidad Politécnica de Valencia – ITACA, Spain

Abstract. Grid technologies have proven to be very successful in tackling challenging problems in which data access and processing is a bottleneck. Notwithstanding the benefits that Grid technologies could bring to health applications, the privacy leakages of current DataGrid technologies, caused by the sharing of data in VOs and the use of remote resources, hinder their widespread adoption. Privacy control for Grid technology has become a key requirement for the adoption of Grids in the healthcare sector. Encrypted storage of confidential data effectively reduces the risk of disclosure. A self-enforcing scheme for encrypted data storage can be achieved by combining Grid security systems with distributed key management and classical cryptography techniques. Virtual Organizations, as the main unit of user management in Grids, can provide a way to organize key sharing, access control lists and secure encryption management. This paper provides programming models and discusses the value, costs and behavior of such a system implemented on top of one of the latest Grid middlewares.²

Keywords. Privacy of medical data, encrypted storage, medical grids, HealthGrid

¹ Corresponding Author: [email protected]
² This work is partially funded by the Spanish Ministry of Science and Technology in the frame of the project Investigación y Desarrollo de Servicios GRID: Aplicación a Modelos Cliente-Servidor, Colaborativos y de Alta Productividad, with reference TIC2003-01318.

1. INTRODUCTION

Grid technologies have proven to be very successful in tackling challenging problems in which data access and processing is a bottleneck. The benefits of Grid-based applications in health are clearly identified [1, 2], since medical applications usually deal with large distributed data which must be considered at a global level (e.g. in epidemiology studies). HealthGrids, as Grids for healthcare, bring new tools, procedures and resources for patient-customized therapy and epidemiological studies, improving clinical decisions and diagnoses for better patient care. However, the development of HealthGrids, regardless of the success of prototypes and trials [DATAGRID, EGEE, GEMSS], is slow, mostly due to the legal constraints on medical data. Security in public networks, and in Grids in particular, carries several risks, since users in a VO normally share data access rights. Moreover, medical users must trust the protection offered by the remote site, where users who gain administrator privileges can directly access the data.

The ability to implement adequate confidentiality and privacy control in HealthGrids is both an ethical issue, affecting patient care, and a matter directly affecting the outcome of medical and clinical research, as discussed in [3, 4]. Securing the privacy of confidential data stored in a Grid element remains unsolved. Storing medical data in encrypted form will considerably reduce the risk of disclosure. A scheme for storing and accessing encrypted data on Grid storage without compromising data sharing has been proposed recently [5]. This work encourages the use of Shamir’s secret sharing scheme for dividing a single key between several key servers. Shamir’s secret sharing scheme is a means for N parties to carry shares or parts of a message, called the secret, such that any subset of k shares determines the secret. This scheme is said to be perfect because no subset of fewer than k shares leaks any information regarding the secret [7].

The present work describes an implementation of such an architecture on gLite, one of the latest middlewares for Grid computing. A model for the distribution of key shares and a model for revoking permissions are proposed as part of the implementation. Both models are completely consistent with existing Grid technologies and Grid security policies. Access control for both encrypted objects and decryption keys is managed in a Virtual Organization Membership Service (VOMS) environment [6]. Furthermore, VOMS enables the adoption of authentication and delegation mechanisms provided by the Grid Security Infrastructure (GSI). Finally, we develop a methodology for the replication of key administration services. This methodology, as well as the synchronization of replicas, depends on the target environment.

Section 2 describes the methods used (encryption and decryption, key shares, permission revocation, key replication and integration with the gLite data management system). Section 3 describes the testbed, the test cases used and the results in terms of the evaluation of security and performance, and Sections 4 and 5 present the conclusions and acknowledgements.
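The (k, N) threshold property of Shamir’s scheme described above can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the choice of prime field and the function names `split` and `recover` are assumptions.

```python
import secrets

# A Mersenne prime comfortably larger than any 256-bit AES key (assumed field).
PRIME = 2**521 - 1


def split(secret: int, k: int, n: int):
    """Split `secret` into n shares, any k of which rebuild it."""
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover(shares):
    """Rebuild the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # Modular inverse of den via Fermat's little theorem.
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret
```

For example, a 128-bit key converted with `int.from_bytes(key, "big")` can be split with `split(value, 3, 5)` and rebuilt from any three of the five shares.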

2. METHODS

2.1. Encryption and Decryption

The data and the locations of the key shares are stored together within each encrypted file. The data is encrypted using the AES cipher provided by the Bouncy Castle Crypto package. This cipher is operated in CBC mode with 128-bit keys (192-bit and 256-bit keys are also available). The encrypted file is signed with a Keyed-Hashing for Message Authentication Code (HMAC) using a SHA1 hash function on the output of the AES-CBC cipher. CBC mode not only hides pattern occurrences in plain data but also makes it possible to complete both the data encryption/decryption and the message authentication in a single reading of the file.

All the AES ciphers applied to a single object use the same unique key, regardless of the cipher operation mode. The encrypted files share a common structure:

• A header containing the key shares locations, the initialization vector for the AES-CBC cipher encrypted with AES and the HMAC key encrypted with AES-CBC.
• Encrypted blocks of data using AES-CBC.
• HMAC signature.
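The header, data and signature layout above, and the single-pass authentication it allows, can be sketched as follows. Several simplifications are assumed: the AES-CBC step is omitted (the ciphertext is treated as opaque bytes, since the Python standard library has no AES), the header is plain JSON, the HMAC key is passed in directly rather than stored encrypted in the header, and the names `pack` and `read_verified` are hypothetical.

```python
import hashlib
import hmac
import io
import json
import struct

TAG_LEN = hashlib.sha1().digest_size  # 20 bytes for HMAC-SHA1


def pack(header: dict, ciphertext: bytes, mac_key: bytes) -> bytes:
    """Lay out: 4-byte header length | header | ciphertext | HMAC-SHA1 tag."""
    head = json.dumps(header, sort_keys=True).encode()
    tag = hmac.new(mac_key, ciphertext, hashlib.sha1).digest()
    return struct.pack(">I", len(head)) + head + ciphertext + tag


def read_verified(stream, mac_key: bytes, chunk_size: int = 4096):
    """Read the file once, authenticating the ciphertext as it streams by."""
    (head_len,) = struct.unpack(">I", stream.read(4))
    header = json.loads(stream.read(head_len))
    mac = hmac.new(mac_key, digestmod=hashlib.sha1)
    body = bytearray()
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = tail + chunk
        # Hold back the last TAG_LEN bytes: they may be the trailing tag.
        part, tail = data[:-TAG_LEN], data[-TAG_LEN:]
        mac.update(part)  # a real reader would also feed `part` to AES-CBC here
        body += part
    if len(tail) != TAG_LEN or not hmac.compare_digest(mac.digest(), tail):
        raise ValueError("HMAC check failed")
    return header, bytes(body)


# Usage with invented VO names and placeholder ciphertext.
blob = pack({"vos": ["vo-a", "vo-b", "vo-c"]}, b"\x01" * 100, b"demo-mac-key")
header, body = read_verified(io.BytesIO(blob), b"demo-mac-key")
```

Computing the tag over the ciphertext rather than the plaintext lets a reader reject a tampered file without ever using decrypted output.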

2.2. Key Shares Distribution

In a non-trusted environment, the privacy of data can be ensured by enforcing a secret sharing scheme that distributes key shares across trusted participants [7]. A natural way to define key sharing in a Grid environment is to use Virtual Organizations (VOs) and Grid information services. The model presented in this article requires the object owner to decide which VOs are trusted for key sharing. This approach ensures that only trusted key servers will be used. The VOs keeping a certain share, as well as all the possible share combinations that rebuild the key, are known during the whole process. Every encrypted object contains a header with the names of the VOs keeping key shares. VOs are responsible both for publishing the locations of key servers using the Grid information services and for maintaining the replica structure. VOs provide the model with flexibility and portability, ensuring a reasonable security level.

An object administrator can distribute key shares over as many trusted VOs as needed, forcing an authorized user to recover k shares. Exposing an encrypted object, regardless of the Grid data server on which it is located, therefore requires access to k completely trusted VOs in order to rebuild the decryption key. In such a scenario, gaining unauthorized access to the system requires compromising the security of at least k key servers held by k completely trusted VOs.

A different AES decryption key is randomly generated for each object and translated into an integer. Key shares are computed using the Shamir secret sharing scheme [7]. Shares are distributed among trusted VOs and stored in a MySQL relational database linked to the VO.

2.3. Revoking Permissions

A reliable and functional permission revocation model can be provided in a Grid environment by taking into account three rules:

• Different keys should be associated with different encrypted objects.
• A copy of the Message Authentication Code (MAC) signature of the encrypted object should be kept both in the key servers holding the key shares for the object and in the encrypted object itself.
• Re-encryption of updated objects should use renewed keys.

Under these conditions, objects will be protected from unauthorized users, even from those who gain physical or administrator access to the storage device. Using a different key for each encrypted object ensures that a user who kept a decryption key after permission revocation cannot expose objects other than those to which he or she had access in the past. The integrity of the objects is ensured by cross-validating the copy of the MAC signature stored within the encrypted object against the copies stored in the key servers. Again, compromising the integrity of an object requires compromising at least k key servers. Changing the key of updated objects prevents the exposure of new versions. Permission revocation is the act of updating the information of a user in the VOMS servers. Each service provider, including key servers, will evaluate new requests using the updated credentials issued by the VOMS servers.

2.4. Key Administration Services Replication

Replication was implemented on the basis of the replication capabilities of the MySQL database server [8]. Key servers are completely equivalent from the clients’ point of view, so requests can be issued to any replica. Clients are required to access the key servers in random order, to improve load balancing. A better way of ensuring load balancing over the replicas of a VO is to temporarily remove overloaded replicas from the information services, forcing new clients to use unloaded replicas. The gLite middleware provides low-level monitoring services as part of the information services. These services include the ability to feed a consumer with information that carries a timestamp. The client application implemented as part of the system consumes data published by the key servers, keeping up to date with producer (key server) events. In this way, an overloaded key server can signal all listening clients to use a different replica. Key servers receive client requests and validate VOMS proxy certificates against local policies. Once a request is validated, further steps depend on the nature of the request.
A master data server performs insertion and update operations, and replicas perform read operations. For read-only requests a replicated data server is chosen at random; all other requests are handled by the master. Only one master data server is operated per VO, so all modifying operations on the databases are performed by this master. Furthermore, the master data server ensures the synchronization of the replicas. This approach improves response times and load balancing when reading is the dominant operation, which is the usual scenario in most health applications.

2.5. Integration with the gLite Data Management System

In the gLite middleware, data functionality is provided by a set of interoperable services. The end-user application only needs to use the gLite Input/Output API in order to access its data. Grid security services are used indirectly by this API.

The client application implemented as part of the system uses the gLite I/O client libraries to read and write encrypted objects. This client application handles a set of proxy certificates that confirm that the user is authorized by a trusted authority to access encrypted objects and decryption keys. Proxy certificates must be negotiated with the VOMS servers beforehand.
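The read/write split described in Section 2.4 (reads served by a randomly chosen replica, every modifying operation sent to the single master of the VO) can be sketched as a small routing function. The request names and server labels below are hypothetical; the sketch shows the routing rule only, not the MySQL replication that keeps the replicas synchronized.

```python
import random

# Hypothetical names of read-only requests; everything else modifies data.
READ_ONLY_REQUESTS = {"get_key_share", "get_mac"}


def route(request, master, replicas, rng=random):
    """Return the data server that should handle `request`."""
    if request in READ_ONLY_REQUESTS and replicas:
        # Read-only requests go to a randomly chosen replica.
        return rng.choice(replicas)
    # Insertions and updates (and reads when no replica exists) go to the master.
    return master


server_for_write = route("store_key_share", "master-db", ["replica-1", "replica-2"])
server_for_read = route("get_key_share", "master-db", ["replica-1", "replica-2"])
```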

3. RESULTS AND DISCUSSION

3.1. Testbed

Six desktop PC workstations were used to set up a gLite testbed shared by three different VOs. The clients on which the tests were performed were also commodity desktop PC workstations. Clients and servers were connected to a dedicated Fast Ethernet network. Key servers were implemented as Web Services deployed with gLite. The gLite services and modules used were the following:
• gLite Security Utilities.
• gLite R-GMA Server.
• gLite VOMS Server.
• gLite Data Single Catalog (for MySQL).
• gLite I/O Server.
• gLite R-GMA Client (Java API).
• gLite I/O Client.
• gLite UI.

3.2. Test Cases

Twenty-four medical images were used in the tests. Sample data was collected from public image repositories. Images were selected to be representative in terms of image acquisition technique, resolution and storage format.

Table 1. Description of the Sample Dataset

Source | Num. of Samples | Max. Sample Size (MB) | Acquisition | Image Format
Public Health Image Library at Centers for Disease Control and Prevention (CDCP) | 8 | 14.0 | X-Ray, SEM | TIFF
DDSM: Digital Database for Screening Mammography | 4 | 26.0 | Mammography | TIFF
Osiris Medical Imaging Software³ | 8 | 0.51 | PET, CT | PNG, GIF, DCM
DICOM Sample Image Datasets Web Site | 4 | 0.51 | MRI | DCM

³ DICOM is the OsiriX original image format. Format conversion was done using XMedCon, an open source medical image conversion utility and library.



3.3. Validation of the Implemented Architecture

A first group of tests was carried out to validate the implemented architecture. A comparison of an image set, comprising all test cases, before and after encryption and decryption with this system showed that the resulting images are equivalent to the originals. No data corruption was observed during the cryptographic processes, and keys were successfully retrieved.

System response was measured in different authorization scenarios. Expired proxy certificates, invalid certificates, certificates signed by non-trusted Certificate Authorities and valid certificates issued by a VO different from the key server VO were all rejected by the system. Valid certificates must additionally satisfy the local access policies.

3.4. Evaluation of the Security Levels and Overhead

A second group of tests was used to evaluate the security levels introduced by the models, as well as the overhead. The capability of the system to respond to potential security violations was evaluated.
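The key shares held by the key servers follow a (k, N) threshold scheme of the kind discussed in this section. As a rough illustration of how such a scheme works, the sketch below implements textbook Shamir secret sharing over a prime field. It is not the implementation evaluated here: the function names, the choice of field prime and the representation of a key as an integer are all assumptions made for the example.

```python
import secrets

# Assumed field modulus: the Mersenne prime 2**127 - 1. Secrets must be smaller.
PRIME = 2**127 - 1


def split_secret(secret, k, n, prime=PRIME):
    """Split `secret` into n shares so that any k of them recover it."""
    if not (1 <= k <= n) or not (0 <= secret < prime):
        raise ValueError("need 1 <= k <= n and 0 <= secret < prime")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(prime) for _ in range(k - 1)]

    def evaluate(x):
        y = 0
        for c in reversed(coeffs):  # Horner's rule
            y = (y * x + c) % prime
        return y

    # Share i is the point (i, f(i)); x = 0 is never handed out.
    return [(x, evaluate(x)) for x in range(1, n + 1)]


def recover_secret(shares, prime=PRIME):
    """Recover f(0) by Lagrange interpolation over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % prime
                den = (den * (xi - xj)) % prime
        secret = (secret + yi * num * pow(den, -1, prime)) % prime
    return secret


shares = split_secret(123456789, k=3, n=5)
recovered = recover_secret(shares[:3])
```

Any k shares determine the polynomial and hence the secret, while fewer than k shares leave every candidate secret equally plausible; this is why an attacker has to compromise at least k key servers.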
Shamir's secret sharing scheme is a threshold scheme in which the secret (the decryption keys, in our case) is divided into N shares, but just k shares, with k

B2G is a bioinformatics tool for Gene Ontology-based DNA or protein sequence annotation and function-based data mining (a detailed description of the tool is given in the next section). The application has successfully been used in many functional genomics projects for these two core functionalities. Typical B2G users are middle-sized genomics labs carrying out sequencing, EST and microarray projects, handling datasets of up to several thousand sequences. In the current version of B2G, the power and analytical potential of both annotation and function data mining is somewhat limited by the computational power behind each particular installation.

Annotation has its major bottleneck in the first analysis step, which implies the search of large DNA or protein databases for sequence homologues with the Basic Local Alignment Search Tool (BLAST) algorithm []. This is the most widespread, but not the only, method for finding functional information for uncharacterized sequences. Most B2G users employ the default application options, which give remote access to NCBI. In some laboratories with more bioinformatics experience, local Blast installations are set up where data sources are queried locally. In any case, blasting will take from seconds to a few minutes per sequence to complete, which means that when thousands of sequences are involved this initial process can last up to several days.

On the other hand, the statistical data mining methods currently offered by the application are restricted to descriptive and univariate functions. This part of the tool could greatly benefit from other powerful statistical approaches of increasing interest in functional genomics analysis, such as those based on machine learning or those involving stochastic searches in the multivariate space. However, their incorporation into a highly interactive tool such as B2G appears not very appealing, as they are highly CPU-intensive and time-consuming methodologies.

The sequence-based structure of the datasets used by B2G and by many other functional genomics analysis tools suggests that process parallelisation could be a suitable solution to the problem of long computing times and heavy computational tasks. Substantially faster data processing would not only result in an improvement of the performance of the tool, but could also open the door to experimentation within the analytical routines.

In this paper we present our approach for incorporating Grid technology into a functional genomics tool such as B2G. Prototype development will be done for the particular problem of speeding up the Blast searches to achieve fast results for large datasets. A future follow-up area of Grid technology application will be the integration of advanced and computing-intensive statistical algorithms. The connectivity to a Grid environment will ease the future migration to the Grid of heavyweight computing processes.


G. Aparicio et al. / Blast2GO Goes Grid: Developing a Grid-Enabled Prototype

STATE OF THE ART

This section firstly describes the B2G application that will transparently connect users to the Grid, based on the software architecture presented in this paper. Next, a review of the main issues related to other Grid projects addressing bioinformatics and their current limitations will be given. The software architecture proposed here aims at profiting from the experience gained from those projects and proposes new functionalities not covered by them.

Blast2GO

Blast2GO is a Java application conceived and designed with the aim of providing the Functional Genomics Research community with an easy-to-run tool for sequence annotation and gene function-based data mining. The application relies on five major interactive analysis processes that together provide these two main functionalities.

Usually B2G users start their analysis by running a BLAST search. The Basic Local Alignment Search Tool is the universal algorithm for the querying of protein and DNA databases for sequence similarities. A group of selected sequences is blasted against either public or custom databases to obtain so-called homologues, similar sequences that derive from a common ancestor and putatively share a common function. This first step can be modulated by the user through the adjustment of various parameters of the algorithm, such as the minimum similarity between sequences, the number of returned hits or the database to be searched. B2G supports different ways of running Blast searches. NCBI services can be accessed through http to obtain results against public datasets without any maintenance work. Alternatively, a local Blast installation can be accessed through B2G, running on custom sequence databases, with the restriction of low performance and maintenance efforts like database updates. Typical uses of Blast2GO include datasets of several thousand sequences. The application launches Blast searches sequentially and results are processed as they are retrieved by the parsing module. The time necessary to obtain a Blast result is sequence and database dependent and can vary between seconds and a few minutes, which turns blasting into the rate-limiting step of the analysis process.

The second step is the retrieval of already known biological functions and other annotations for the found homologues. Here B2G makes use of the Gene Ontology (GO), a structured vocabulary of biological terms. GO can be considered the de facto standard for functional sequence and genome annotation, found in most sequence databases exploited by the scientific community. During the third, or annotation, step the collected GO information is evaluated by a user-adjustable rule which finally assigns GO terms to the query sequences.

Once GO annotations are generated for the query sequences, the second functionality of B2G becomes active. Single or group annotations can be displayed in graph form, reconstructing the GO term relationships inherited from the ontological structure and colour-highlighting the most relevant terms to ease biological interpretation. Statistical analyses are additionally offered by the tool to identify, e.g., differences in GO term distribution between subsets of sequences. This is a key analysis in functional genomics experiments, where the relative importance of individual biological processes within the obtained results needs to be evaluated. At each of the above steps, different charts are available to evaluate the progress of the analysis, and data can be saved and exported in different formats.



Since its publication in September in Bioinformatics, B2G has been used in numerous functional genomics projects. The availability of a feedback form at the application's website and a users' group has allowed us to track the use of the tool and collect opinions and suggestions from the user community. Our general impression is that researchers are pleased with the user-friendly setup, but since large datasets are normally used, waiting times between analysis steps can be long, especially during the Blast search. Improvements made in this direction would strongly increase the interactivity and dynamics in the usage of the software. The figure below gives an overview of B2G and highlights points that could benefit from high-throughput computing techniques.

Figure. B2G overview. The figure shows schematically the architecture of B2G. Used symbols are described in the embedded legend. The figure has to be interpreted from left to right and represents a typical run of the application. Darker boxes highlight planned Grid modules.

High Performance and Grid Computing on Genetics

Many projects are currently working on advanced computing applied to genomics research and data mining. These projects often use Grid and parallel computing, although it is unusual to find both approaches combined. This section includes a brief review of some of the most relevant ones.

The National Centre for Biotechnology Information (NCBI) provides one of the most widely used repositories of Blast tools. NCBI offers via its website binaries for local Blast installations on different platforms, as well as the extensively used QBlast, a web interface to execute remote Blast searches against the NCBI cluster. In the case of B2G, both approaches lack the desired processing speed. Remote QBlast usage is limited since it is a worldwide shared resource, and local installations are bound to the limitations of the site.

Another solution, proposed by MPIBLAST [], is to parallelize Blast, provided that an institution has sufficient resources. MPIBLAST is a freely available open source parallelization of the accepted NCBI Blast. MPIBLAST segments the Blast database and distributes it along cluster nodes, enabling Blast queries to be processed simultaneously on many nodes. Partial results are fused after each run, recalculating statistical values and generating NCBI Blast-like output. MPIBLAST is based on MPI and runs under Linux, Windows and several flavours of UNIX. MPIBLAST reduces the computing time for an individual search and forms part of the solution proposed here as the basic computing kernel. However, for the sake of scalability, higher-level approaches must be considered if the requirement for computing nodes exceeds the bounds of a single institution.

When computing needs go beyond the capabilities of the institution, the use of Grid computing has proved to be a successful approach. There are global solutions such as myGrid [], a UK e-Science project funded by the EPSRC involving five UK universities, the European Bioinformatics Institute (EBI) and many industrial collaborators. The myGrid project puts the emphasis on the Information Grid and is building high-level services for data and application resource integration, such as resource discovery, workflow enhancement and distributed query processing.

Additionally, there are several Grid-enabled versions of Blast, such as the one developed by a collaboration of the NCSA and the University of Illinois [], the GEB developed at INRIA [] and other similar developments. The architecture of those systems is quite similar: they split both the input throughout a grid and the databases, as in the MPIBLAST approach. However, those approaches are either applications by themselves or command-line tools, mainly focused on dealing with intra-organisation resources rather than a large heterogeneous Grid network.

Focused on exploiting resources at large scale in a Grid infrastructure, the WISDOM project [] aims to demonstrate the relevance and the impact of the Grid approach to address drug discovery for neglected diseases. This first biomedical data challenge in the EGEE infrastructure is a scalability step towards a full in silico drug discovery platform. The computing approach of this project has managed to achieve the productivity of tens of CPU years in just one month. WISDOM, however, lacks an easy-to-use interface and is not a general-purpose tool.

Concerning usability, there is a clear need to ease the interfacing between the Grid and the users. In this context, the Grid Protein Sequence @nalysis [] (GPS@) is an integrated Grid portal devoted to molecular bioinformatics. GPS@ uses a Grid computing infrastructure provided by the EGEE European project to find a viable solution to distribute data, algorithms, computing and storage resources for genomic research. The current version is under development and deployed on a High Energy Physics Grid middleware, the LHC Computing Grid [] (LCG). GPS@ is a migration of the Network Protein Sequence Analysis (NPSA) services onto the EGEE grid. NPSA is a production web portal hosting protein databases and algorithms for sequence analysis. Although we can use GPS@ to search protein sequence similarities by using Blast via the Grid, it is not possible to modify algorithm parameters, nor to easily submit large blocks of data. Furthermore, web interfacing does not fit the needs of a development environment.

Therefore, our approach is to deal with a large heterogeneous network of resources (as in the WISDOM project), splitting the work in a similar approach to GRIDBLAST and using MPIBLAST for the parallelization of single searches. The interface is provided by the Web Services Resource Framework [] (WSRF), based on standard Web Services protocols, to ease the integration into B2G.

ARCHITECTURE

The software architecture proposed here aims at bridging biomedical data mining applications and Grid technologies to solve problems that require large amounts of resources, both computing and storage. This section describes the different layers in which it is structured and the components integrated into them. This architecture is based on other Grid-interfacing work []. Special attention will be paid to the components needed for the integration of MPIBLAST, which will be the processing engine executing large sets of sequences from the B2G interface, received through the implemented Grid Service.

The architecture consists of four layers. Two lower layers directly interact with the Grid, and two higher layers provide an abstract and uniform interface to the B2G application. The layers proposed (from higher to lower) are: a) Application Layer, b) Components Middleware Layer, c) Gate-to-Grid Layer, d) Grid Layer. The figure below shows a schema of the different layers and their interactions.

Figure. General view of the architecture and the MPIBLAST components. (Diagram labels: Application Layer with User, Application Blast2GO and Other Applications; Components Middleware Layer with C_GRID_MPIBlast and Other Component, each with a proxy; Gate-to-Grid Layer with WSRF MPIBlast and Other WSRF inside a Web Services Resource Framework (WSRF) Container, reached over HTTPS with certificates; Grid Layer with User Interface LCG.)

The next sections describe the different layers in detail, indicating the defined requirements such as protocols, structures, data and interface definitions.

Application Layer

This layer comprises the execution modules that provide user-interface applications with the means to access the resources, and the user-interface applications themselves. The architecture proposed in this work would be compatible with different user-interface applications. The components to be developed in this layer have the objective of abstracting the access to the resources of the Grid, both software and hardware. Generally, applications are not supposed to be deeply changed when connected to the Grid. In the case of B2G, the application performs one Blast request per sequence, which would be inadequate in a Grid environment considering the latencies and the Grid workload. When the Grid environment is used for blasting, the application layer in B2G will compile requests and build high-level jobs of the adequate granularity to achieve the desired efficiency.

The component analyses the number of sequences to be processed and the number of available compatible resources. Considering those facts, it collects individual requests up to a reasonable packet size and distributes them along the different MPIBLAST resources. Results are collected progressively and are directly returned to the application to keep the user informed of the analysis progress.
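The request compilation described above can be pictured with a small batching routine: individual per-sequence requests are grouped into packets whose size depends on the number of sequences and on the number of available resources. This is a sketch under assumptions, not the B2G component itself; the function name, the cap of 50 sequences per packet and the even-split heuristic are illustrative choices that the text does not specify.

```python
import math


def build_packets(sequence_ids, n_resources, max_packet_size=50):
    """Group per-sequence Blast requests into packets, one Grid job per packet."""
    if n_resources < 1 or max_packet_size < 1:
        raise ValueError("n_resources and max_packet_size must be positive")
    if not sequence_ids:
        return []
    # Aim for roughly one packet per resource, but never exceed the cap so
    # that partial results keep flowing back to the user.
    size = min(max_packet_size, math.ceil(len(sequence_ids) / n_resources))
    return [sequence_ids[i:i + size] for i in range(0, len(sequence_ids), size)]


# 230 hypothetical sequence identifiers spread over 4 resources.
ids = [f"seq{i}" for i in range(230)]
packets = build_packets(ids, n_resources=4)
```

Capping the packet size trades some per-job overhead for more frequent partial results, which is what keeps the interactive application responsive.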



Components Middleware Layer

The components of this layer have two interfaces: one for interacting with the Application Layer and another for interacting with the resources defined in the Gate-to-Grid Layer. The application interface is implemented by object-oriented components. The interface to the resources uses proxies that enable the components to interact with the resources on behalf of the user. This interface is implemented through WSRF Grid Services. The data exchanged with the Grid Services are coded using the eXtensible Markup Language [] (XML). The communication protocol is the Simple Object Access Protocol [] (SOAP) over HTTPS, to preserve security on the connections.

For the interaction with MPIBLAST, a component called C_GRID_MPIBLAST is defined, which will be in charge of requesting the processing, retrieving the results and consulting the states in real time. This component provides the application with a class for generating the objects that are used directly by the B2G application. Furthermore, this component interacts with the Grid Service WSRF_MPIBLAST for the execution of the parallel tasks, using its corresponding proxy.

The main methods defined in the component C_GRID_MPIBLAST for interacting with the B2G application are the following:

iStartBLAST: Sends an execution request with a set of input sequences using different databases.
    Input: HashTable of input sequences.
    Input: Arguments (database names, Blast hits, etc.).
    Return: The status of the execution.
iStopBLAST: Stops an execution started by iStartBLAST and cancels all associated jobs.
    Return: Successful stop or error.
iGetStatus: Returns the number of sequences processed and the available results.
    Return: Number of results finished and retrieved.
vGetFinishedResults: Returns a HashTable with the results finished but not yet retrieved.
    Return: HashTable with the results.

Gate-to-Grid Layer

This layer is also divided into two interfaces. The first interface interacts with the components of the Components Middleware Layer using proxies and the interface offered by the Grid Service. This interface is defined using the Web Service Definition Language [] (WSDL). The second interface interacts directly with the resources (clusters of computers, databases, workload managers, etc.).

Concerning the MPIBLAST component, the selected solution here is the usage of the EGEE infrastructure. This infrastructure currently connects a large number of computers. The application considers the EGEE deployment as a resource, which will be implemented as a WSRF Grid Service offering an interface composed of the following methods:

iInitSession: Session initialisation in the EGEE Grid environment.
    Input: User identifier.
    Input: Password of the user.
    Return: The session identifier.
iLaunchBLAST: Submits a job in different tasks in the EGEE environment.
    Input: Identifier of the session.
    Input: Input sequences in an XML document.
    Input: Data parameters (database names, Blast hits, etc.) in XML.
    Return: The status of the submission.
iGetStatus: Returns the number of sequences processed and the available results.
    Return: Number of results finished and obtained.
xmlGetFinishedResults: Returns an XML document with the results finished and not yet retrieved for a given session.
    Input: Identifier of the session.
    Return: XML document with the result.

The functionality of this resource is to get the input sequences and split them into different jobs. Depending on the computational resources available in the EGEE deployment, each job will have a number of input sequences and will use the MPIBLAST algorithm to process them. This information is provided by the information systems of EGEE, the main items being the number of processing nodes (Worker Nodes) and the number of processors.
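To make the call sequence of the Gate-to-Grid interface concrete, the sketch below mimics the four methods named above with an in-memory stand-in. Only the method names come from the text; the class name, the XML layouts, the tuple returned by `iGetStatus`, the `simulate_progress` helper and all example values are assumptions, and nothing here contacts a Grid or runs Blast.

```python
import itertools
import xml.etree.ElementTree as ET


class FakeGateToGridService:
    """In-memory stand-in for the WSRF Grid Service (hypothetical, no Grid access)."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._sessions = {}

    def iInitSession(self, user_id, password):
        # A real service would authenticate; here a session is simply opened.
        session_id = next(self._next_id)
        self._sessions[session_id] = {"pending": [], "finished": [], "retrieved": 0}
        return session_id

    def iLaunchBLAST(self, session_id, sequences_xml, parameters_xml):
        ET.fromstring(parameters_xml)  # only checks the parameters are well-formed XML
        ids = [node.get("id") for node in ET.fromstring(sequences_xml).iter("sequence")]
        self._sessions[session_id]["pending"].extend(ids)
        return "SUBMITTED"

    def simulate_progress(self, session_id, count):
        # Test helper (not part of the described interface): finish `count` sequences.
        session = self._sessions[session_id]
        done, session["pending"] = session["pending"][:count], session["pending"][count:]
        session["finished"].extend(done)

    def iGetStatus(self, session_id):
        # Assumed return shape: (sequences processed so far, results not yet retrieved).
        session = self._sessions[session_id]
        available = len(session["finished"])
        return session["retrieved"] + available, available

    def xmlGetFinishedResults(self, session_id):
        session = self._sessions[session_id]
        root = ET.Element("results")
        for seq_id in session["finished"]:
            ET.SubElement(root, "result", {"sequence": seq_id})
        session["retrieved"] += len(session["finished"])
        session["finished"] = []
        return ET.tostring(root, encoding="unicode")


service = FakeGateToGridService()
sid = service.iInitSession("alice", "not-a-real-password")
service.iLaunchBLAST(
    sid,
    '<sequences><sequence id="s1"/><sequence id="s2"/></sequences>',
    '<parameters database="nr" hits="20"/>',
)
service.simulate_progress(sid, 1)
```

A client would poll `iGetStatus` and call `xmlGetFinishedResults` whenever new results are available, which is how partial results can be returned progressively to the application.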

Figure. General view of the WSRF Grid Services component in the Gate-to-Grid Layer.



Grid Layer

This layer corresponds to the lowest level of the architecture. It deals in general with the use of the Grid infrastructure. In the case of MPIBLAST, it is necessary to use a Grid middleware that is efficient in dealing with high-productivity processing, since the input sequences in the MPIBLAST process are divided into different jobs that will be executed concurrently on different databases.

The Grid middleware used in this work is LCG, although migration to gLite [] will be performed when available; this will result in a joint distribution of LCG and gLite. Both middlewares offer the "single computer" vision of the Grid through the storage catalogues and workload management services that tackle the problem of selecting the most suitable resource. The LCG Grid middleware comprises the following elements:
• IS-BDII (Information Service, Berkeley Database Information Index): this element provides the information about the Grid resources and their status.
• CA (Certification Authority): the CA signs the certificates of both resources and users.
• CE (Computing Element): it is defined as a queue of Grid jobs. A computing element is a farm of homogeneous computing nodes called Worker Nodes.
• WN (Worker Node): it is a computer in charge of executing jobs.
• SE (Storage Element): an SE is a storage resource in which a task can store data to be used by the computers of the Grid.
• RC, RLS (Replica Catalogue, Replica Location Service): they are the elements that manage the location of the Grid data.
• RB (Resource Broker): the RB performs the load balancing of the jobs in the Grid, deciding in which CEs the jobs will be launched.
• UI (User Interface): this component is the entry point of the users to the Grid and provides a set of commands and APIs that can be used by programs to perform different actions on the Grid.
• The Job Description Language (JDL): it is the way in which jobs are described. A JDL is a text file specifying the executable, the program parameters, the files involved in the processing and other additional requirements.

Security

Although sequence data might not be as privacy-sensitive as medical records, it is not infrequent that nucleic acid or protein sequences have patent potential or are related to projects where confidentiality has to be preserved. For this case, the solution of this work uses secure channels requesting encryption.

Regarding access to the system, the different layers of the defined architecture take different approaches to the implementation of security. Basically, it can be divided into two parts. One part is related to the WSRF environment and the other part deals with the EGEE Grid environment. In both cases, secure protocols are used for the communication.

In relation to the scope of security of WSRFs, the architecture defines a client layer (middleware components) for interacting with the system. For WSRFs, the SOAP protocol is used on top of HTTPS, which is based on the Secure Sockets Layer [] (SSL). The HTTPS protocol guarantees the privacy of the data, and the use of digital certificates guarantees the authentication of the user.

The Grid middleware used provides a native security infrastructure, namely the Grid Security Infrastructure (GSI) []. GSI is also based on SSL. Before accessing any resource of the Grid, a proxy must be created from the client certificate. The certificate is duly signed by the Certificate Authority (CA). Each resource of the Grid establishes a mapping of the Distinguished Name (DN) obtained from the proxy. As a result, each deployed resource in the Grid is certified by a valid CA.

In the described architecture, the Gate-to-Grid layer is the common point between the WSRF and EGEE Grid environments. A mapping system of the WSRF users through the Web user certificate has been implemented in the Gate-to-Grid layer; it associates the Web users with the EGEE Grid users. For each user, a Grid proxy is created from its Grid user certificate.
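The association between Web users and EGEE Grid users described above amounts to a lookup keyed by the Distinguished Name of the Web user certificate. The sketch below shows that idea only; the class, its methods, the whitespace normalisation and the example DNs and account names are hypothetical, and the creation of the Grid proxy itself is outside the scope of the example.

```python
class GridUserMapping:
    """Maps the DN of a Web user certificate to an EGEE Grid user (illustrative)."""

    def __init__(self):
        self._grid_user_by_dn = {}

    @staticmethod
    def _normalise(dn):
        # Assumed normalisation: ignore surrounding whitespace only.
        return dn.strip()

    def register(self, web_dn, grid_user):
        self._grid_user_by_dn[self._normalise(web_dn)] = grid_user

    def grid_user_for(self, web_dn):
        # Unknown Web users are refused rather than mapped to a default account.
        try:
            return self._grid_user_by_dn[self._normalise(web_dn)]
        except KeyError:
            raise PermissionError(f"no Grid user mapped to {web_dn!r}") from None


mapping = GridUserMapping()
# Hypothetical DN and Grid account, for illustration only.
mapping.register("/C=ES/O=Example/CN=Alice", "egee_user_alice")
grid_user = mapping.grid_user_for(" /C=ES/O=Example/CN=Alice ")
```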

Figure. Security schema of the proposed architecture.

EXPECTED RESULTS AND BENEFITS

The most direct expected result from this project will be the creation of the required architecture for launching Blast processes from the Blast2GO application into a Grid system. As such, speeding up the Blast process for thousands of sequences has the clear benefit of increased performance and time gain. Additionally, the availability of a fast-delivering Blast service can have further implications for the way this analysis procedure can be used. Typically, annotation projects run one Blast search at an early stage against one or a few selected databases, and Blast updates are not frequent. However, public databases are constantly and rapidly increasing in number and size through the contributions of particular gene characterizations and massive sequencing projects. The ease of Blast updates that a grid environment could offer would optimize the exploitation of new sequence data, not only for B2G but also for similar genomic applications or experiments that use Blast results.

Another aspect that would be enhanced is multi-DB querying, as Blast searches in different databases could easily be distributed over different CPUs and run in parallel. All these improvements in Blast performance mean that experimentation with Blast parameters would become feasible; e.g., Blast2GO studies on the GO term annotation effectiveness in relation to Blast parameters could easily be carried out.

Moreover, Blast is not the only time-consuming process of the Blast2GO tool; other functions such as mapping, GO term selection, annotation enhancement and GOslim projections are also sequence-wise processes that can become highly CPU-consuming and that could greatly benefit from a Grid architecture. Similarly, other intensive annotation strategies such as Pfam [] and InterPro [] could also be incorporated into the tool.

Last but not least, a very interesting potential benefit of implementing Grid technology in a functional genomics tool such as Blast2GO relates to its functionality for annotation-based data mining of experimental genomics data. Current approaches are mostly based on univariate statistical methods. Other very interesting methodologies in this area are multivariate methods coupled to stochastic searches, which are conceptually susceptible to division. Grid technology would then offer the possibility of realizing this distribution of the search process and therefore of reducing computing time.

Finally, there are also initiatives such as Bioinformatics Grid Application for Life Science (Bioinfogrid) or Supporting and Structuring Healthgrid Activities and Research in Europe [] (SHARE), which are fostering the use of Grids in the life science community and which could play an important role in the promotion of the usage of Grids in biocomputation, opening the doors to more collaborations.

REFERENCES

Conesa A., Götz S., García-Gómez J.M., Terol J., Talón M. and Robles M. Blast2GO: A Universal Tool for Annotation, Visualization and Analysis in Functional Genomics Research. Bioinformatics.
Altschul S.F., Gish W., Miller W., Meyers E.W. and Lipman D.J. Basic Local Alignment Search Tool. Journal of Molecular Biology.
"mpiBLAST: Open-Source Parallel Blast", http://mpiblast.lanl.gov
"MyGrid Project", http://www.mygrid.org.uk
"NCSA Grid-Aware GridBlast System", http://bioinf.ncsa.uiuc.edu/gridblast/index.html
"Grid Enabled Blast (GEB) with Distributed Objects and Components", http://www-sop.inria.fr/oasis/Stages/BioProActiveCaromel.html
"Wide In Silico Docking On Malaria", http://wisdom.healthgrid.org
"Grid Protein Sequence @nalysis: Bioinformatics Web Portal Dedicated to Protein Sequence Analysis on the GRID", http://gpsa.ibcp.fr
"World Wide Web Computing Grid, Distributed Production Environment of Physics Data Processing", http://lcg.web.cern.ch/LCG
"The WS-Resource Framework", http://www.globus.org/wsrf
Ignacio Blanquer, Vicente Hernández, Ferran Mas, Damià Segrelles. "A Framework Based on Web Services and Grid Technologies for Medical Image Registration". International Symposium ISBMDA, Aveiro, Portugal, November. Proceedings.
Allen Wyke R., Watt A. "XML Schema Essentials". Wiley Computer Pub.
"SOAP Version", http://www.w3.org/TR/soap
"Web Services Description Language (WSDL)", W3C Note, March. http://www.w3.org/TR/NOTE-wsdl
"LightWeight Middleware for Grid Computing", http://glite.web.cern.ch/glite
Rolf Oppliger. "Security Technologies for the World Wide Web", Second edition. Computer Security Series, Artech House.
"Grid Security Infrastructure", http://www.globus.org/security/overview.html
"Supporting and Structuring HealthGrid Activities & Research in Europe" (SHARE). Technical Annex, IST.
Bateman A., Birney E., Durbin R., Eddy S.R., Howe K.L. and Sonnhammer E.L. The Pfam Protein Families Database. Nucleic Acids Res.
The InterPro Consortium (R. Apweiler, T.K. Attwood, A. Bairoch, A. Bateman, E. Birney, M. Biswas, P. Bucher, L. Cerutti, F. Corpet, M.D.R. Croning, R. Durbin, L. Falquet, W. Fleischmann, J. Gouzy, H. Hermjakob, N. Hulo, I. Jonassen, D. Kahn, A. Kanapin

Challenges And Opportunities of Healthgrids: Proceedings of Healthgrid 2006 (Studies in Health Technology and Informatics)

Medical And Care 3 (Studies in Health Technology and Informatics) (Studies in Health Technology and Informatics)

Global Health Informatics Education (Studies in Health Technology and Informatics)

Medical and Care Compunetics 2 (Studies in Health Technology and Informatics, Vol. 114) (Studies in Health Technology and Informatics)

Research into Spinal Deformities 5 (Studies in Health Technology and Informatics) (Studies in Health Technology and Informatics)

Advances in Information Technology and Communication in Health - Volume 143 Studies in Health Technology and Informatics

From Grid to Healthgrid (Studies in Health Technology and Informatics, Vol. 112)

Plaque Imaging: Pixel to Molecular Level (Studies in Health Technology and Informatics, Vol. 113) (Studies in Health Technology and Informatics)

Connecting Medical Informatics and Bio-Informatics (Studies in Health Technology and Informatics)

E-Health: Current Situation And Examples Of Implemented & Beneficial E-Health Applications (Studies in Health Technology and Informatics) (Studies in Health Technology and Informatics)

MEDINFO 2007: Proceedings of the 12th World Congress on Health (Medical) Informatics (Studies in Health Technology and Informatics)

International Perspectives in Health Informatics - Volume 164 Studies in Health Technology and Informatics

Wearable eHealth Systems For Personalised Health Management: State Of The Art and Future Challenges (Studies in Health Technology and Informatics)

Pharmacogenetics: Opportunities and Challenges for Health Innovation

Pharmacogenetics: Opportunities and Challenges for Health Innovation

Mobile Technology Consumption: Opportunities and Challenges

Challenges and Opportunities in Agrometeorology

Information Technology in Health Care 2007 - Proceedings of the 3rd International Conference on Information Technology in Health Care: Socio-technical ... Studies in Health Technology and Informatics

eHealth Beyond the Horizon - Get IT There:Proceedings of MIE2008 (Studies in Health Technology and Informatics) (Studies in Health Technology and Informatics)

Regional Health Economies and ICT Services (Studies in Health Technology and Informatics Shti)

Future Public Health: Burden, Challenges and Opportunities in the UK

Analysis, Design and Implementation of Secure and Interoperable Distributed Health Information Systems (Studies in Health Technology and Informatics, 89)

Femtocells: Opportunities and Challenges for Business and Technology

The Conservative Scoliosis Treatment: 1st SOSORT Instructional Course Lectures Book (Studies in Health Technology and Informatics)

Transformation of Healthcare with Information Technologies (Studies in Health Technology and Informatics)

From Genes to Personalized HealthCare: Grid Solutions for the Life Sciences - Proceedings of HealthGrid 2007, Volume 126 Studies in Health Technology and Informatics

China's Rise: Challenges and Opportunities

Biometric Recognition: Challenges and Opportunities

Data Mining: Opportunities and Challenges

Recommend Documents