Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
7017
Yang Xiang Alfredo Cuzzocrea Michael Hobbs Wanlei Zhou (Eds.)
Algorithms and Architectures for Parallel Processing 11th International Conference, ICA3PP 2011 Melbourne, Australia, October 24-26, 2011 Proceedings, Part II
Volume Editors

Yang Xiang
Wanlei Zhou
Deakin University, School of Information Technology
Melbourne Burwood Campus, 221 Burwood Highway
Burwood, VIC 3125, Australia
E-mail: {yang, wanlei}@deakin.edu.au

Alfredo Cuzzocrea
ICAR-CNR and University of Calabria
Via P. Bucci 41 C, 87036 Rende (CS), Italy
E-mail: [email protected]

Michael Hobbs
Deakin University, School of Information Technology
Geelong Waurn Ponds Campus, Pigdons Road
Geelong, VIC 3217, Australia
E-mail: [email protected]

ISSN 0302-9743
e-ISSN 1611-3349
ISBN 978-3-642-24668-5
e-ISBN 978-3-642-24669-2
DOI 10.1007/978-3-642-24669-2
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011937820
CR Subject Classification (1998): F.2, H.4, D.2, I.2, G.2, H.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
© Springer-Verlag Berlin Heidelberg 2011

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
Message from the ADCN 2011 Chairs
We are happy to welcome you to the 2011 International Symposium on Advances of Distributed Computing and Networking (ADCN 2011). ADCN 2011 is held in conjunction with the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), Melbourne, Australia, October 24-26, 2011. ADCN 2011 contains 16 full papers selected from those submitted to the ICA3PP 2011 main track. All the papers were peer reviewed by members of the ICA3PP 2011 Program Committee. The symposium covers a broad range of topics in the field of parallel and distributed computing, such as: cluster, distributed and parallel operating systems and middleware; cloud, grid, and services computing; reliability and fault-tolerant computing; multi-core programming and software tools; distributed scheduling and load balancing; high-performance scientific computing; parallel algorithms; parallel architectures; parallel and distributed databases; parallel I/O systems and storage systems; parallel programming paradigms; performance of parallel and distributed computing systems; resource management and scheduling; tools and environments for parallel and distributed software development; software and hardware reliability, testing, verification and validation; security, privacy, and trusted computing; self-healing, self-protecting and fault-tolerant systems; information security on the Internet; multimedia in parallel computing; parallel computing in bioinformatics; dependability issues in computer networks and communications; dependability issues in distributed and parallel systems; dependability issues in embedded parallel systems; industrial applications; and scientific applications. We thank the authors for submitting their work and the members of the ICA3PP 2011 Program Committee for managing the reviews of the ADCN 2011 symposium papers in such a short time.
We firmly believe that this symposium perfectly complements the topics covered by ICA3PP 2011 and provides additional breadth and depth to the main conference. Finally, we hope you enjoy the symposium and have a fruitful meeting in Melbourne, Australia. August 2011
Wanlei Zhou Alfredo Cuzzocrea Michael Hobbs
Message from the IDCS 2011 Chairs
It is our great pleasure that the accepted papers of the 4th International Workshop on Internet and Distributed Computing Systems (IDCS 2011) are included in the proceedings of the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), held in Melbourne, Australia, during October 24–26, 2011. Following the previous three successful IDCS workshops – IDCS 2008 in Dhaka, Bangladesh; IDCS 2009 on Jeju Island, Korea; and IDCS 2010 in Melbourne, Australia – IDCS 2011 is the fourth in the series, promoting research in diverse fields related to Internet and distributed computing systems. In this workshop, we are interested in presenting innovative papers on emerging technologies related to the Internet and distributed systems that support the effective design and efficient implementation of high-performance computer networks. The areas of interest for this year's event are the following:

– Internet architectures and protocols
– modeling and evaluation of Internet-based systems
– Internet quality of service
– grid, cloud, and P2P computing
– middleware for wireless sensor networks
– security of network-based systems
– network-based applications (VoIP, streaming)
– network management and traffic engineering
– tools and techniques for network measurements
The target audience of this event includes researchers and industry practitioners interested in different aspects of the Internet and distributed systems, with a particular focus on practical experiences with the design and implementation of related technologies, as well as their theoretical perspectives. We received 23 submissions from 7 different countries. Each submission was reviewed by three members of the international Program Committee. After a rigorous review process, we selected 10 papers for inclusion in the workshop program. We plan to invite extended and enhanced versions of top-quality selected papers for submission on a fast-track basis to the Springer Journal of Internet Services and Applications (JISA) and the International Journal of Internet and Distributed Computing Systems (IJIDCS). In addition, selected papers in the information security area will be recommended for publication in the International Journal of Risk and Contingency Management. The organization of IDCS 2011 includes direct or indirect contributions from many individuals, including program chairs, Program Committee members, external reviewers, logistics personnel, and student volunteers. We would like to thank Dr Wen Tao Zhu and Dr Muhammad Khurram Khan for accepting the
IDCS 2011 workshop proposal within ICA3PP. Special thanks go to the ICA3PP general chairs, Andrzej Goscinski and Peter Brezany, as well as the program chairs, Yang Xiang, Alfredo Cuzzocrea, and Michael Hobbs, for their continuous support in making IDCS 2011 a success. Last but not least, we express our gratitude to all authors of the accepted and submitted papers. Their contributions have made these proceedings a scholarly compilation of exciting research outcomes. August 2011
Jemal Abawajy Giancarlo Fortino Ragib Hasan Mustafizur Rahman
IDCS 2011 Organizing Committee
Workshop Chairs
Jemal Abawajy, Deakin University, Australia
Giancarlo Fortino, University of Calabria, Italy
Ragib Hasan, Johns Hopkins University, USA
Mustafizur Rahman, IBM, Australia
Web, Publicity and Logistics Chairs
Al-Sakib Khan Pathan, International Islamic University, Malaysia
Mukaddim Pathan, CSIRO, Australia
International Program Committee
Joaquín García-Alfaro, TELECOM Bretagne, France
Doina Bein, Pennsylvania State University, USA
Rajkumar Buyya, University of Melbourne, Australia
Antonio Coronato, ICAR-CNR, Italy
Mustafa Mat Deris, Universiti Tun Hussein Onn, Malaysia
Zongming Fei, University of Kentucky, USA
S.K. Ghosh, IIT-Kharagpur, India
Victor Govindaswamy, Texas A&M University-Texarkana, USA
Jaehoon Paul Jeong, University of Minnesota, USA
Syed Ishtiaque Ahmed, BUET, Bangladesh
Tarem Ahmed, Brac University, Bangladesh
Mohammad Mehedi Hassan, Kyung Hee University, South Korea
Dimitrios Katsaros, University of Thessaly, Greece
Fahim Kawsar, Bell Labs, Belgium, and Lancaster University, UK
Ram Krishnan, University of Texas at San Antonio, USA
Hae Young Lee, ETRI, South Korea
Ignacio M. Llorente, Universidad Complutense de Madrid, Spain
Carlo Mastroianni, ICAR-CNR, Italy
Jaime Lloret Mauri, Universidad Politécnica de Valencia, Spain
Sudip Misra, IIT-Kharagpur, India
Muhammad Mostafa Monowar, University of Chittagong, Bangladesh
Manzur Murshed, Monash University, Australia
Marco Netto, IBM Research, Brazil
George Pallis, University of Cyprus, Cyprus
Rajiv Ranjan, University of New South Wales, Australia
Thomas Repantis, Akamai Technologies, USA
Riaz Ahmed Shaikh, University of Quebec in Outaouais, Canada
Ramesh Sitaraman, University of Massachusetts, USA
Mostafa Al Masum Shaikh, University of Tokyo, Japan
Paolo Trunfio, University of Calabria, Italy
Christian Vecchiola, University of Melbourne, Australia
Spyros Voulgaris, Vrije Universiteit, The Netherlands
Anwar Walid, Alcatel-Lucent Bell Labs, USA
Lizhe Wang, Indiana University, USA
Bin Xie, InfoBeyond Technology, USA
Norihiko Yoshida, Saitama University, Japan
M2A2 Foreword
It is with great pleasure that we present the proceedings of the Third International Workshop on Multicore and Multithreaded Architectures and Algorithms (M2A2 2011) held in conjunction with the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011) in Melbourne, Australia. Multicore systems are dominating the processor market, and it is expected that the number of cores will continue to increase in most commercial systems, such as high-performance, desktop, or embedded systems. This trend is driven by the need to increase the efficiency of the major system components, that is, the cores, the memory hierarchy, and the interconnection network. For this purpose, the system designer must trade off performance against power consumption, which is a major concern in current microprocessors. Therefore, new architectures or architectural mechanisms addressing this trade-off are required. In this context, load balancing and scheduling can help to improve energy savings. In addition, it remains a challenge to identify and productively program applications for these architectures with a resulting substantial performance improvement. The M2A2 2011 workshop provided a forum for engineers and scientists to address the resulting challenges and to present new ideas, applications, and experience on all aspects of multicore and multithreaded systems. This year, notwithstanding the high quality of the submitted papers, only about 38% of them could be accepted for the workshop. We would like to express our most sincere appreciation to everyone contributing to the success of this workshop. First, we thank the authors of the submitted papers for their efforts in their research work. Then, we thank the TPC members and the reviewers for their invaluable and constructive comments. Finally, we thank our sponsors for their support of this workshop. August 2011
Houcine Hassan Julio Sahuquillo
General Co-chairs
Houcine Hassan, Universidad Politécnica de Valencia, Spain
Julio Sahuquillo, Universidad Politécnica de Valencia, Spain
Steering Committee
Laurence T. Yang, St Francis Xavier University, Canada
Jong Hyuk Park, Seoul National University of Technology, Korea
Program Committee
Hideharu Amano, Keio University, Japan
Hamid R. Arabnia, The University of Georgia, USA
Luca Benini, University of Bologna, Italy
Luis Gomes, Universidade Nova de Lisboa, Portugal
Antonio Gentile, Università di Palermo, Italy
Zonghua Gu, University of Science and Technology, Hong Kong
Rajiv Gupta, University of California, Riverside, USA
Houcine Hassan, Universidad Politécnica de Valencia, Spain
Seongsoo Hong, Seoul National University, Korea
Shih-Hao Hung, National Taiwan University, Taiwan
Eugene John, University of Texas at San Antonio, USA
Seon Wook Kim, Korea University, Korea
Jihong Kim, Seoul National University, Korea
Chang-Gun Lee, Seoul National University, Korea
Sebastian Lopez, Universidad Las Palmas, Spain
Yoshimasa Nakamura, Kyoto University, Japan
Sabri Pllana, University of Vienna, Austria
Julio Sahuquillo, Universidad Politécnica de Valencia, Spain
Zili Shao, The Hong Kong Polytechnic University, Hong Kong
Kenjiro Taura, University of Tokyo, Japan
HardBio 2011 Foreword
It gives us great pleasure to introduce this small collection of papers that were presented at the First International Workshop on Parallel Architectures for Bioinformatics Systems (HardBio 2011), October 23–26, 2011, Melbourne, Australia. Bioinformatics is a research field that focuses on algorithms and statistical techniques that allow efficient interpretation, classification, and understanding of biological datasets. These applications are to the general benefit of mankind. The datasets typically consist of huge numbers of DNA, RNA, or protein sequences. Sequence alignment is used to assemble the datasets for analysis. Comparisons of homologous sequences, gene finding, and prediction of gene expression are the most common techniques used on assembled datasets. However, the analysis of such datasets has many applications throughout all fields of biology. The downside of bioinformatics-related applications is that they require an enormous computational effort to be executed. Therefore, a lot of research effort is being channeled towards the development of special-purpose hardware accelerators and dedicated parallel processors that allow for the efficient execution of this kind of application. The Program Committee received 12 submissions, from which it selected 4 for presentation and publication. Each paper was evaluated by three referees. Technical quality, originality, relevance, and clarity were the primary criteria for selection. We wish to thank all those who submitted manuscripts for consideration. We also wish to thank the members of the Program Committee who reviewed all of the submissions. We hope that many more researchers will submit the results of their work to next year's workshop. August 2011
Nadia Nedjah Luiza de Macedo Mourelle
Program Committee
Felipe Maia Galvão França, Federal University of Rio de Janeiro, Brazil
Nader Bagherzadeh, University of California, Irvine, USA
Leandro dos Santos Coelho, Pontifical Catholic University of Paraná, Brazil
Jurij Silc, Jozef Stefan Institute, Slovenia
Heitor Silvério Lopes, Federal Technological University of Paraná, Brazil
Lech Józwiak, Eindhoven University of Technology, The Netherlands
Zhihua Cui, Taiyuan University of Science and Technology, China
Hamid Sarbazi-Azad, Sharif University of Technology, Iran
Table of Contents – Part II
ADCN 2011 Papers

Lightweight Transactional Arrays for Read-Dominated Workloads . . . . . 1
   Ivo Anjo and João Cachopo

Massively Parallel Identification of Intersection Points for GPGPU Ray Tracing . . . . . 14
   Alexandre Solon Nery, Nadia Nedjah, Felipe M.G. França, and Lech Jozwiak

Cascading Multi-way Bounded Wait Timer Management for Moody and Autonomous Systems . . . . . 24
   Asrar Ul Haque and Javed I. Khan

World-Wide Distributed Multiple Replications in Parallel for Quantitative Sequential Simulation . . . . . 33
   Mofassir Haque, Krzysztof Pawlikowski, Don McNickle, and Gregory Ewing

Comparison of Three Parallel Point-Multiplication Algorithms on Conic Curves . . . . . 43
   Yongnan Li, Limin Xiao, Guangjun Qin, Xiuqiao Li, and Songsong Lei

Extending Synchronization Constructs in OpenMP to Exploit Pipeline Parallelism on Heterogeneous Multi-core . . . . . 54
   Shigang Li, Shucai Yao, Haohu He, Lili Sun, Yi Chen, and Yunfeng Peng

Generic Parallel Genetic Algorithm Framework for Protein Optimisation . . . . . 64
   Lukas Folkman, Wayne Pullan, and Bela Stantic

A Survey on Privacy Problems and Solutions for VANET Based on Network Model . . . . . 74
   Hun-Jung Lim and Tai-Myoung Chung

Scheduling Tasks and Communications on a Hierarchical System with Message Contention . . . . . 89
   Jean-Yves Colin and Moustafa Nakechbandi

Spiking Neural P System Simulations on a High Performance GPU Platform . . . . . 99
   Francis George Cabarle, Henry Adorna, Miguel A. Martínez-del-Amor, and Mario J. Pérez-Jiménez

SpotMPI: A Framework for Auction-Based HPC Computing Using Amazon Spot Instances . . . . . 109
   Moussa Taifi, Justin Y. Shi, and Abdallah Khreishah

Investigating the Scalability of OpenFOAM for the Solution of Transport Equations and Large Eddy Simulations . . . . . 121
   Orlando Rivera, Karl Fürlinger, and Dieter Kranzlmüller

Shibboleth and Community Authorization Services: Enabling Role-Based Grid Access . . . . . 131
   Fan Gao and Jefferson Tan

A Secure Internet Voting Scheme . . . . . 141
   Md. Abdul Based and Stig Fr. Mjølsnes

A Hybrid Graphical Password Based System . . . . . 153
   Wazir Zada Khan, Yang Xiang, Mohammed Y. Aalsalem, and Quratulain Arshad

Privacy Threat Analysis of Social Network Data . . . . . 165
   Mohd Izuan Hafez Ninggal and Jemal Abawajy

IDCS 2011 Papers

Distributed Mechanism for Protecting Resources in a Newly Emerged Digital Ecosystem Technology . . . . . 175
   Ilung Pranata, Geoff Skinner, and Rukshan Athauda

Reservation-Based Charging Service for Electric Vehicles . . . . . 186
   Junghoon Lee, Gyung-Leen Park, and Hye-Jin Kim

Intelligent Ubiquitous Sensor Network for Agricultural and Livestock Farms . . . . . 196
   Junghoon Lee, Hye-Jin Kim, Gyung-Leen Park, Ho-Young Kwak, and Cheol Min Kim

Queue-Based Adaptive Duty Cycle Control for Wireless Sensor Networks . . . . . 205
   Heejung Byun and Jungmin So

Experimental Evaluation of a Failure Detection Service Based on a Gossip Strategy . . . . . 215
   Leandro P. de Sousa and Elias P. Duarte Jr.

On the Performance of MPI-OpenMP on a 12 Nodes Multi-core Cluster . . . . . 225
   Abdelgadir Tageldin Abdelgadir, Al-Sakib Khan Pathan, and Mohiuddin Ahmed

A Protocol for Discovering Content Adaptation Services . . . . . 235
   Mohd Farhan Md Fudzee and Jemal Abawajy

Securing RFID Systems from SQLIA . . . . . 245
   Harinda Fernando and Jemal Abawajy

Modeling QoS Parameters of VoIP Traffic with Multifractal and Markov Models . . . . . 255
   Homero Toral-Cruz, Al-Sakib Khan Pathan, and Julio C. Ramírez-Pacheco

Hybrid Feature Selection for Phishing Email Detection . . . . . 266
   Isredza Rahmi A. Hamid and Jemal Abawajy

M2A2 2011 Papers

On the Use of Multiplanes on a 2D Mesh Network-on-Chip . . . . . 276
   Cruz Izu

A Minimal Average Accessing Time Scheduler for Multicore Processors . . . . . 287
   Thomas Canhao Xu, Pasi Liljeberg, and Hannu Tenhunen

Fast Software Implementation of AES-CCM on Multiprocessors . . . . . 300
   Jung Ho Yoo

A TCM-Enabled Access Control Scheme . . . . . 312
   Gongxuan Zhang, Zhaomeng Zhu, Pingli Wang, and Bin Song

Binary Addition Chain on EREW PRAM . . . . . 321
   Khaled A. Fathy, Hazem M. Bahig, Hatem M. Bahig, and A.A. Ragb

A Portable Infrastructure Supporting Global Scheduling of Embedded Real-Time Applications on Asymmetric MPSoCs . . . . . 331
   Eugenio Faldella and Primiano Tucci

Emotional Contribution Process Implementations on Parallel Processors . . . . . 343
   Carlos Domínguez, Houcine Hassan, José Albaladejo, Maria Marco, and Alfons Crespo

A Cluster Computer Performance Predictor for Memory Scheduling . . . . . 353
   Mónica Serrano, Julio Sahuquillo, Houcine Hassan, Salvador Petit, and José Duato

HardBio 2011 Papers

Reconfigurable Hardware Computing for Accelerating Protein Folding Simulations Using the Harmony Search Algorithm and the 3D-HP-Side Chain Model . . . . . 363
   César Manuel Vargas Benítez, Marlon Scalabrin, Heitor Silvério Lopes, and Carlos R. Erig Lima

Clustering Nodes in Large-Scale Biological Networks Using External Memory Algorithms . . . . . 375
   Ahmed Shamsul Arefin, Mario Inostroza-Ponta, Luke Mathieson, Regina Berretta, and Pablo Moscato

Reconfigurable Hardware to Radionuclide Identification Using Subtractive Clustering . . . . . 387
   Marcos Santana Farias, Nadia Nedjah, and Luiza de Macedo Mourelle

A Parallel Architecture for DNA Matching . . . . . 399
   Edgar J. Garcia Neto Segundo, Nadia Nedjah, and Luiza de Macedo Mourelle

Author Index . . . . . 409
Table of Contents – Part I
ICA3PP 2011 Keynote

Keynote: Assertion Based Parallel Debugging . . . . . 1
   David Abramson

ICA3PP 2011 Regular Papers

Secure and Energy-Efficient Data Aggregation with Malicious Aggregator Identification in Wireless Sensor Networks . . . . . 2
   Hongjuan Li, Keqiu Li, Wenyu Qu, and Ivan Stojmenovic

Dynamic Data Race Detection for Correlated Variables . . . . . 14
   Ali Jannesari, Markus Westphal-Furuya, and Walter F. Tichy

Improving the Parallel Schnorr-Euchner LLL Algorithm . . . . . 27
   Werner Backes and Susanne Wetzel

Distributed Mining of Constrained Frequent Sets from Uncertain Data . . . . . 40
   Alfredo Cuzzocrea and Carson K. Leung

Set-to-Set Disjoint-Paths Routing in Recursive Dual-Net . . . . . 54
   Yamin Li, Shietung Peng, and Wanming Chu

Redflag: A Framework for Analysis of Kernel-Level Concurrency . . . . . 66
   Justin Seyster, Prabakar Radhakrishnan, Samriti Katoch, Abhinav Duggal, Scott D. Stoller, and Erez Zadok

Exploiting Parallelism in the H.264 Deblocking Filter by Operation Reordering . . . . . 80
   Tsung-Hsi Weng, Yi-Ting Wang, and Chung-Ping Chung

Compiler Support for Concurrency Synchronization . . . . . 93
   Tzong-Yen Lin, Cheng-Yu Lee, Chia-Jung Chen, and Rong-Guey Chang

Fault-Tolerant Routing Based on Approximate Directed Routable Probabilities for Hypercubes . . . . . 106
   Dinh Thuy Duong and Keiichi Kaneko

Finding a Hamiltonian Cycle in a Hierarchical Dual-Net with Base Network of p-Ary q-Cube . . . . . 117
   Yamin Li, Shietung Peng, and Wanming Chu

Adaptive Resource Remapping through Live Migration of Virtual Machines . . . . . 129
   Muhammad Atif and Peter Strazdins

LUTS: A Lightweight User-Level Transaction Scheduler . . . . . 144
   Daniel Nicácio, Alexandro Baldassin, and Guido Araújo

Verification of Partitioning and Allocation Techniques on Teradata DBMS . . . . . 158
   Ladjel Bellatreche, Soumia Benkrid, Ahmad Ghazal, Alain Crolotte, and Alfredo Cuzzocrea

Memory Performance and SPEC OpenMP Scalability on Quad-Socket x86_64 Systems . . . . . 170
   Daniel Molka, Robert Schöne, Daniel Hackenberg, and Matthias S. Müller

Anonymous Communication over Invisible Mix Rings . . . . . 182
   Ming Zheng, Haixin Duan, and Jianping Wu

Game-Based Distributed Resource Allocation in Horizontal Dynamic Cloud Federation Platform . . . . . 194
   Mohammad Mehedi Hassan, Biao Song, and Eui-Nam Huh

Stream Management within the CloudMiner . . . . . 206
   Yuzhang Han, Peter Brezany, and Andrzej Goscinski

Security Architecture for Virtual Machines . . . . . 218
   Udaya Tupakula, Vijay Varadharajan, and Abhishek Bichhawat

Fast and Accurate Similarity Searching of Biopolymer Sequences with GPU and CUDA . . . . . 230
   Robert Pawlowski, Bożena Malysiak-Mrozek, Stanislaw Kozielski, and Dariusz Mrozek

Read Invisibility, Virtual World Consistency and Probabilistic Permissiveness are Compatible . . . . . 244
   Tyler Crain, Damien Imbs, and Michel Raynal

Parallel Implementations of Gusfield's Cut Tree Algorithm . . . . . 258
   Jaime Cohen, Luiz A. Rodrigues, Fabiano Silva, Renato Carmo, André L.P. Guedes, and Elias P. Duarte Jr.

Efficient Parallel Implementations of Controlled Optimization of Traffic Phases . . . . . 270
   Sameh Samra, Ahmed El-Mahdy, Walid Gomaa, Yasutaka Wada, and Amin Shoukry

Scheduling Concurrent Workflows in HPC Cloud through Exploiting Schedule Gaps . . . . . 282
   He-Jhan Jiang, Kuo-Chan Huang, Hsi-Ya Chang, Di-Syuan Gu, and Po-Jen Shih

Efficient Decoding of QC-LDPC Codes Using GPUs . . . . . 294
   Yue Zhao, Xu Chen, Chiu-Wing Sham, Wai M. Tam, and Francis C.M. Lau

ICA3PP 2011 Short Papers

A Combined Arithmetic Logic Unit and Memory Element for the Design of a Parallel Computer . . . . . 306
   Mohammed Ziaur Rahman

Parallel Implementation of External Sort and Join Operations on a Multi-core Network-Optimized System on a Chip . . . . . 318
   Elahe Khorasani, Brent D. Paulovicks, Vadim Sheinin, and Hangu Yeo

STM with Transparent API Considered Harmful . . . . . 326
   Fernando Miguel Carvalho and Joao Cachopo

A Global Snapshot Collection Algorithm with Concurrent Initiators with Non-FIFO Channel . . . . . 338
   Diganta Goswami and Soumyadip Majumder

An Approach for Code Compression in Run Time for Embedded Systems – A Preliminary Results . . . . . 349
   Wanderson Roger Azevedo Dias, Edward David Moreno, and Raimundo da Silva Barreto

Optimized Two Party Privacy Preserving Association Rule Mining Using Fully Homomorphic Encryption . . . . . 360
   Md. Golam Kaosar, Russell Paulet, and Xun Yi

SLA-Based Resource Provisioning for Heterogeneous Workloads in a Virtualized Cloud Datacenter . . . . . 371
   Saurabh Kumar Garg, Srinivasa K. Gopalaiyengar, and Rajkumar Buyya

ΣC: A Programming Model and Language for Embedded Manycores . . . . . 385
   Thierry Goubier, Renaud Sirdey, Stéphane Louise, and Vincent David

Provisioning Spot Market Cloud Resources to Create Cost-Effective Virtual Clusters . . . . . 395
   William Voorsluys, Saurabh Kumar Garg, and Rajkumar Buyya

A Principled Approach to Grid Middleware: Status Report on the Minimum Intrusion Grid . . . . . 409
   Jost Berthold, Jonas Bardino, and Brian Vinter

Performance Analysis of Preemption-Aware Scheduling in Multi-cluster Grid Environments . . . . . 419
   Mohsen Amini Salehi, Bahman Javadi, and Rajkumar Buyya

Performance Evaluation of Open Source Seismic Data Processing Packages . . . . . 433
   Izzatdin A. Aziz, Andrzej M. Goscinski, and Michael M. Hobbs

Reputation-Based Resource Allocation in Market-Oriented Distributed Systems . . . . . 443
   Masnida Hussin, Young Choon Lee, and Albert Y. Zomaya

Cooperation-Based Trust Model and Its Application in Network Security Management . . . . . 453
   Wu Liu, Hai-xin Duan, and Ping Ren

Performance Evaluation of the Three-Dimensional Finite-Difference Time-Domain (FDTD) Method on Fermi Architecture GPUs . . . . . 460
   Kaixi Hou, Ying Zhao, Jiumei Huang, and Lingjie Zhang

The Probability Model of Peer-to-Peer Botnet Propagation . . . . . 470
   Yini Wang, Sheng Wen, Wei Zhou, Wanlei Zhou, and Yang Xiang

A Parallelism Extended Approach for the Enumeration of Orthogonal Arrays . . . . . 481
   Hien Phan, Ben Soh, and Man Nguyen

Author Index . . . . . 495
Lightweight Transactional Arrays for Read-Dominated Workloads

Ivo Anjo and João Cachopo

ESW INESC-ID Lisboa / Instituto Superior Técnico / Universidade Técnica de Lisboa
Rua Alves Redol 9, 1000-029 Lisboa, Portugal
{ivo.anjo,joao.cachopo}@ist.utl.pt
Abstract. Many common workloads rely on arrays as a basic data structure on top of which they build more complex behavior. Others use them because they are a natural representation for their problem domains. Software Transactional Memory (STM) has been proposed as a new concurrency control mechanism that simplifies concurrent programming. Yet, most STM implementations have no special representation for arrays. This results, on many STMs, in inefficient internal representations, where much overhead is added while tracking each array element individually, and on other STMs in false-sharing conflicts, because writes to different elements on the same array result in a conflict. In this work we propose new designs for array implementations that are integrated with the STM, allowing for improved performance and reduced memory usage for read-dominated workloads, and present the results of our implementation of the new designs on top of the JVSTM, a Java library STM. Keywords: Parallel Programming, Software Transactional Memory.
1 Introduction
Software Transactional Memory (STM) [10, 15] is a concurrency control mechanism for multicore and multiprocessor shared-memory systems, aimed at simplifying concurrent application development. STM provides features such as atomicity and isolation for program code, while eliminating common pitfalls of concurrent programming such as deadlocks and data races. During a transaction, most STMs internally work by tracking the memory read and write operations done by the application on thread-local read and write-sets. Tracking this metadata adds overheads to applications that depend on the granularity of transactional memory locations.

There are two main STM designs regarding granularity: Either word-based [4, 8] or object-based [7, 11]. Word-based designs associate metadata with either each individual memory location, or by mapping them to a fixed-size table; whereas object-based designs store
This work was supported by FCT (INESC-ID multiannual funding) through the PIDDAC Program funds and by the RuLAM project (PTDC/EIA-EIA/108240/2008).
Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 1–13, 2011.
© Springer-Verlag Berlin Heidelberg 2011
2
I. Anjo and J. Cachopo
transactional information on each object or structure’s header, and all of the object’s fields share the same piece of transactional metadata. Arrays, however, are not treated specially by STM implementations. Thus, programmers either use an array of transactional containers in each position, or they wrap the entire array with a transactional object. Neither option is ideal, if we consider that array elements may be randomly but infrequently changed. Because arrays are one of the most elemental data structures on computing systems, if we hope to extend the usage of STM to provide synchronization and isolation to array-heavy applications, minimizing the imposed overhead is very important. In this paper, we describe how existing transactional arrays are implemented, and explore new approaches that are integrated with the STM, achieving better performance and reducing memory usage for read-dominated workloads. Our work is based on the Java Versioned Software Transactional Memory (JVSTM) [2, 3], a multi-version STM. The rest of this work is organized as follows. Section 2 introduces the JVSTM transactional memory. Section 3 describes current black-box approaches to arrays. Section 4 introduces the new proposals for handling arrays. In Section 5, we compare the different array implementations. Experimental results are presented in Section 6, followed, in Section 7, by a survey of related work. Finally, in Section 8, we finish by presenting the conclusions and future research directions.
2 The JVSTM Software Transactional Memory
The Java Versioned Software Transactional Memory (JVSTM) is a pure Java library implementing an STM [3]. JVSTM introduces the concept of versioned boxes [2], which are transactional locations that may be read and written during transactions, in much the same way as in other STMs, except that they keep the history of values written to them by any committed transaction.

Programmers using the JVSTM must use instances of the VBox class to represent the shared mutable variables of a program that they want to access transactionally. In Java, those variables are either class fields (static or not) or array components (each element of an array). As an example, consider a field f of type T in a class C whose instances may be accessed concurrently. To access f transactionally, the programmer must do two things: (1) transform the field f in C into a final field that holds an instance of type VBox<T>, and (2) replace all the previous accesses to f by the corresponding operations on the contents of the box now contained in f.

JVSTM implements versioned boxes by keeping a linked-list of VBoxBody instances inside each VBox: Each VBoxBody contains both the version number of the transaction that committed it and the value written by that transaction. This list of VBoxBody instances is sorted in descending order of the version number, with the most recent at the head. The key idea of this design is that transactions typically need to access the most recent version of a box, which is only one indirection-level away from the box object.
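The versioned-box history just described can be condensed into a small sketch. Class and field names mirror the text, but this is an illustration only, not the JVSTM code: the real implementation adds commit synchronization and the garbage collection discussed below.

```java
// Minimal sketch of a versioned box: a linked list of bodies, newest first.
class VBoxBody<T> {
    final T value;              // value written by the committing transaction
    final int version;          // version number of that transaction
    final VBoxBody<T> next;     // older body; list sorted in descending version order
    VBoxBody(T value, int version, VBoxBody<T> next) {
        this.value = value; this.version = version; this.next = next;
    }
}

class VersionedBox<T> {
    private VBoxBody<T> body;   // head of the history: the most recent version

    void commit(T value, int txVersion) {      // write-back at commit time
        body = new VBoxBody<T>(value, txVersion, body);
    }

    // A transaction reads the most recent body not newer than its own version
    T read(int txVersion) {
        for (VBoxBody<T> b = body; b != null; b = b.next)
            if (b.version <= txVersion) return b.value;
        throw new IllegalStateException("no body with version <= " + txVersion);
    }
}
```

Note how a transaction that started at version 3 still reads the value committed at version 1 even after a commit at version 5, which is what makes delayed read-only transactions safe in this design.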
Yet, because the JVSTM keeps all the versions that may be needed by any of the active transactions, a transaction that got delayed for some reason can still access a version of the box that ensures that it will always perform consistent reads: The JVSTM satisfies the opacity correctness criteria [9]. In fact, a distinctive feature of the JVSTM is that read-only transactions are lock-free and never conflict with other transactions. They are also very lightweight, because there is no need to keep read-sets or write-sets: Each read of a transactional location consists only of traversing the linked-list to locate the correct VBoxBody from which the value is to be read. These two characteristics make the JVSTM especially suited for applications that have a high read/write transaction ratio. Currently there are two versions of the JVSTM that differ on their commit algorithm. The original version of the JVSTM uses a lock-based commit algorithm, described below, whereas more recently Fernandes and Cachopo described a lock-free commit algorithm for the JVSTM [6]. Unless otherwise stated, the approaches described in this paper apply to both versions of the JVSTM. To synchronize the commits of read-write transactions, the lock-based JVSTM uses a single global lock: Any thread executing a transaction must acquire this lock to commit its results, which means that all commits (of read-write transactions) execute in mutual exclusion. After the lock acquisition, the committing transaction validates its read-set and, if valid, writes-back its values to new VBoxBody instances, which are placed at the head of each VBox’s history of values. To prevent unbounded growth of the memory used to store old values for boxes, the JVSTM implements a garbage collection algorithm, which works as follows: Each committing transaction creates a list with all the newly created instances of VBoxBody and stores this list on its descriptor. 
The transaction descriptors themselves also form a linked-list of transactions, with increasing version numbers. When the JVSTM detects that no transactions are running with version number older than some descriptor, it cleans the next field of each VBoxBody instance in the descriptor, allowing the Java GC to clean the old values.
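Putting the VBox API from this section together, the field transformation described earlier looks roughly as follows. The Counter class and its initial value are made-up examples, and the VBox stand-in below keeps only the current value rather than the full version history:

```java
// Illustrative stand-in for the JVSTM VBox API (the real class keeps the
// version history discussed above).
class VBox<T> {
    private T value;
    VBox(T initial) { value = initial; }
    T get() { return value; }                   // transactional read in the real API
    void put(T newValue) { value = newValue; }  // transactional write
}

// Before: class Counter { private int count; void inc() { count++; } }
// After the two-step transformation described in the text:
class Counter {
    private final VBox<Integer> count = new VBox<Integer>(0); // (1) final field holding a box
    void inc() { count.put(count.get() + 1); }                // (2) accesses go through the box
    int current() { return count.get(); }
}
```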
3 Current Black-Box Array Implementations
In this section, we describe the two most common alternatives to implement transactional arrays with the JVSTM if we use only its provided API — that is, if we use the JVSTM as a black-box library.

3.1 Array of Versioned Boxes
The most direct and commonly used way of obtaining a transactional array with the JVSTM is the array of VBoxes. A graphical representation of the resulting structure is shown in Figure 1. One of the shortcomings of this approach is the array initialization: All positions on the array need to be initialized with a VBox before they are used, typically as soon as the array is created and before it is published. Trying to perform lazy initialization highlights one of the issues of implementing such a data-structure outside the STM: the underlying native Java array is
Fig. 1. Array of versioned boxes
Fig. 2. Versioned box with array
not under the control of the STM, and as such the programmer must provide his own synchronization mechanism for this operation. Side-stepping the synchronization provided by the STM while at the same time using the STM must be done carefully, or key STM characteristics might be lost, such as lock-freedom and atomicity, and common concurrent programming issues such as deadlocks might arise again. We will see in Section 4.1 a variant of this approach that uses lazy initialization and knowledge of the JVSTM's internals.

Since all VBoxes and their associated VBoxBody instances are normal Java objects, they still take up a considerable amount of memory when compared to the amount needed to store each reference on the VBox array. As such, it is not unexpected for the application to spend more than twice the space needed for the native array to store these instances in memory.

3.2 Versioned Box with Array
The other simple implementation of a transactional array is one where a single VBox keeps the entire array, as shown in Figure 2. Creation of this kind of array is straightforward, with overheads comparable to a normal non-transactional array. Array reads are the cheapest possible, only adding the cost of looking up the correct VBoxBody to read from; but writes are very expensive, as they need to duplicate the entire array just to change one of the positions. In addition, a single array write conflicts with every other (non read-only) transaction that is concurrently accessing the array, as the conflict detection granularity is the VBox holding the entire array. Moreover, there is a very high overhead in keeping the history of values: For each version, an entire copy of the array is kept, even if only one element of the array was changed. This may lead the system to run out of memory very quickly, if writes to the array are frequent and some old running transaction prevents the garbage collector from running. In conclusion, this approach is suited only for very specific workloads, with zero or almost-zero writes to the array. On the upside, for those workloads, it offers performance comparable to native arrays, while still benefiting from transactional properties. It is also the only approach that allows the underlying array to change size and dimensions dynamically with no extra overhead.
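The write cost just described can be made concrete with a sketch. This is illustrative only (class and method names are made up); in the real design the copied array would become a new version in the box's history rather than simply replacing the old one:

```java
import java.util.Arrays;

// Sketch of the "versioned box with array" write path: changing one element
// requires duplicating the whole array.
class BoxedArray<T> {
    private T[] contents;                  // the single box holds the whole array

    BoxedArray(T[] initial) { contents = initial; }

    T get(int index) { return contents[index]; }  // reads cost almost nothing extra

    void set(int index, T value) {
        T[] copy = Arrays.copyOf(contents, contents.length); // O(n) copy...
        copy[index] = value;                                  // ...to change one slot
        contents = copy;  // the old array would stay reachable as an old version
    }
}
```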
Type value = getVBox(index).get();   // Reading from a VBoxArray
getVBox(index).put(newValue);        // Writing to a VBoxArray

VBox getVBox(int index) {            // Helper method getVBox
    VBox vbox = transArray[index];
    if (vbox == null) {
        vbox = new VBox((VBoxBody) null);
        vbox.commit(null, 0);
        if (!unsafe.compareAndSwapObject(transArray, ..., null, vbox))
            vbox = transArray[index];
    }
    return vbox;
}
Fig. 3. Code for the VBoxArray approach

4 New Array Proposals
In this section, we describe three proposals to implement transactional arrays that improve on the black-box approaches presented in the previous section.

4.1 VBoxArray and VBodyArray
The VBoxArray approach is obtained by adding lazy creation and initialization of VBoxes to the approach presented in Section 3.1. The main operations for this implementation are shown in Figure 3. The getVBox() helper method first tries to obtain a VBox from the specified array position. If it exists, it is returned; otherwise a new one is created using an empty body that is immediately written back, and tagged with version 0. This is conceptually the same as if the VBox had been created by a transaction that ran before every other transaction and initialized all the boxes. The VBox is then put into the array in an atomic fashion: Either the compareAndSwap¹ operation succeeds, and the box is placed on the underlying array, or it fails, meaning that another thread already initialized it.

We can take the VBoxArray one step further and obtain the VBodyArray by doing away with the VBoxes altogether. The insight is that a VBox is needed only to uniquely identify a memory location on which we can transactionally read and write. If we provide our transactional array inside a wrapper VBodyArray class, we can use another method to uniquely identify a memory position: the pair ⟨array, index⟩. Using this pair, we no longer need the VBoxes, because the underlying array can directly contain the VBoxBody instances that would normally be kept inside them; initialization can still be done lazily. The VBodyArray saves a considerable amount of memory for larger arrays, and also lowers overhead on reads, as fewer memory reads need to be done to reach the values.
¹ Available in the sun.misc.Unsafe class included in most JVM implementations.
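The ⟨array, index⟩ idea can be sketched as follows. This is a simplification with made-up names: the paper's code uses Unsafe compare-and-swap on a plain array, whereas this sketch uses AtomicReferenceArray, and the commit method is assumed to run inside the mutually exclusive commit of the lock-based JVSTM:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of the VBodyArray idea: array slots hold the body chains directly,
// so a transactional location is just the (array, index) pair, with no VBoxes.
class BodyArray<T> {
    static final class Body<T> {
        final T value; final int version; final Body<T> next;
        Body(T value, int version, Body<T> next) {
            this.value = value; this.version = version; this.next = next;
        }
    }

    private final AtomicReferenceArray<Body<T>> bodies;

    BodyArray(int size) { bodies = new AtomicReferenceArray<Body<T>>(size); }

    T read(int index, int txVersion) {
        Body<T> b = bodies.get(index);
        if (b == null) {  // lazy initialization: as if version 0 had written null
            bodies.compareAndSet(index, null, new Body<T>(null, 0, null));
            b = bodies.get(index);   // either our body or a concurrent thread's
        }
        for (; b != null; b = b.next)
            if (b.version <= txVersion) return b.value;
        return null;
    }

    // Assumed to run under the JVSTM's exclusive commit, hence the plain set
    void commit(int index, T value, int txVersion) {
        bodies.set(index, new Body<T>(value, txVersion, bodies.get(index)));
    }
}
```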
Fig. 4. The VArray transactional array
Type value = array.values.get(index); // Read value from array (volatile read!)
int version = array.version;          // Read array version
// If the array did not change, return the value read, otherwise check the log
if (version <= n) return value;       // n: version number of the reading transaction

Otherwise, the log lookup finds the first node with version >= n. It then checks that node for the index, by performing a binary search on the logEntryIndexes array. If this search finds the index, it returns the corresponding value. Otherwise, the search is resumed from the previous node, until a value is found, or the beginning of the log is reached, meaning that the requested value should be read from the main array.

Synchronization. As we saw, the read algorithm first reads the value from the array, and then reads its version. To commit a new value we reverse this order: First the committer updates the version, and then writes back the new values. Yet, without additional synchronization, we have a data race and the following can happen: The update of the array value may be reordered with the update of the version, which means that a reader may read the new value written by the committing transaction, but still read the old version value, causing the algorithm to return an invalid (newer) value to the application.

To solve this issue, and taking into account the Java memory model [12], we might be inclined to make the field that stores the array version volatile. Unfortunately, this will not work: If the committing thread first does a volatile write on the array version, and then updates the array, and the reading thread does not observe the write to the array version, then no synchronizes-with³ relation happens, and so the update to the array value may be freely reordered before the version write, making a reader read the new value, and miss the new version. The other possible option would be for the committing thread to first write back the value, and then update the array version with a volatile write; in this case, a simple delay or context switch between the two writes would cause issues.
As such, we can see that no ordering of writes to update both the array value and version can work correctly if just the version is declared volatile. As it turns out, the commit algorithm works correctly if only the array value is read and written with volatile semantics (through the usage of the AtomicReferenceArray class), and the version as a normal variable. This way, the reader can never read a newer value and an old version, because, by the definition of volatile, if we observe a value, we at least observe the correct version for that value, but may also
³ The volatile keyword, when applied to a field, states that if a thread t1 writes to normal field f1 and then to volatile field f2, then if another thread observes the write on f2, it is guaranteed that it will also see the write to f1, and also every other write done by t1 before the write to f2. This is called a synchronizes-with [12] relationship.
Table 1. Comparison of array implementations. The memory overheads are considered for two workloads: a workload where only a single position is ever used after the array is created, and one where the entire array is used.
[Table body not legible in the source.]
observe a later version, which poses no problem: In both cases the algorithm will correctly decide to check the log.

Garbage Collection. We also extended the JVSTM garbage collection algorithm to work with the VArray log. As the linked-list structure of the array log is similar to the linked list of bodies inside a VBox, new instances of VArrayLogNode that are created during transaction commit are also saved in the transaction descriptor, and from then on the mechanism described in Section 2 is used.
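The ordering argument above can be condensed into a sketch. The class below is hypothetical: as described in the text, only the value array has volatile semantics (through AtomicReferenceArray), while the version field is a plain variable; the fallback to the log is omitted.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Condensed sketch of the VArray read/commit ordering discussed above.
class TinyVArray<T> {
    final AtomicReferenceArray<T> values;  // volatile-semantics reads and writes
    int version = 0;                       // deliberately NOT volatile

    TinyVArray(int size) { values = new AtomicReferenceArray<T>(size); }

    // Commit: first advance the version, then publish the value. The volatile
    // write to `values` also publishes the preceding plain version update.
    void commitWrite(int index, T newValue, int newVersion) {
        version = newVersion;
        values.set(index, newValue);
    }

    // Read: first the value (volatile), then the version. If the version is
    // within the reader's snapshot, the value is valid; otherwise the real
    // implementation would consult the log (omitted in this sketch).
    T read(int index, int txVersion) {
        T value = values.get(index);
        int v = version;
        if (v <= txVersion) return value;
        throw new IllegalStateException("array changed: consult the log");
    }
}
```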
5 Comparison of Approaches
Table 1 summarizes the key characteristics of the multiple approaches described in this paper. The single position memory overhead test case considers an array of n positions, where, after creation, only one of those positions is ever used during the entire program; conversely the entire array test case considers one where every position of the array is used. The memory overheads considered are in addition to a native array of size n, which all implementations use. The main objective of this work was the creation of an array implementation that provided better performance for read-only operations, while minimizing memory usage and still supporting write operations without major overheads. We believe VArray fulfills those objectives, as it combines the advantages of the “VBox with Array” approach, such as having a very low memory footprint and read overhead, with advantages from other approaches, notably conflict detection done at the array position level, and low history overhead. Writes to a VArray are still more complex than most other approaches, but as we will see in Section 6 they can still be competitive.
6 Experimental Results
We shall now present experimental results of the current implementation of VArray. They were obtained on two machines: one with two Intel Xeon E5520 processors (8 cores total) and 32GB of RAM, and another with four AMD Opteron 6168 processors (48 cores total) and 128GB of RAM, both running Ubuntu
Fig. 8. Comparison of VArray versus the Array of VBoxes approach for the array benchmark, with a read-only workload on our two test systems
Fig. 9. Comparison of VArray versus the Array of VBoxes approach for the array benchmark, with varying number of read-write transactions (10%, 50% and 100%) on the 48-core AMD machine
10.04.2 LTS 64-bit and Oracle Java 1.6.0_22. For our testing, we compared VArray to the Array of VBoxes approach, using the array benchmark⁴, which can simulate multiple array-heavy workloads. Before each test, the array was entirely initialized; note that after being fully initialized, the Array of Versioned Boxes and VBoxArray behave similarly. Each test was run multiple times, and the results presented are the average over all executions.

Figure 8 shows the scaling of VArray versus the Array of VBoxes approach for a read-only workload, with a varying number of threads. Each run consisted of timing the execution of 1 million transactions, with an array size of 1,000,000 on the 8-core machine, and 10,000,000 on the 48-core machine. Due to the reduced overheads imposed on array reads, VArray presents better performance.

Figure 9 shows the scaling of VArray versus an Array of VBoxes approach for a workload with a varying percentage of read-only and read-write transactions. Each read-only transaction reads 1000 (random) array positions, and each
⁴ http://web.ist.utl.pt/sergio.fernandes/darcs/array/
read-write transaction reads 1000 array positions and additionally writes to 10. Each run consisted of timing the execution of 100,000 transactions. As we can see, the increased write overhead of VArray eventually takes its toll and, beyond a certain number of cores (that depends on the percentage of read-write transactions), VArray presents worse results than the Array of VBoxes approach. These results show that while VArray is better suited for read-only workloads, if needed it can still support a moderate read-write workload.

To test the memory overheads of VArray, we measured the minimum amount of memory needed to run a read-only workload in the array benchmark, on a single CPU, for an array with 10 million Integer objects. Due to its design, VArray was able to complete the benchmark using only 57MB of RAM, 10% of the 550MB needed by the Array of VBoxes approach.

Finally, we measured, using a workload comprised of 10% read-write transactions and 90% read-only transactions, and 4 threads, the minimum memory needed for both approaches to present acceptable performance, when compared with a benchmark run with a large heap. In this test, VArray took approximately 25% longer to execute with a 256MB heap, when compared to a 3GB heap; runs with an Array of VBoxes needed at least 800MB and also took 25% longer.
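For reference, the transaction shapes used in these runs, as described above, are roughly the following. This is a hypothetical sketch of the workload only, not the actual array benchmark or its transactional machinery:

```java
import java.util.Random;

// Workload shapes from the text: read-only transactions read 1000 random
// positions; read-write transactions additionally write to 10 positions.
class WorkloadSketch {
    static long readOnlyTx(int[] array, Random rnd) {
        long sum = 0;
        for (int i = 0; i < 1000; i++) {
            sum += array[rnd.nextInt(array.length)];  // 1000 random reads
        }
        return sum;
    }

    static void readWriteTx(int[] array, Random rnd) {
        readOnlyTx(array, rnd);                       // 1000 reads...
        for (int i = 0; i < 10; i++) {
            array[rnd.nextInt(array.length)] = i;     // ...plus 10 writes
        }
    }
}
```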
7 Related Work
Software Transactional Memory (STM) [15] is an optimistic approach to concurrency control on shared-memory systems. Many implementations have been proposed — Harris et al.'s book [10] provides a very good overview of the subject.

CCSTM [1] is a library-based STM for Scala based on SwissTM [5]. Similarly to the JVSTM, the programmer has to explicitly make use of a special type of reference that mediates access to an STM-managed mutable value. Multiple memory locations can share the same STM metadata, enabling several levels of granularity for conflict detection. CCSTM also provides a transactional array implementation that eliminates some of the indirections needed to access transactional metadata, similar to our VBodyArray approach.

The DSTM2 [11] STM framework allows the automatic creation of transactional versions of objects based on supplied interfaces. Fields on transactional objects are allowed to be either scalar or other transactional types, which disallows arrays; to work around this issue, DSTM2 includes the AtomicArray class that provides its own specific synchronization and recovery, but no further details on its implementation are given.

Another approach to reducing the memory footprint of STM metadata on arrays and other data structures is changing the granularity of conflict detection. Word-based STMs such as Fraser and Harris's WSTM [8] and TL2 in per-stripe mode [4] use a hash function to map memory addresses to a fixed-size transactional metadata table; hash collisions may result in false positives, but memory usage is bounded by the chosen table size. Marathe et al. [13] compared word-based with object-based STMs, including the overheads added and memory usage; one of their conclusions is that the studied systems incur significant bookkeeping overhead for read-only transactions.
12
I. Anjo and J. Cachopo
Riegel and Brum [14] studied the impact of word-based versus object-based STMs for unmanaged environments, concluding that object-based STMs can reach better performance than purely word-based STMs. Our VArray implementation is novel because it presents the same memory overheads of word-based schemes, while still detecting conflicts for each individual array position. Processing overhead for read-write transactions is still larger than with word-based approaches, because the transaction read-set must contain all individual array positions that were read, and all of them must be validated at commit-time, which is something word-based STMs can further reduce.
8 Conclusions and Future Work
Software transactional memory is a very promising approach to concurrency. Still, to expand into most application domains, many research and engineering issues need to be examined and solved. The usage of arrays is one such issue. In this work we presented the first comprehensive analysis of transactional array designs, described how arrays are currently implemented on top of the JVSTM, and presented two implementations that improve on previous designs. In particular, the VArray implementation has memory usage comparable to native arrays, while preserving the lock-free property of JVSTM's read-only transactions. In addition, our experimental results show that VArray is highly performant for read-dominated workloads, and competitive for read-write workloads. Future research directions include researching the possibility of a lock-free VArray commit algorithm, and exploring the usage of Bloom filters for log lookups.
References

1. Bronson, N., Chafi, H., Olukotun, K.: CCSTM: A library-based STM for Scala
2. Cachopo, J., Rito-Silva, A.: Versioned boxes as the basis for memory transactions. Science of Computer Programming 63(2), 172–185 (2006)
3. Cachopo, J.: Development of Rich Domain Models with Atomic Actions. Ph.D. thesis, Technical University of Lisbon (2007)
4. Dice, D., Shalev, O., Shavit, N.: Transactional locking II. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 194–208. Springer, Heidelberg (2006)
5. Dragojević, A., Guerraoui, R., Kapalka, M.: Stretching transactional memory. ACM SIGPLAN Notices 44, 155–165 (2009)
6. Fernandes, S., Cachopo, J.: Lock-free and scalable multi-version software transactional memory. In: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, pp. 179–188. ACM, New York (2011)
7. Fraser, K., Harris, T.: Practical lock-freedom. Tech. rep. (2004)
8. Fraser, K., Harris, T.: Concurrent programming without locks. ACM Trans. Comput. Syst. 25 (2007)
9. Guerraoui, R., Kapalka, M.: On the correctness of transactional memory. In: PPoPP 2008: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 175–184. ACM, New York (2008)
10. Harris, T., Larus, J., Rajwar, R.: Transactional memory. Synthesis Lectures on Computer Architecture 5(1), 1–263 (2010)
11. Herlihy, M., Luchangco, V., Moir, M.: A flexible framework for implementing software transactional memory. ACM SIGPLAN Notices 41(10), 253–262 (2006)
12. Manson, J., Pugh, W., Adve, S.: The Java Memory Model
13. Marathe, V.J., Scherer, W.N., Scott, M.L.: Design tradeoffs in modern software transactional memory systems. In: Proceedings of the 7th Workshop on Languages, Compilers, and Run-Time Support for Scalable Systems, LCR 2004, pp. 1–7. ACM, New York (2004)
14. Riegel, T., Brum, D.B.D.: Making object-based STM practical in unmanaged environments. In: TRANSACT 2008: 3rd Workshop on Transactional Computing (2008)
15. Shavit, N., Touitou, D.: Software transactional memory. Distributed Computing 10(2), 99–116 (1997)
Massively Parallel Identification of Intersection Points for GPGPU Ray Tracing

Alexandre S. Nery¹,³, Nadia Nedjah², Felipe M.G. França¹, and Lech Jozwiak³

¹ LAM – Computer Architecture and Microelectronics Laboratory, Systems Engineering and Computer Science Program, COPPE, Universidade Federal do Rio de Janeiro
² Department of Electronics Engineering and Telecommunications, Faculty of Engineering, Universidade do Estado do Rio de Janeiro
³ Department of Electrical Engineering – Electronic Systems, Eindhoven University of Technology, The Netherlands
Abstract. The latest advancements in computer graphics architectures, such as the replacement of some fixed stages of the pipeline with programmable stages (shaders), have been enabling the development of parallel general-purpose applications on massively parallel graphics architectures (Streaming Processors). For years the graphics processing unit (GPU) has been optimized for increasingly high throughput of massively parallel floating-point computations. However, only applications that exhibit Data Level parallelism can achieve substantial acceleration on such architectures. In this paper we present a parallel implementation of the GridRT architecture for GPGPU ray tracing. This architecture can expose two levels of parallelism in ray tracing: parallel ray processing and parallel intersection tests. We also present a traditional parallel implementation of ray tracing in GPGPU, for comparison against the GridRT-GPGPU implementation.
1 Introduction
High-fidelity computer-generated images are one of the main goals in the Computer Graphics field. Given a 3-D scene, usually described by a set of 3-D primitives (e.g. triangles), a typical rendering algorithm creates a corresponding image by several matrix computations and space transformations applied to the 3-D scene, together with many per-vertex shading computations [1]. All these computations are organized in pipeline stages, each one performing many SIMD floating-point operations in parallel. The Graphics Processing Unit (GPU) is also known as a Stream Processor because of such a massively parallel pipeline organization, which continuously processes a stream of input data through the pipeline stages. In the final stage, all primitives are rasterized to produce an image (a.k.a. frame). In order to achieve real-time rendering speed it is necessary to produce at least 60 frames per second (fps), so that the change between frames is not perceived and interactivity is ensured. The Streaming Processor model of current GPU architectures can deliver enough throughput of frame rates for most 3-D scenarios, but at the cost

Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 14–23, 2011.
© Springer-Verlag Berlin Heidelberg 2011
of a lower degree of realism in each produced frame. For example, important Global Illumination effects like shadows and reflections must be handled at the application level, because the hardware is based on a Local Illumination model and, thus, is specialized in processing 3-D primitives only [4].

Although the ray tracing algorithm [10] is also a computer graphics application for rendering 3-D scenes, the algorithm operates in opposition to traditional rendering algorithms [1]. For instance, instead of projecting the primitives onto the viewplane, where the final image is produced, the ray tracing algorithm fires rays towards the scene and traces their path in order to identify the visible objects, their properties, and the light trajectory within the scene, through several intersection computations. In the end, all this information is merged to produce the final image. For that reason, ray tracing is a computationally costly application that can produce high-fidelity images of a 3-D scene, with shadow and reflection effects. Besides, the algorithm has a very high parallelization potential, because each ray can be processed independently from the others, usually achieving almost linear acceleration in a parallel implementation [4]. Thus, there are parallel implementations on Clusters [11] and Shared Memory Systems [2], using spatial subdivision of the 3-D scene. Parallel implementations on GPGPUs (General Purpose Graphics Processing Units) have also achieved substantial results [9].

Some stages of the pipeline, such as the Vertex and Geometry processing stages, have recently evolved into programmable Shaders, which can be programmed to perform different algorithms [6]. So, the GPU is no longer dedicated to running graphics-related algorithms, but can also run general-purpose parallel algorithms that benefit from the massively parallel architecture of modern GPGPUs.
For instance, Data Level parallel applications in general achieve high acceleration when mapped to GPGPUs, because these applications perform well on SIMD machines [9]. However, if control flow and recursion are strongly required, which is often the case for ray tracing, then existing Von Neumann architectures may be a better option, as they are for Task Level parallel applications in general. In ray tracing, every ray can be processed independently, in parallel, but each ray must be tested for intersections against the primitives of the 3-D scene and, if there is an intersection, the computation may proceed in many different ways. So, the work granularity is at the task level and each task may execute through different branches, which makes control flow and recursion a big issue for ray tracing. Hence, there are consistent approaches to accelerating ray tracing with custom parallel architectures in hardware, as in [8,13], operating at low frequencies: the low frequency of operation is compensated by the parallelism of the custom design, and several limitations can be overcome by a custom hardware design. In general, the target device is a Field Programmable Gate Array (FPGA), which can be used to prototype the design; later an Application Specific Integrated Circuit (ASIC) can be produced, operating at much higher frequencies. Throughout this paper we briefly describe our GridRT parallel architecture for ray tracing and we present a GPGPU implementation of the architecture
A.S. Nery et al.
in CUDA, exhibiting Task Level parallelism of rays and Data Level parallelism of intersection computations. The CUDA kernel algorithm, which corresponds to the GridRT parallel architecture with some minor modifications, is also described. In the end, we present performance results for two modern Nvidia Fermi GPUs, the GTX 460 and GTX 465. Furthermore, we describe a traditional parallel ray tracer implementation in GPGPU, for comparison with the GridRT-GPGPU. The rest of this paper is organized as follows. First, Section 2 briefly explains the ray tracing algorithm. Then, Section 3 shows a traditional parallel implementation of the algorithm in GPGPUs. After that, Section 4 presents the GridRT architecture before the GPGPU implementation is presented in Section 5. Finally, Section 6 presents performance results, while Section 7 draws the conclusions of this work.
2
Ray Tracing
The ray tracing algorithm is briefly explained in this section; further details can be found in [10]. The first step of the algorithm is the setup of a virtual camera, so that primary rays can be fired towards the scene. Each primary ray passes through a pixel of the camera's viewplane, where the final image is going to be captured. For every primary ray, a simple and straightforward ray tracing algorithm usually computes intersection tests against all the 3-D primitives of the scene, looking for the primitives (objects) that are visible from the camera's perspective. If an intersection is found, the object's properties are used to determine whether the ray will be reflected, refracted or completely absorbed. If the ray is reflected or refracted, the algorithm is recursively executed to determine the objects that are visible from the perspective of the previous intersection point, which is why the algorithm can naturally produce mirror-like effects in the final image. On the other hand, if the ray is absorbed, the processing ends and all the information gathered up to that point is merged to compose the color of the corresponding pixel of the viewplane. This ray tracing style is known as Whitted-style ray tracing [12]. The program's main entry point is presented in Algorithm 1, in which the primary rays are traced. The trace procedure in Algorithm 1 is responsible for determining the closest intersection point, while the shade procedure (called by the trace procedure) is responsible for coloring the pixel and recursively calling the trace procedure in case the intersected object's surface is specular or transparent. For the sake of simplicity and brevity, these two procedures are not described in this work. Further details on shading algorithms can be found in [1].
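The control flow described above can be simulated in plain C++. The sketch below is illustrative only: a hypothetical scene of spheres stands in for a triangle mesh, the camera is reduced to an orthographic one, and trace returns a flat per-object color instead of calling a recursive shade procedure.

```cpp
#include <cmath>
#include <vector>

// Minimal sketch of primary-ray tracing: fire one ray per pixel and
// trace it against the scene. A sphere list stands in for the 3-D
// triangle mesh, and shading is reduced to a per-object color value.
struct Vec3 { double x, y, z; };
struct Ray  { Vec3 origin, dir; };
struct Sphere { Vec3 center; double radius; double color; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static double dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Ray/sphere intersection: returns the closest positive t, or -1.
static double intersect(const Ray& r, const Sphere& s) {
    Vec3 oc = sub(r.origin, s.center);
    double a = dot(r.dir, r.dir);
    double b = 2.0 * dot(oc, r.dir);
    double c = dot(oc, oc) - s.radius * s.radius;
    double disc = b*b - 4.0 * a * c;
    if (disc < 0.0) return -1.0;
    double t = (-b - std::sqrt(disc)) / (2.0 * a);
    return t > 0.0 ? t : -1.0;
}

// trace(): find the closest intersection and return its color
// (0.0 = background). Recursive reflection/refraction is omitted.
double trace(const std::vector<Sphere>& scene, const Ray& ray) {
    double bestT = 1e30, color = 0.0;
    for (const Sphere& s : scene) {
        double t = intersect(ray, s);
        if (t > 0.0 && t < bestT) { bestT = t; color = s.color; }
    }
    return color;
}

// The primary-ray double loop: one ray per viewplane pixel.
std::vector<double> render(const std::vector<Sphere>& scene, int w, int h) {
    std::vector<double> image(w * h, 0.0);
    for (int i = 0; i < w; ++i)
        for (int j = 0; j < h; ++j) {
            // Orthographic camera looking down +z (a simplification).
            Ray ray{{(double)i, (double)j, -10.0}, {0.0, 0.0, 1.0}};
            image[j * w + i] = trace(scene, ray);
        }
    return image;
}
```

A pixel whose ray pierces a sphere receives that sphere's color; all other pixels keep the background value.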
In order to avoid intersection computations between each ray and the whole scene, a spatial subdivision of the scene can be applied to select only those objects that lie in the direction of a given ray, avoiding unnecessary computation. There are several spatial subdivision techniques, such as Binary Space Partitioning trees, Bounding Volume Hierarchies, KD-Trees and Uniform Grids [10,1], each of them with its own advantages and disadvantages. For instance, the
Algorithm 1. Ray Tracing primary rays
1: 3-D scene = load3DScene(file);
2: viewplane = setupViewplane(width, height);
3: camera = setupCamera(viewplane, eye, view direction);
4: depth = 0;
5: for i = 1 to viewplane's width do
6:   for j = 1 to viewplane's height do
7:     ray = getPrimaryRay(i, j, camera);
8:     image[i][j] = trace(3-D scene, ray, depth);
KD-Tree structure adapts very well to the 3-D scene and, hence, selects fewer objects than the other techniques. However, the KD-Tree is more expensive and complex to build, as is the algorithm used to traverse the tree structure [5]. On the other hand, the Uniform Grid structure is less expensive to build and its traversal algorithm is very fast [10], but the structure is not adaptive and, because of that, may select a few more objects for intersection tests or perform extra traversal steps through empty areas of the 3-D scene. In this work we use the Uniform Grid structure, which is the basis of the GridRT parallel architecture [8].
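To make the voxel structure concrete, the following C++ sketch builds a minimal Uniform Grid. It assumes, for illustration, a cubic scene bounding box anchored at the origin and inserts each primitive by a single reference point; a real build would insert a triangle into every voxel its bounding box overlaps.

```cpp
#include <vector>

// Sketch of the Uniform Grid spatial subdivision: the scene bounding
// box is split into nx*ny*nz equal voxels, and each voxel keeps the
// list of primitive ids assigned to it.
struct Point { double x, y, z; };

struct UniformGrid {
    int nx, ny, nz;
    double size;                          // scene occupies [0, size)^3
    std::vector<std::vector<int>> cells;  // per-voxel primitive lists

    UniformGrid(int nx_, int ny_, int nz_, double size_)
        : nx(nx_), ny(ny_), nz(nz_), size(size_), cells(nx_ * ny_ * nz_) {}

    // Map a point to the linear index of the voxel containing it.
    int voxelOf(const Point& p) const {
        int ix = (int)(p.x / size * nx);
        int iy = (int)(p.y / size * ny);
        int iz = (int)(p.z / size * nz);
        return ix + iy * nx + iz * nx * ny;
    }

    // Insert a primitive by a single reference point (simplification).
    void insert(int primId, const Point& refPoint) {
        cells[voxelOf(refPoint)].push_back(primId);
    }
};
```

With the 4x4x1 subdivision of Fig. 1a and a unit-cell layout, a point in the second row and second column lands in voxel v5, matching the figure's indexing.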
3
Traditional Parallel GPGPU Ray-Tracer
The ray tracing algorithm exhibits parallelism at the task level. Each ray can be processed independently of the others, and each one can be assigned to an execution process or thread across different computing nodes or processing elements. So, one ray is going to be processed as a task, producing one pixel at the end of the computation. The main idea is to spread tasks (rays) across different processes. It is also possible to assign a group of tasks per process, instead of only one task. In the end, each task or group of tasks will produce the color of one or more pixels of the final image, respectively. Modern general-purpose GPUs are capable of executing many thousands of threads in parallel [6], achieving peaks of 1 TFLOPS or more. Thus, in modern GPGPUs, each thread can be assigned to a primary ray that crosses a pixel of the viewplane. The result is that a portion of the final image is going to be produced by each block of threads (one pixel per thread). The number of blocks of threads corresponds to the number of subdivisions the image is split into, which corresponds to distributing primary rays among threads. The corresponding CUDA kernel is presented in Algorithm 2, considering that all data transfers between the host and the GPGPU have already been performed. Note that this algorithm does not have the loop construction presented in lines 5 and 6 of the sequential version in Algorithm 1, because now each ray has been assigned to a thread of a block of threads. Every block of threads has its own identifier, as does every thread. In that way each thread can access its own data to process. So, in Algorithm 2, the given thread uses its own identifiers to select the corresponding ray that
Algorithm 2. Traditional parallel GPGPU ray tracer CUDA kernel
1: int i = blockDim.x * blockIdx.x + threadIdx.x;
2: int j = blockDim.y * blockIdx.y + threadIdx.y;
3: ray = rays[i][j];
4: color = trace(3-D scene, ray, depth);
5: image[i][j] = color;
will be traced, resulting in one pixel color. Depending on the configuration set at kernel launch, the identifiers can have up to three coordinates. In the case of Algorithm 2, only two coordinates are used (i, j), because the data (primary rays) is organized in two dimensions. In the end, the whole image will have been produced by parallel threads that each processed one primary ray, together with any secondary rays that may have been generated at each intersection.
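The index arithmetic of lines 1 and 2 of Algorithm 2 can be checked on the host by simulating the CUDA block/thread grid with plain loops. The helper names below are illustrative, not part of the CUDA API:

```cpp
#include <utility>
#include <vector>

// Sketch of the Algorithm 2 index arithmetic: each CUDA thread derives
// its pixel coordinate (i, j) from its block and thread identifiers.
// GPU scheduling is simulated with host loops, so the mapping itself
// can be verified without a device.
struct Dim { int x, y; };

// Global index for one axis, exactly as lines 1-2 of Algorithm 2.
inline int globalIndex(int blockDim, int blockIdx, int threadIdx) {
    return blockDim * blockIdx + threadIdx;
}

// Enumerate every (i, j) pair a grid of blocks would produce.
std::vector<std::pair<int,int>> enumeratePixels(Dim gridDim, Dim blockDim) {
    std::vector<std::pair<int,int>> pixels;
    for (int bx = 0; bx < gridDim.x; ++bx)
        for (int by = 0; by < gridDim.y; ++by)
            for (int tx = 0; tx < blockDim.x; ++tx)
                for (int ty = 0; ty < blockDim.y; ++ty)
                    pixels.push_back({globalIndex(blockDim.x, bx, tx),
                                      globalIndex(blockDim.y, by, ty)});
    return pixels;
}
```

A 2x2 grid of 8x8 blocks, for example, covers exactly the 16x16 pixel coordinates, one thread per pixel.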
4
The GridRT Parallel Architecture
Before explaining the GridRT implementation in GPGPU, we describe the GridRT parallel architecture itself, which can be implemented on any kind of multiprocessor system, such as Clusters, Chip-Multiprocessors, GPGPUs or custom parallel designs in FPGA. The GridRT architecture is strongly based on the Uniform Grid structure. In this spatial subdivision scheme the 3-D scene is split into regions of equal size, called voxels. Each voxel has a list of the primitives (triangles) that are completely or partially inside it. Thus, only those voxels that are pierced by a ray are going to be sequentially accessed for intersection tests, from the voxel closest to the ray origin to the furthest. Therefore, if an intersection is found, no more tests are required for the given ray, because the intersection found is already the closest to the ray origin. In the example depicted in Fig. 1a, three voxels were accessed until an intersection t1 was found in voxel v5. The GridRT parallel model, on the other hand, maps each voxel onto a Processing Element (PE), as depicted in Fig. 1b. So, intersection tests are performed in
[Figure: (a) a 4x4x1 Uniform Grid with voxels v0–v15, a ray entering at its origin and an intersection t1 found in voxel v5; (b) the parallel GridRT model with one Processing Element PE0–PE15 per voxel and two candidate intersections t1 and t2; (c) the GridRT-GPGPU mapping with one block of threads B0–B15 per voxel.]

Fig. 1. Sequential Uniform Grid, Parallel GridRT model and GridRT in GPGPU
parallel by those PEs that are pierced by a ray and, because of that, it becomes necessary to decide which PE holds the result that is closest to the ray origin. At first sight, one solution is to exchange the results between the PEs, but this would require every PE to wait for the others to finish their computation on the given ray before deciding which one holds the correct result. Thus, the GridRT parallel model uses the order of traversal of each ray to determine the correct result. For instance, in Fig. 1b, PE5 and PE6 have each found an intersection, t1 and t2 respectively. According to the ray traversal order, PE5 is closer to the ray origin. Thus, PE5 may send an interrupt message to the next PE in the list, so that it can abort its computation and forward the interrupt message to the next one, until every following PE is aborted. The computation is now reduced to the remaining PEs. If one of them finds an intersection within its piece of scene data, it can also proceed in the same way, sending interrupt messages to the following PEs in the list. Otherwise, if none of them finds an intersection, a feedback message is sent from the first to the last remaining PE. Such a message is used to ensure that none of the previous PEs in the list has found an intersection. Then, either the remaining PE holds the correct result, like PE5 in Fig. 1b, or none of them does. Note that each PE needs to communicate such messages to its direct neighbors, and how it does so depends on the target architecture on which the parallel model is implemented. For example, if each PE is mapped onto a process running on a different computation node, the messages can be exchanged via the Message Passing Interface (MPI) [7]. But if the target architecture is an FPGA parallel design, then neighboring PEs can be connected by interrupt signals. Further details on the GridRT architecture and its communication model can be found in [8].
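The net effect of the interrupt and feedback messages is to select the first PE, in traversal order, that found a hit. That decision rule can be condensed into a small host-side sketch (a simulation of the outcome, not of the actual message passing):

```cpp
#include <map>
#include <vector>

// Sketch of the decision rule the GridRT interrupt/feedback messages
// implement: among the PEs pierced by a ray, the valid intersection is
// the one held by the earliest PE in the traversal order, regardless
// of what the later PEs found. Returns the winning PE id, or -1 if no
// PE found an intersection (the feedback case).
int resolveByTraversalOrder(const std::vector<int>& traversalOrder,
                            const std::map<int, double>& hits) {
    for (int pe : traversalOrder)       // front of the list = closest voxel
        if (hits.count(pe)) return pe;  // first hit wins; later PEs abort
    return -1;
}
```

For the ray of Fig. 1b, where both PE5 and PE6 report hits, the traversal order makes PE5 the winner even without comparing the numeric distances.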
5
GridRT-CUDA Parallel Ray-Tracer
Following the GridRT parallel architecture presented in the previous section, the GridRT-CUDA implementation maps each voxel onto a block of threads, as depicted in Fig. 1c. Thus, every block of threads performs intersection tests along a ray, in parallel. In addition, the intersection tests are performed in parallel inside each block of threads. Two levels of parallelism are therefore exhibited in this organization: the first is task-level parallelism, while the second is data-level parallelism. For instance, if a given block has n triangles and n or more threads at its disposal, then the intersection tests are performed in parallel by the threads of the block. Otherwise, parallel intersection tests can be performed in chunks of n triangles by n threads. In this work, the 3-D scene is such that there are always enough threads to process all the triangles in parallel inside the block, as will be shown in Section 6. However, in order to determine the correct result among the blocks of threads, a different approach from the one presented in Section 4 had to be developed, because threads from different blocks cannot coordinate their activities. Only threads inside the same block can coordinate their activities, through a Shared Memory. Thus, a given block of threads cannot inform the next block in the traversal list about its computation results.
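The chunked scheme mentioned above, n threads sweeping the triangle list n triangles at a time, can be sketched on the host as follows. Thread scheduling is simulated sequentially, and precomputed hit distances stand in for the actual ray/triangle tests:

```cpp
#include <algorithm>
#include <vector>

// Sketch of chunked data-level parallelism: numThreads "threads" test
// triangles in rounds, thread tid handling triangle (base + tid) in
// each round, keeping a per-thread minimum (the shared res[] array).
// A final reduction yields the closest hit. triDist[k] is the hit
// distance for triangle k, with a large value meaning "no hit".
float chunkedMinIntersection(const std::vector<float>& triDist,
                             int numThreads) {
    std::vector<float> perThread(numThreads, 1e30f);  // shared results
    for (size_t base = 0; base < triDist.size(); base += numThreads)
        for (int tid = 0; tid < numThreads; ++tid)    // one round of threads
            if (base + tid < triDist.size())
                perThread[tid] =
                    std::min(perThread[tid], triDist[base + tid]);
    // Reduction over the shared array, done by a single thread on GPU.
    return *std::min_element(perThread.begin(), perThread.end());
}
```

When there are at least as many threads as triangles, the outer loop runs once and the scheme degenerates to the fully parallel case used in this work.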
Algorithm 3. GridRT-CUDA Kernel
1: __shared__ float res[max. number of triangles per voxel];
2: foreach ray i do
3:   foreach element j of the corresponding traversal list do
4:     if this blockId is in the list then
5:       if there are triangles in this block then
6:         res[threadIdx.x] = intersectTriangles(ray, vertices[threadIdx.x]);
7:       __syncthreads();
8:       if threadIdx.x = 0 then
           /* Finds the smallest result */
           /* Copy the result to global memory */
Therefore, we let the host processor determine the correct result at the end of the whole computation, again according to the order of traversal presented in Section 4. Hence, each ray has an array of results associated with it, and the size of the array corresponds to the maximum number of PEs, i.e. blocks, that can be traversed by a given ray. The size is given by the total number of subdivisions applied to each of the three axes (nx, ny, nz) of the GridRT spatial structure, as defined in Eq. 1. For instance, considering the grid of Fig. 1c, the maximum size of the array is N = 7, since the uniform grid subdivision is nx = 4, ny = 4 and nz = 1.

N = nx + (ny − 1) + (nz − 1)    (1)
When a block of threads has finished the intersection checks with respect to its corresponding voxel, the result is stored in the array at the entry associated with that block. Thereafter, the block can proceed with the computation of a different ray, which also has a different array of results associated with it. In the end, the matrix of results is copied from the GPU to the host processor, so it can proceed with further analysis of the results. Each row of the matrix corresponds to the array of results computed for a given ray, while each column contains the results computed by one block. The algorithm executed by a block of threads is shown in Algorithm 3. Each block of threads takes as input an array of rays, in which each row has an array of results associated with it, thus yielding a matrix. The 3-D scene is copied to the GPU before the kernel execution. The scene is stored according to the uniform grid structure as a one-dimensional array, in which each position points to the list of triangles belonging to the corresponding voxel (i.e. block of threads). Once the necessary data has been copied to the GPU, the kernel is launched. According to Algorithm 3, the first step taken by each block is to declare an array of shared results, as in line 1. This shared array is used to store the results of the parallel intersection tests, as in line 6. For each ray in the input data, the block will search for its own identifier in the traversal list, as in lines 3 and 4. Then, if there are any triangles in the block, parallel intersection tests are performed by the threads. Finally, one of the threads (the
Table 1. GridRT-CUDA kernel execution times in GTX 460 and GTX 465

Blocks of threads   1     2     4     8     12    18    27    64    125   216
GridRT GTX 460      -     -     1.69  0.87  0.92  0.97  1.07  2.39  3.85  7.03
GridRT GTX 465      -     -     1.34  0.63  0.61  0.59  0.55  1.20  2.41  4.02
TradRT GTX 465      0.94  0.70  0.47  0.35  0.28  0.23  0.21  0.17  0.14  0.12

*All times are in seconds. Low-res Stanford Bunny 3-D scene.
one with identifier zero) searches for the smallest result (that is, the intersection closest to the ray origin with respect to that block) in the array of shared results, as in line 8.
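The host-side analysis of the result matrix can be sketched as follows. The NO_HIT sentinel, the row layout and the helper names are illustrative assumptions, but the scan follows the traversal-order rule of Section 4 and the array length follows Eq. (1):

```cpp
#include <vector>

// Sketch of GridRT-CUDA host-side post-processing: the result matrix
// has one row per ray, one column per traversal-list entry. The host
// scans each row in traversal order and keeps the first valid entry,
// which by construction is the hit closest to the ray origin.
const float NO_HIT = -1.0f;  // sentinel for an empty result slot

// Maximum traversal-list length for an (nx, ny, nz) grid, per Eq. (1).
int maxTraversalLength(int nx, int ny, int nz) {
    return nx + (ny - 1) + (nz - 1);
}

// Scan one row of the result matrix; returns the winning hit distance
// or NO_HIT if no block along the ray found an intersection.
float resolveRow(const std::vector<float>& row) {
    for (float t : row)
        if (t != NO_HIT) return t;  // earliest block in traversal order wins
    return NO_HIT;
}
```

For the 4x4x1 grid of Fig. 1c this gives rows of at most N = 7 entries, and the first filled entry of a row decides the pixel even if a later block stored a numerically smaller distance.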
6
Results
In this section we present comparison results for our GridRT-CUDA implementation on two different Nvidia GPUs (GTX 460 and GTX 465), as well as results for our traditional parallel ray tracer in GPGPU. These results are summarized in Table 1. The execution times for configurations of 1 and 2 blocks of threads are not available for the GridRT implementation, because the execution was terminated due to the kernel execution timeout. A second dedicated GPGPU graphics card could have been used to avoid this limitation; otherwise, the same GPU has to be shared with the host Operating System applications and thus cannot execute long-running CUDA kernels (beyond tens of seconds). Also because of the kernel execution timeout limitation, we could not use higher-resolution 3-D scenes. As we can observe from Table 1, the GridRT-CUDA implementation achieves acceleration on both GPU models. However, the performance starts to degenerate when more than 8 blocks of threads are used on the GTX 460, or more than 27 blocks of threads on the GTX 465. The latter scales better because it has 11 Streaming Multiprocessors (SMs), while the former has 7 SMs. In essence, a block of threads is executed by one SM: the more SMs are available, the more blocks of threads can be executed in parallel. The results from Table 1 for the GridRT-CUDA are depicted in Fig. 2a. If more SMs were available, more acceleration would likely be achieved. The kernel execution times for a traditional parallel ray tracer in CUDA are depicted in Fig. 2b, together with those of the GridRT-CUDA. The traditional parallel ray tracer implementation uses a different approach: parallelism is employed at the ray (task) level only. Thus, blocks of threads are not mapped to voxels of the uniform grid subdivision. Instead, blocks of threads are mapped to groups of primary rays that are traced in parallel, as presented in Section 3.
So, each thread is in fact processing an independent primary ray and the corresponding secondary and shadow rays that may be spawned by the algorithm. In the end, each thread produces the color of an individual pixel of the final image. From Table 1 and Fig. 2b, it is clear that this version of ray
(a) GridRT-CUDA kernel execution time in GTX 460 and GTX 465.
(b) Traditional parallel CUDA ray tracing compared to GridRT-CUDA.
Fig. 2. Execution time results and comparisons
tracing scales almost linearly with the number of blocks of threads. The explanation for such acceleration also lies in the GPGPU architecture itself: if two or more threads are going to execute through different branches, they are serialized [3]. Hence, we can see from the GridRT-CUDA (Algorithm 3) that there are several possible branches of execution, which can lead to serialization of threads. For that reason, a custom parallel design in FPGA is preferable, because the architecture can be designed according to the characteristics of the application. For instance, although the execution time in [8] is higher, the acceleration is much higher as more processing elements can be fit into the FPGA.
7
Conclusion
In this paper, two different implementations of parallel ray tracing were discussed: the GridRT-CUDA implementation and a traditional CUDA parallel ray tracer. These two implementations were analyzed and compared with regard to performance. The GridRT-CUDA implementation achieves acceleration up to 27 blocks of threads on an Nvidia GTX 465 GPU and up to 8 blocks of threads on an Nvidia GTX 460 GPU. Beyond that point, the performance degenerates, especially because of the Streaming Processor model, which is not well suited to applications that exhibit many branches of execution, such as the GridRT architecture; several threads were serialized. The performance also degenerates because many blocks of threads have to compete for the GPU's limited execution resources. A more powerful GPU is likely to achieve higher acceleration for even more blocks of threads. Compared to the traditional GPGPU ray tracer, the GridRT-CUDA performance is not good. However, since the GPGPU implementation introduces more hardware overhead than a custom hardware design (ASIP-based ASIC implementation), the custom hardware implementation is expected to have lower area and power consumption, as well as better performance.
References

1. Akenine-Möller, T., Haines, E., Hoffman, N.: Real-Time Rendering, 3rd edn. A.K. Peters, Ltd., Natick (2008)
2. Carr, N.A., Hall, J.D., Hart, J.C.: The ray engine. In: HWWS 2002: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pp. 37–46. Eurographics Association, Aire-la-Ville (2002)
3. Fung, W.W.L., Sham, I., Yuan, G., Aamodt, T.M.: Dynamic warp formation and scheduling for efficient GPU control flow. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pp. 407–420. IEEE Computer Society, Washington, DC, USA (2007)
4. Govindaraju, V., Djeu, P., Sankaralingam, K., Vernon, M., Mark, W.R.: Toward a multicore architecture for real-time ray-tracing. In: Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 41, pp. 176–187. IEEE Computer Society, Washington, DC, USA (2008)
5. Havran, V., Prikryl, J., Purgathofer, W.: Statistical comparison of ray-shooting efficiency schemes. Technical report, Institute of Computer Graphics and Algorithms, Vienna University of Technology, Favoritenstrasse 9-11/186, A-1040 Vienna, Austria (2000)
6. Kirk, D.B., Hwu, W.-m.W.: Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco (2010)
7. Nery, A.S., Nedjah, N., França, F.M.G.: Two alternative parallel implementations for ray tracing: OpenMP and MPI. In: Mecánica Computacional, vol. XXIX, pp. 6295–6302. Asociación Argentina de Mecánica Computacional (2010)
8. Nery, A.S., Nedjah, N., França, F.M.G., Jozwiak, L.: A parallel architecture for ray-tracing with an embedded intersection algorithm. In: International Symposium on Circuits and Systems, pp. 1491–1494. IEEE Computer Society, Los Alamitos (2011)
9. Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A.E., Purcell, T.J.: A survey of general-purpose computation on graphics hardware. Computer Graphics Forum 26(1), 80–113 (2007)
10. Suffern, K.: Ray Tracing from the Ground Up, 1st edn. A.K. Peters, Ltd., Natick (2007)
11. Wald, I., Ize, T., Kensler, A., Knoll, A., Parker, S.G.: Ray tracing animated scenes using coherent grid traversal. In: SIGGRAPH 2006: ACM SIGGRAPH 2006 Papers, pp. 485–493. ACM, New York (2006)
12. Whitted, T.: An improved illumination model for shaded display. Commun. ACM 23(6), 343–349 (1980)
13. Woop, S., Schmittler, J., Slusallek, P.: RPU: a programmable ray processing unit for realtime ray tracing. In: SIGGRAPH 2005: ACM SIGGRAPH 2005 Papers, pp. 434–444. ACM, New York (2005)
Cascading Multi-way Bounded Wait Timer Management for Moody and Autonomous Systems

Asrar Ul Haque1 and Javed I. Khan2

1 College of Computer Science and Information Tech., King Faisal University, Al-Ahsa 31982, Kingdom of Saudi Arabia
[email protected]
2 Media Communications and Networking Research Laboratory, Department of Math & Computer Science, Kent State University, 233 MSB, Kent, OH 44242
[email protected]

Abstract. Timer management is one of the central issues in addressing the 'moody' and autonomous characteristics of the current Internet. In this paper we formalize the multi-way bounded wait principle for a 'moody' and autonomous environment. We propose an optimum scheme and compare it with a set of generalized heuristic-based timer management schemes recommended for the harness, a distributed communication and computation system for moody and autonomous environments.

Keywords. Optimum timeout scheme, Timeout heuristics, Grid Computing, P2P search, Web service.
1 Introduction

Any distributed system with millions of components must learn to operate with incomplete information. This is becoming the case for various distributed systems operating over the Internet. A classic example is the search for service discovery [1-2]. Such a distributed search is quite different from conventional distributed algorithms. A particularly unique characteristic of such a search is that it is never complete. The search propagates via millions of other nodes from a source to the entire network, as illustrated in Fig. 1. While it is ideal to expect that answers will arrive from a sweep covering all the nodes in the network, almost always that is not the case. A search must learn to adapt and work with an imperfect sweep. An interesting question faced by this set of distributed algorithms is how to maximize the quality of the result without waiting an inordinate amount of time. The network-based distributed algorithms run in an environment exhibiting various manifestations of this inherent unreliability, such as dead-beat nodes, unreliable or busy peers, missing messages, authentication failures, intentional non-cooperation, selective cooperation, etc. We will call this a Moody and Autonomous Environment (MAE). The classical communication layer has handled only limited aspects and forms of this unreliability. Various schemes such as error-resilience coding and retransmission-based transport essentially address how a communication layer can best try to
Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 24–32, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Fig. 1. (Top) A client begins searching for items. The search then propagates through thousands of other nodes. Each plays a role in forwarding/routing results back to the root of the request. (Bottom) The root peer waits for the results.
create a 'perfect' and fault-free notion of services for mostly point-to-point communication. Classical network algorithms running on top of such transport thus assume that the imperfection can be sealed off at the lower layers and that they can operate over a virtual perfection. Unfortunately, this assumption of virtual perfection does not always hold in the emerging distributed MAE. All natural communication systems for the MAE use the principle of bounded wait to address the moodiness property. An entity in the MAE, while running a distributed algorithm, faces the dilemma of how long it should wait for some other entity: if it waits too long it delays the overall computation completion time, or it may even miss the timer deadline of its parent. Conversely, if it waits too short a period then it may miss communication from many of its children, which deteriorates the overall quality of computation. This paper formalizes the above multi-way bounded wait
principle for general distributed algorithms in an MAE. We suggest an optimum scheme and compare it with a set of generalized heuristic-based timer management schemes. The solution is applicable to any schema of multi-way communication, whether inside a new protocol, at a middle layer, or as a pure application-level effort. In this paper we present the proposed timer management scheme within a formal framework of multi-way-communication-based general distributed computing, which we call harness [3]. The harness is a network computing framework with a reusable multi-way communication primitive designed to operate in an MAE. The harness makes a separation between the communication and the information parts of the data exchange. A set of six plug-ins then allows the computation and communication parts to be programmed separately: the messaging pattern and the message content can be programmed independently of each other. Essentially, a set of pre-programmed communication patterns can be reused with another set of message synthesis procedures. The harness has been shown to solve various network algorithms; details can be found in [3]. The paper is arranged in the following way. First, in Section 2 we provide a brief overview of various interesting related work. We then formalize the multi-way bounded wait problem and solve it to set the timer optimally in Section 3. For comparison, we also provide a set of plausible heuristics. Finally, in Section 5 we provide a detailed performance comparison of the heuristics and the optimum schemes.
2 Related Work As the problem is increasingly becoming real, recently, various timeout schemes have been proposed for a range of large scale network applications. Network Weather Service [4]-a Grid performance data management and forecasting service, is one the first to try dynamic timeout in forecast computing and noted substantial improvement over static scheme. A RTO (retransmission timeout) selection has been proposed [5] for multimedia to achieve the optimal tradeoff between the error probability and the rate cost in order to improve throughput. Timeout strategy has been proposed associating costs for waiting time and retransmission attempts [6] where the timeout value was set to minimize the overall expected cost. The goal of this work was to reduce the number of retransmission attempts. In [7] a scheme was proposed for Deterministic Timeouts for Reliable Multicast (DTRM) to avoid negative feedback explosion [7]. DTRM ensures that retransmission caused by only one NACK from a receiver belonging to a sub-tree arrives early enough so that the timers do not expire in the other receivers in that sub-tree. These recent schemes- have used various heuristics approximations, but are notable because of there pioneering role in the timeout management in multi-way communication in moody environment. The solution we propose assumes that the nodes have prior knowledge of link delay. Indeed, there has been considerable amount of work related to finding link-level delay. Network tomography [8-12] based on statistical estimation theory is an emerging field of characterizing various link level performance metrics of a network. Network tomography closely estimates network internal characteristics including delay distribution, loss, and delay variance by correlating end-to-end measurements for
multicast tree. However, a major limitation of Network tomography is that it focuses on multicast routing only, whereas the bulk of the traffic is unicast. This limitation has been overcome by estimating the delay distribution employing unicast, end-to-end measurements of back-to-back packets [13].
3 Multi-way Bounded Wait Principle A node during running a search algorithm might aggregate messages received from a set of downstream children and forwarding to a parent. As messages are received from its children the computation is more accurate since each message contains search results from some other nodes as well. A node faces the dilemma of how long it should wait for the message – if it waits too long it delays the overall computation completion time or even it may miss the timer deadline of its parent. Conversely, if it waits too short a period then it may miss communication from many of its children. Consequently, this deteriorates the overall quality of computation. In this section, the above multi-way bounded wait principle is formalized and an optimum scheme is suggested for setting timers for message generation and aggregation at the intermediate nodes and the terminals. In the following subsections, in order to generalize the problem formulation, we assume the timer value for message aggregation at the root is denoted by D (deadline) and the timer value for message generation of its child is represented by T. Furthermore, we use the notion of profit (denoted by ω ) to signify the total number of nodes pertaining to which search results have been accumulated in a message. 3.1 Formulation of the Problem Let, as shown in Fig. 2, node j has a parent k and a set of children nodes i={ i1, i2,… in}. Let ri x j (t ) and rk(t) be the probability distribution function of round trip time between
k
rkj (t ) j
ri1 j (t ) i1
rin j (t )
ri2 j (t ) i2
ωi j 1
ωi
in 2
j
ωi
Fig. 2. Optimal Timer Setting for Node j
n
j
28
A. Ul Haque and J.I. Khan
the nodes ix and j and, j and k where x={1..n}. Let
ωi
x
j
be the profit carried by a mes-
sage from node ix to j. Given D the timeout value of k, calculate the maximum value of expected profit from j to k and the corresponding timeout value Topt for j. 3.2 Generic Profit Function Formulation T
The question we now pose is how to maximize the profit-P(t). Let C (t )dt be the ∫ 0
D −T
∫ S (t )dt
total profit accumulation at j in time t and
be the probability of successfully
0
reaching the parent, k, in time (D-T) with the accumulated profit. So we show that the basket function is the product of profit accumulation and probability of successful delivery. T
D −T
0
0
P(t ) = ∫ C (t )dt
∫ S (t )dt
(1)
The profit accumulation is the summation of the product of the profit and the delay distribution of each child of j, i.e.,

∫_0^T C(t)dt = ∫_0^T Σ_{s∈i} ω_s^j r_{sj}(t)dt    (2)

∫_0^{D−T} S(t)dt = ∫_0^{D−T} r_k(t)dt    (3)

From (1), (2), and (3) we get

P(D) = ∫_0^T Σ_{s∈i} ω_s^j r_{sj}(t)dt · ∫_0^{D−T} r_k(t)dt    (4)
Fig. 3 illustrates the formulation of the profit function for node j as in Eq. 1. As the time T is increased, the area under r_{i_1 j}(t), r_{i_2 j}(t), …, r_{i_n j}(t) also increases, indicating a higher accumulation of profit C(t). However, the area under ∫_0^{D−T} r_k(t)dt, i.e. ∫_0^{D−T} S(t)dt, decreases as T increases, since D is fixed. Thus the possibility of reaching the parent node k with the accumulated profit ∫_0^T C(t)dt diminishes as T is increased. The product of ∫_0^{D−T} S(t)dt and ∫_0^T C(t)dt is the total profit accumulated at node k in time T.
3.3 Solution for Optimal Timer Setting

The optimum time, T_opt, and the corresponding maximum profit, P_max, are calculated based on the following assumptions:
• The delay distributions of the parent and children of j are Normal.
• The delay distributions of the children of j are independent of each other.
Fig. 3. Formulation of Profit Function
• Node j generates a message for its parent even if no message is received from its children before timeout occurs at j.

Further simplification of Eq. 4 is beyond the scope of this paper. However, it can be noted that P(D) in Eq. 4 is a positive, bounded, and continuous function of T, and as T → ±∞ it goes to zero. Therefore, a global maximum of P(D) and the corresponding T, denoted by T_opt, must exist, and must satisfy the equation

dP(D)/dT = 0.
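Under these assumptions the maximizing T can also be located numerically. The sketch below evaluates Eq. 4 directly and grid-searches for T_opt; the child profits, the Normal delay parameters, and the deadline D are all invented for illustration, not taken from the paper's experiments.

```python
# Numerical search for T_opt of Eq. 4 under the Normal-delay assumptions.
# All parameters (profits, means, sigmas, deadline D) are illustrative.
from math import erf, sqrt

def ncdf(t, mu, sigma):
    # P(delay <= t) for Normal(mu, sigma); for the parameters used here
    # essentially no mass lies below 0, so this approximates the pdf
    # integral from 0 to t.
    return 0.5 * (1.0 + erf((t - mu) / (sigma * sqrt(2.0))))

children = [(5, 1.0, 0.3), (3, 1.5, 0.4), (8, 2.5, 0.6)]  # (w_s^j, mu, sigma)
mu_k, sigma_k = 1.2, 0.3   # Normal delay of the link j -> parent k
D = 4.0                    # deadline (timeout value) at parent k

def profit(T):
    # Eq. 4: profit accumulated by time T, times the probability of
    # reaching k within the remaining (D - T).
    c = sum(w * ncdf(T, mu, s) for (w, mu, s) in children)
    return c * ncdf(D - T, mu_k, sigma_k)

# Grid search as a stand-in for solving dP(D)/dT = 0 analytically.
ts = [i * D / 1000 for i in range(1001)]
t_opt = max(ts, key=profit)
print(f"T_opt ~ {t_opt:.2f}, P_max ~ {profit(t_opt):.2f}")
```

The tradeoff is visible in the two factors: pushing T up grows the accumulated-profit factor but shrinks the delivery-probability factor, exactly as the Fig. 3 discussion describes.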
4 Simulation In this section the performance of the optimal scheme is partially presented. To put the performance of the optimum scheme into perspective, we have also constructed a number of timer-setting heuristics and picked the five best performing ones. We present the optimum scheme in the context of these intuitive solutions. These heuristics are shown in Table 1 and are discussed in [3]. The marginal constants and factors used in the simulation for the heuristics were manually selected after extensively examining the performance of the individual heuristics in various network scenarios. We assume φ, β, α, ρ, ξ, σ, and λ to be 40000, 10000, 10, 6.5, 10, 3000, and 1.5, respectively.
4.1 Measuring Parameters One important parameter of interest is the completion time (CT), the time elapsed between the root generating the first request and receiving and processing all the replies. However, completion time alone cannot justify the use of a heuristic. To better compare the various heuristics, we define capturing efficiency (CE) as the ratio of the number of nodes from which information has been collected at the root to the total number of responsive nodes. Let N_m be the overall count of nodes from which information has propagated to the root node, N_t the total number of nodes, and N_NRN the number of NRN. Then,

CE = N_m / (N_t − N_NRN)    (5)
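As a worked instance of Eq. 5: the CE of 0.72 reported for MCD at 10,000 nodes with 0.2% NRN corresponds to roughly 7186 captured nodes (the captured count here is back-derived for illustration, not taken from the paper's data):

```python
def capturing_efficiency(n_m, n_t, n_nrn):
    """CE per Eq. (5): nodes captured at the root over responsive nodes."""
    return n_m / (n_t - n_nrn)

# 10000 nodes, 0.2% non-responsive (NRN = 20), 7186 captured at the root
print(round(capturing_efficiency(7186, 10000, 20), 2))  # -> 0.72
```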
4.2 Impact of Size of Network Figs. 4 and 5 illustrate the impact of network size on the heuristics with respect to CT and CE, respectively. The CT for the optimal scheme for 2500, 5000 and 10000 nodes is 10.11, 13.54, and 12.9 seconds, respectively. For MCD, CT increases by 2.3 s as the graph size increases from 2500 to 10000 nodes, whereas for the other heuristics the increase is noticeable. However, CE is not 1 for all three graph sizes for MCD. For 10000 nodes, CE is only 0.72. The optimal scheme has CE = 1.0 for all three graph sizes.
Table 1. Various Timeout Heuristics

Scheme                                                      Formula
Heuristic MCD (Fixed Margin over Cumulative Delay (CRTT))   T_i^h = η_{i−1}^h + β, where β ≠ f(i, L, RTT)
Heuristic PCD (Proportionate Cumulative RTT)                T_i^h = α·η_{i−1}^h, where α ≠ f(i, L, RTT) and α > 1
Heuristic PLD (Proportionate Level over RTT)                T_{i−1}^h = RTT_i^k + γ_i, where γ_i = ρ·(L − i)·RTT_{i−1}^k and ρ > 1
Heuristic PDT (Proportionate RTT over Fixed Child Timer)    T_{i−1}^h = ξ·RTT_i^k + T_i^k, where ξ ≠ f(i, L, RTT) and ξ > 1
Heuristic MDT (Fixed Margin RTT over Fixed Child Timer)     T_{i−1}^h = RTT_i^k + T_i^k + σ, where σ ≠ f(i, L, RTT) and σ > 1
Optimal Scheme                                              T_opt = (1/2)(L + μ_j − μ_k) + √(π/2)·(σ_k − σ_j)

Fig. 4. Impact of Size of Graph on CT (NRN Loc = Terminal, NRN = 0.2%, α = 2.3)

Fig. 5. Impact of Size of Graph on CE (NRN Loc = Terminal, NRN = 0.2%, α = 2.3)
5 Conclusions Timer management is the natural technique to address the ‘unreliability’ posed by any MAE entity, which is inherently different from the unreliability handled by TCP. We formalized the multi-way bounded wait principle for MAE to respond to this sort of ‘unreliability’. We introduced the notion of the lower bound of completion time and the upper bound of capturing efficiency. In this paper we used completion time and capturing efficiency to compare the optimal scheme with several heuristics proposed for the harness, and showed the better performance of the optimal scheme. We have shown that the optimal scheme outperforms the other heuristics in various network conditions. Among the heuristics the most promising is MCD. However, a major concern for MCD is that its performance degrades with the size of the network, whereas the optimal scheme scales well with network size.
References
1. Meshkova, E., Riihijärvi, J., Petrova, M., Mähönen, P.: A survey on resource discovery mechanisms, peer-to-peer and service discovery frameworks. Computer Networks 52(11) (August 2008)
2. Ahmed, R., Boutaba, R.: A Survey of Distributed Search Techniques in Large Scale Distributed Systems. IEEE Communications Surveys & Tutorials 13(2) (May 2011)
3. Khan, J.I., Haque, A.U.: Computing with data non-determinism: Wait time management for peer-to-peer systems. Computer Communications 31(3) (February 2008)
4. Allen, M.S., Wolski, R., Plank, J.S.: Adaptive Timeout Discovery using the Network Weather Service. In: Proceedings of HPDC-11 (July 2002)
5. Zhan, J.C.W., He, Z.: Optimal Retransmission Timeout Selection for Delay-Constrained Multimedia Communications. In: International Conference on Image Processing, ICIP 2004, October 24-27, vol. 3, pp. 2035–2038 (2004), doi:10.1109/ICIP.2004.1421483
6. Libman, L., Orda, A.: Optimal retrial and timeout strategies for accessing network resources. IEEE/ACM Transactions on Networking 10(4), 551–564 (2002)
7. Grossglauser, M.: Optimal deterministic timeouts for reliable scalable multicast. In: IEEE Infocom 1996, pp. 1425–1441 (March 1996)
8. Bu, T., Duffield, N., Lo Presti, F., Towsley, D.: Network tomography on general topologies. In: Proc. of ACM SIGMETRICS (2002)
9. Duffield, N.G., Lo Presti, F.: Multicast Inference of Packet Delay Variance at Interior Network Links. In: Proc. Infocom 2000, Tel Aviv, Israel (March 26-30, 2000)
10. Adams, A., Bu, T., Caceres, R., Duffield, N., Friedman, T., Horowitz, J., Lo Presti, F., Moon, S.B., Paxson, V., Towsley, D.: The use of end-to-end multicast measurements for characterizing internal network behavior. IEEE Communications Magazine (May 2000)
11. Lo Presti, F., Duffield, N.G., Horowitz, J., Towsley, D.: Multicast-Based Inference of Network-Internal Delay Distributions. Preprint, AT&T Labs and University of Massachusetts (1999)
12. Bu, T., Duffield, N.G., Lo Presti, F., Towsley, D.: Network tomography on general topologies. ACM SIGMETRICS (June 2002)
13. Coates, M.J., Nowak, R.: Network Delay Distribution Inference from End-to-end Unicast Measurement. In: Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (May 2001)
14. Duffield, N.G., Horowitz, J., Lo Presti, F., Towsley, D.: Network delay tomography from end-to-end unicast measurements. In: Palazzo, S. (ed.) IWDC 2001. LNCS, vol. 2170, pp. 576–595. Springer, Heidelberg (2001)
World-Wide Distributed Multiple Replications in Parallel for Quantitative Sequential Simulation

Mofassir Haque¹, Krzysztof Pawlikowski¹, Don McNickle², and Gregory Ewing¹

¹ University of Canterbury, Department of Computer Science, Christchurch 8140, New Zealand
² University of Canterbury, Department of Management, Christchurch 8140, New Zealand
[email protected] {Krys.Pawlikowski,Don.McNickle,Greg.Ewing}@canterbury.ac.nz
Abstract. With the recent deployment of global experimental networking facilities, dozens of computer networks with large numbers of computers have become available for scientific studies. Multiple Replications in Parallel (MRIP) is a distributed scenario of sequential quantitative stochastic simulation which offers significant speedup of simulation if it is executed on multiple computers of a local area network. We report results of running MRIP simulations on PlanetLab, a global overlay network which can currently access more than a thousand computers in forty different countries round the globe. Our simulations were run using Akaroa2, a universal controller of quantitative discrete event simulation designed for automatic launching of MRIP-based experiments. Our experimental results provide strong evidence that global experimental networks, such as PlanetLab, can efficiently be used for quantitative simulation, without compromising speed and efficiency. Keywords: Multiple Replications in Parallel, Experimental networking facilities, Akaroa2, PlanetLab, Sequential quantitative stochastic simulation, Open queuing network.
1 Introduction Quantitative stochastic simulation of a complex scenario can take hours or days to complete. SRIP (Single Replication in Parallel) and MRIP (Multiple Replications in Parallel) are two methods used to reduce simulation time. In SRIP, the simulation program is divided into smaller logical parts which run on different computers. In MRIP, multiple processors run their own replications of the sequential simulation, but cooperate with central analyzers (one central analyzer for each performance measure analyzed) that are responsible for analyzing the results and stopping the simulations when the specified level of accuracy is met [1]. The MRIP technique can significantly speed up simulation if replications are launched on a large homogeneous set of computers [2, 3]. In the last few years, a large number of experimental networking facilities have been, or are being, developed across the globe: e.g. PlanetLab, GENI, OneLab, G-Lab, Akari, Panlab, etc. [4]. These global networks often consist of thousands of computers.
Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 33–42, 2011. © Springer-Verlag Berlin Heidelberg 2011
M. Haque et al.
Thus they provide a viable alternative for running distributed stochastic simulations in the Multiple Replications in Parallel (MRIP) scenario. We selected PlanetLab as the provider of distributed computing resources for investigating various aspects of MRIP simulations, since it is a continuously evolving computing platform with thousands of nodes [5]. These nodes can be easily accessed for running MRIP without investing in infrastructure. However, before using such a globally distributed networking facility for sequential stochastic simulation on multiple computers, factors such as the load at selected nodes and the potential communication overhead between them have to be carefully considered, as these computers can be shared by a large number of users and some of them are thousands of miles apart. The load generated by these users can vary significantly and quickly, and can thus adversely affect the performance of the computers and of the simulations running on them.
Fig. 1. PlanetLab with deployed nodes around the world [5]
We did extensive experimentation to determine the suitability of PlanetLab nodes for MRIP simulations. Our simulations were run with Akaroa2, a universal controller of quantitative discrete event simulation, designed for automatic launching of MRIP-based experiments. The experiments were designed to measure the times needed to produce final simulation results over various sets of PlanetLab computers. Results obtained from the experiments executed over PlanetLab nodes were compared with the results obtained from running MRIP simulations on a local area network at the University of Canterbury. This has allowed us to conclude that a global networking facility such as PlanetLab can be effectively utilized for running MRIP. The rest of the paper is organized as follows. Section 2 spells out the procedure for running Akaroa2 on PlanetLab. Section 3 explains the experimental setup and evaluation metric in detail. Section 4 presents experimental results, and conclusions are in Section 5.
2 Akaroa2 on PlanetLab In Akaroa2, multiple independent replications of a stochastic simulation are run on different processors, which play the role of independent simulation engines producing
World-Wide Distributed Multiple Replications
statistically equivalent output data during one simulation. The multiple simulation engines cooperate with the global analyzer, which processes the streams of output data coming from the different simulation engines and stops the simulation once the required accuracy of the results has been achieved. The accuracy is typically measured by the relative statistical error of the results. The two main processes of Akaroa2 are Akmaster and Akslave. The Akslave process initiates simulation engines on multiple processors, while Akmaster controls the sequential collection of output data and their analysis. It collects local estimates from all running Akslaves, calculates the final global estimates, displays the results, and terminates the simulation when the stopping criterion is reached [6]. Both steady-state simulations and terminating simulations are supported. In the former case, the procedures for sequential mean and variance analysis are described in [1, 7-8], while the procedure adopted for terminating simulation is presented in [2]. Akaroa2 is widely used for simulations executed on local area networks: its records of the last 10 years (in July 2011) show over 3100 downloads of the software by users from over 80 countries [9]. In order to run Akaroa2 on PlanetLab, we first need to copy and install Akaroa2 on all the nodes which will be used for running simulation engines. Copying and installing software on hundreds of machines is an intricate task. Either the CoDeploy program [10] provided by PlanetLab or simple shell scripts automating the copying, installation and running of Akaroa2 can be used. The shell script we used can be downloaded from the PlanetLab New Zealand web site [11]. For proper execution of an MRIP-based simulation, the path variable should be correctly set in the bash profile file of all participating PlanetLab nodes, and the simulation program should be copied into the directory specified in the path.
The detailed procedure, with step-by-step instructions for running Akaroa2 on PlanetLab using the Linux or Windows operating system, can be downloaded from the PlanetLab New Zealand web site [11].
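As a rough illustration of what such a deployment script has to do per node, the commands could be assembled as below. The slice name, host names, tarball name, and install/PATH commands are all placeholders, not the actual Akaroa2 distribution layout or the script published at [11].

```python
# Sketch of per-node deployment commands for pushing and installing a
# simulation package on PlanetLab nodes. All names below are hypothetical.
SLICE = "canterbury_akaroa"      # hypothetical PlanetLab slice (login) name
NODES = ["planetlab1.example.org", "planetlab2.example.org"]
TARBALL = "akaroa2.tar.gz"       # hypothetical distribution archive

def deploy_commands(node):
    """scp/ssh command lines to copy, build and PATH-register the package."""
    remote = ("tar xzf akaroa2.tar.gz && cd akaroa2 && ./configure && "
              "make install && "
              "echo 'export PATH=$PATH:$HOME/akaroa2/bin' >> ~/.bash_profile")
    return [["scp", TARBALL, f"{SLICE}@{node}:~/"],
            ["ssh", f"{SLICE}@{node}", remote]]

for node in NODES:
    for cmd in deploy_commands(node):
        print(" ".join(cmd))     # hand each list to subprocess.run(cmd) to execute
```

Appending the export line to `~/.bash_profile` mirrors the requirement above that the path variable be set on every participating node.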
3 Experimental Setup To study the feasibility of running Akaroa2 on PlanetLab, we conducted a large number of experiments, considering different strategies for selecting participating nodes of the network. The aim was to measure the times to produce simulation results, from the time instant when the simulation was launched until the time instant when the final results were obtained, to find out how using differently selected sets of PlanetLab nodes can affect users’ quality of experience, in comparison with simulations executed on local computers only. We compared two of many possible strategies for selection of PlanetLab nodes for MRIP simulations. We assumed that the computers are either distributed over a restricted geographical region (so they operate in the same or close time zones), or they are distributed globally (so computers participating in MRIP simulations work in very different time zones). 3.1 Computing Setup CS1 In this computing setting, while operating from New Zealand, we installed Akaroa2 on PlanetLab nodes spread over the European Union. The Akmaster was installed in
Italy and the simulation engines were located in France, UK, Belgium, Italy, Hungary and Poland. PlanetLab nodes were carefully selected using the CoMon utility [12] to avoid currently heavily loaded nodes. The CoMon utility is provided by PlanetLab for monitoring the resource utilization of all PlanetLab nodes. In CS1 our aim was to assess the response times of MRIP-based simulation experiments. The experiments were run on a Friday, beginning at 2pm British Standard Time. 3.2 Computing Setup CS2 In this computing environment, the simulation engines of Akaroa2 were installed worldwide, so they operated in very different time zones. Again, while operating from New Zealand, we installed the Akmaster in Italy, and the simulation engines were launched in Europe, USA, Canada, New Zealand and Asia; see Figure 2. Nodes were again carefully selected using the CoMon utility, avoiding heavily loaded nodes. This setup was used to study and verify the effect of communication overhead when simulation engines are thousands of miles apart. The experiments were run on a Friday, beginning at 2pm USA Central Standard Time.
Fig. 2. Global distribution of Akaroa2 in CS2
Note that the nodes of PlanetLab used by Akaroa2 represented a homogeneous set of computers, as the computers of PlanetLab have to satisfy some minimum common technical requirements. For comparing the quality of users’ experience in such distributed simulation environments, we have also measured the waiting times for simulation results in the traditional local processing environment of Akaroa2, where its simulation engines are located around a local area network. 3.3 Computing Setup CS3 Here, the simulation experiments were run on computers linked by a local area network in a computer laboratory of the Department of Computer Science and Software Engineering at the University of Canterbury, in Christchurch. Akmaster and Akslave were installed in this controlled local area network environment, the original home location of Akaroa2. The results were used as the reference for comparison with the results obtained from the two other, distributed computing environments.
The experiments were run on a Friday, beginning from 2pm, New Zealand time. The nodes of the local area network, physically located in one laboratory, constitute a homogeneous set of computers. Laboratory and PlanetLab nodes are equipped with quad processors and both use the Fedora operating system based on the Linux kernel. However, the computers available on PlanetLab are of slightly higher technical standard in terms of memory and clock frequency than those available in our CS3 setting. 3.4 Simulation Setting and Evaluation We ran the same sequential stochastic simulation in the MRIP scenario in all three computing setups: CS1, CS2 and CS3. For our study, we simulated a simple open queuing network, consisting of a CPU and two disk memories with unlimited buffer capacities, depicted in Figure 3. We estimated the steady-state mean response time (the mean time spent by a customer in this system), assuming that arriving customers form a Poisson process with λ = 0.033 tasks per second. All service times are exponentially distributed, with a mean service time at the CPU of 6 seconds and mean service times at Disk 1 and Disk 2 both of 14 seconds. This loads the servers at the CPU, Disk 1 and Disk 2 at 96%, 92.4% and 92.4%, respectively.
Fig. 3. Simulated open queuing network
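The sequential stopping rule used in these experiments (collect observations until the estimate's relative statistical error drops to the target at the chosen confidence level) can be sketched as below. This toy version assumes independent observations from a synthetic exponential stream (the mean of 150 and the checkpoint size are invented); Akaroa2's real analyzers additionally handle correlated simulation output, e.g. via spectral analysis [7].

```python
# Sketch of sequential stopping at 5% relative error, 0.95 confidence,
# on a synthetic stand-in observation stream (not the queuing model).
import random
from math import sqrt

def relative_error(xs, z=1.96):
    """Half-width of the ~95% confidence interval divided by the mean."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return z * sqrt(var / n) / mean

random.seed(1)
obs, checkpoint = [], 1000
while True:
    # collect one checkpoint's worth of synthetic response-time observations
    obs.extend(random.expovariate(1 / 150.0) for _ in range(checkpoint))
    if relative_error(obs) <= 0.05:   # required accuracy reached: stop
        break
print(f"stopped after {len(obs)} observations, mean ~ {sum(obs) / len(obs):.1f}")
```

In the MRIP scenario this check runs in the global analyzer, over the pooled output of all engines, which is why adding engines shortens the time to reach the stopping criterion.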
The simulation processes on all computers were to stop when the estimate of the steady-state mean response time reached a relative statistical error not greater than 5%, at a confidence level of 0.95. This should require about 20 million observations. Running the simulation in the Multiple Replications in Parallel scenario allowed us to collect this sample of output data faster, as it is produced by multiple simulation engines. To demonstrate that this attractive feature of MRIP remains practical also in the case of globally distributed simulation engines, we assessed the speedup and relative efficiency of MRIP simulations in setups CS1 and CS2, and compared the results with those from locally distributed simulation engines in CS3. The performance of our MRIP simulations was assessed by measuring the response time (RT) of a given simulation setting, defined as the time interval from the time of launching the simulation until
the time when the final results are delivered to the user. Then, the speedup of the simulation with P > 1 simulation engines can be found as

S(P) = Mean_RT(1) / Mean_RT(P)    (1)

where Mean_RT(P) is the mean response time of P simulation engines running an MRIP simulation, with P ≥ 1. Alternatively, we looked at the relative speedup of MRIP simulation, defined as

SR(P) = [Mean_RT(1) − Mean_RT(P)] / Mean_RT(1) × 100%    (2)
for P = 1, 2, 3, …. Note that, due to the truncated Amdahl law for MRIP formulated in [2, 3], there exists a limit on the number of processors which would increase the speedup of MRIP simulation. It is also known that the largest speedup can be obtained in homogeneous computing environments. In the extreme case, if one simulation engine uses a very fast processor and the remaining processors are slow, a simulation will not benefit from MRIP at all, as the fastest simulation engine can produce the entire sample of observations needed for stopping the simulation before any of the remaining slower simulation engines is able to reach its first checkpoint. Another performance measure which we considered is the efficiency of distributed processing during MRIP simulation, or the speedup per simulation engine:

E(P) = S(P) / P    (3)
In an ideal situation, the efficiency would be equal to one. However, in practical applications of parallel processing it is usually much smaller. E(P) measures how well the contributing processors are utilized for solving a given problem, despite their mutual communication and synchronization activities.
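Applying Eqs. (1)–(3) to the CS1 column of Table 1, with the 2-node run as the reference (as Table 2 does), reproduces the published speedups up to rounding:

```python
# Speedup, relative speedup and efficiency per Eqs. (1)-(3),
# using the CS1 mean response times (seconds) from Table 1.
mean_rt = {2: 88.13, 4: 61.94, 6: 52.25, 8: 45.23, 10: 39.80, 12: 34.32, 15: 27.14}
ref = mean_rt[2]                     # 2-node run taken as the reference
for p, rt in mean_rt.items():
    s = ref / rt                     # speedup, Eq. (1)
    sr = (ref - rt) / ref * 100      # relative speedup in %, Eq. (2)
    e = s / p                        # efficiency, Eq. (3)
    print(f"P={p:2d}  S(P)={s:.2f}  SR(P)={sr:5.1f}%  E(P)={e:.2f}")
```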
4 Experimental Results In this section, we present the experimental results obtained under computing setups CS1, CS2 and CS3. We use mean response time as the measure of quality for testing our hypothesis that the MRIP scenario can also be used efficiently in the case of world-wide distributed simulation engines. The mean response times obtained for CS1, CS2 and CS3, measured in seconds, are given in Table 1. Each reported result is an average over 10 independent measurements/simulations. The relative statistical errors of these estimates are not larger than 1% for CS3 and not larger than 6% for CS1 and CS2, at the 0.95 confidence level. Fig. 4 compares the mean response times of CS1, CS2 and CS3. The histogram clearly shows that the mean response time reduces as the number of nodes increases. The PlanetLab nodes are shared by a large number of users and are located hundreds of miles apart, whereas the laboratory nodes are each used by only one person and are located close to each other. The mean response times in the case of CS3 are therefore smaller than in the case of the PlanetLab nodes in both CS1 and CS2. In order to obtain good performance, PlanetLab nodes should be carefully selected, avoiding heavily loaded nodes and busy working hours.
Table 1. Mean response time (in seconds) for scenarios CS1, CS2 and CS3

Number of Nodes    CS1      CS2      CS3
2                  88.13    97.53    59.78
4                  61.94    75.48    47.53
6                  52.25    64.74    37.08
8                  45.23    59.34    32.81
10                 39.80    46.81    29.98
12                 34.32    43.21    28.62
15                 27.14    36.35    15.67
Fig. 4. Comparison of mean response times (in seconds) versus number of nodes in CS1, CS2 and CS3
Comparison of the mean response times for CS1 and CS2 shows that they are much shorter if all the PlanetLab nodes are selected from one area (continent), for example within Europe, rather than from all over the world. This is primarily because of communication overhead: when the controller and the simulation engines are located thousands of miles apart, the time used for exchanging data between the simulation engines and the controller directly affects the mean response time. We also ran the same experiment selecting PlanetLab nodes from North America only and found results similar to setup CS2. The speedup for the distributed scenarios CS1 and CS3 is calculated using Equation (1) and given in Table 2. Speedup has been calculated using the mean response time of two nodes as the reference. In spite of the longer distance between nodes, the speedup offered by the PlanetLab nodes in the case of CS1 is better than in CS3, because of the slightly better hardware of the PlanetLab nodes.
Table 2. Speedup for distributed scenarios CS1 and CS3

Number of Nodes    CS1     CS3
2                  1       1
4                  1.42    1.26
6                  1.69    1.61
8                  1.95    1.82
10                 2.21    1.99
12                 2.57    2.09
15                 3.24    3.17
The efficiency in the case of CS1 and CS3 has been calculated using Equation (3) and is shown in Table 3. In this case, there is only a small difference between the results. The efficiency decreases as the number of processors increases. This is due to the fact that processor communication is usually slower than computation, and the exchange of local estimates between Akslaves and Akmaster results in frequent communication.

Table 3. Efficiency for scenarios CS1 and CS3

Number of Nodes    CS1     CS3
2                  0.50    0.50
4                  0.31    0.35
6                  0.26    0.28
8                  0.22    0.24
10                 0.19    0.22
12                 0.17    0.21
15                 0.20    0.21
These results allow us to conclude that it has become practical to use the distributed computing resources of global experimental networks for fast quantitative stochastic simulation, paying only a small penalty in the form of a minor worsening of response times, speedup and efficiency of the simulation compared with the same simulations run on a local area network. The advantage of using globally distributed computing resources is that they can be substantially larger than the ones available locally. We conducted experiments using two different ways of selecting computers in PlanetLab for simulation engines and compared their performance with the performance of simulations run on computers of a local area network. The performance of MRIP in CS1 appears to be better than in CS2. Thus, for best results, selecting computers from a close geographical region and avoiding both heavily loaded nodes and busy hours is recommended.
5 Conclusions In this paper we have shown that the distributed computing resources of global experimental networks, such as PlanetLab, can be effectively used for running quantitative stochastic simulations in the MRIP scenario. Only a small penalty (in the form of a minor worsening of performance) is paid for using globally distributed resources instead of local ones. Launching and on-line control of globally distributed simulations can be done using, for example, Akaroa2. This is encouraging news for those who need to run time-consuming quantitative simulations to get accurate final results, but do not have access to a sufficiently large number of computers for launching multiple simulation engines. Recently, there has been a surge in the development of global and regional experimental networking facilities; see Table 4 [13]. Most of these networks offer free membership and can be effectively used for conducting simulation experiments under the control of Akaroa2.

Table 4. Selected experimental networking facilities, with size and accessibility

Name         Purpose         Size        Access
OneLab       Multipurpose    Regional    Free membership
Panlab       Multipurpose    Regional    Planned to be on payment
Federica     Multipurpose    Regional    Free membership
PlanetLab    Multipurpose    Global      Free membership
GENI         Multipurpose    Regional    Free membership
JNB 2        Multipurpose    Regional    Free membership
CNGI         Multipurpose    Regional    Free membership
In future, we plan to investigate the upper bounds for speedup of globally distributed sequential stochastic simulation, such as those in the MRIP scenario. This will require running experiments at full scale, employing hundreds of PlanetLab nodes as simulation engines, with simulations requiring extremely large samples of output data for producing accurate simulation results, in particular if the simulated processes are strongly correlated. Acknowledgments. This work was partially supported by REANNZ (2010/2011 KAREN Capability Build Fund).
References
1. Pawlikowski, K., Yau, V., McNickle, D.: Distributed stochastic discrete-event simulation in parallel time streams. In: 26th Conference on Winter Simulation, pp. 723–730. Society for Computer Simulation International, Orlando (1994)
2. Pawlikowski, K., Ewing, G., McNickle, D.: Performance Evaluation of Industrial Processes in Computer Network Environments. In: European Conference on Concurrent Engineering, pp. 129–135. Int. Society for Computer Simulation, Erlangen (1998)
3. Pawlikowski, K., McNickle, D.: Speeding Up Stochastic Discrete-Event Simulation. In: European Simulation Symposium, pp. 132–138. ISCS Press, Marseille (2001)
4. Lemke, M.: The Role of Experimentation in Future Internet Research: FIRE and Beyond. In: 6th International Conference on Testbeds and Research Infrastructures for the Development of Networks and Communities, Berlin, Germany (2010)
5. PlanetLab, http://www.planet-lab.org/
6. Ewing, G., Pawlikowski, K., McNickle, D.: Akaroa-2: Exploiting Network Computing by Distributing Stochastic Simulation. In: 13th European Simulation Multi-Conference, Warsaw, Poland, pp. 175–181 (1999)
7. Ewing, G., Pawlikowski, K.: Spectral Analysis for Confidence Interval Estimation Under Multiple Replications in Parallel. In: 14th European Simulation Symposium, pp. 52–61. ISCS Press, Dresden (2002)
8. Shaw, N., McNickle, D., Pawlikowski, K.: Fast Automated Estimation of Variance in Sequential Discrete Event Stochastic Simulation. In: 25th European Conference on Modelling and Simulation, Krakow, Poland (2011)
9. Akaroa2, http://www.cosc.canterbury.ac.nz/research/RG/net_sim/simulation_group/akaroa/about.chtml
10. CoDeploy, http://codeen.cs.princeton.edu/codeploy/
11. PlanetLab NZ, http://www.planetlabnz.canterbury.ac.nz/
12. CoMon, http://comon.cs.princeton.edu
13. Haque, M., Pawlikowski, K., Ray, S.: Challenges to Development of Multipurpose Global Federated Testbed for Future Internet Experimentation. In: 9th ACS/IEEE International Conference on Computer Systems and Applications, Sharm El-Sheikh, Egypt (2011)
Comparison of Three Parallel Point-Multiplication Algorithms on Conic Curves

Yongnan Li¹,², Limin Xiao¹,², Guangjun Qin¹,², Xiuqiao Li¹,², and Songsong Lei¹,²

¹ State Key Laboratory of Software Development Environment, Beihang University, Beijing, 100191, China
² School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
{liyongnan.buaa,guangjunster,xiuqiaoli,lss.linux}@gmail.com
[email protected]

Abstract. This paper makes a comparison of three parallel point-multiplication algorithms on conic curves over the ring Zn. We propose one algorithm for parallelizing point-multiplication by utilizing the Chinese Remainder Theorem to divide point-multiplication over the ring Zn into two different point-multiplications over finite fields and to compute them respectively. The time complexity and speedup ratio of this parallel algorithm are computed on the basis of our previous research on the basic parallel algorithms in the conic curves cryptosystem. A quantitative performance analysis is made to compare this algorithm with two other algorithms we designed before. The performance comparison demonstrates that the algorithm presented in this paper can reduce the time complexity of point-multiplication on conic curves over the ring Zn and is more efficient than the preceding ones.

Keywords: conic curves, ring Zn, finite field Fp, point-addition, point-double, point-multiplication, Chinese Remainder Theorem.
1   Introduction
In recent years, three main classes of public-key cryptosystems are considered both secure and efficient: integer factorization systems, discrete logarithm systems, and discrete logarithm systems based on mathematical curves. The conic curves cryptosystem belongs to the third class. Professor Cao first presented the concept of conic curves cryptography in [1-2]. Then a public-key cryptosystem scheme on conic curves over ring Zn was proposed in [3-5]. Research in [6] introduced the definitions of extended point-addition and point-double on conic curves over ring Zn. In this paper, an efficient technique for parallel computation of the point-multiplication on conic curves over ring Zn is proposed, and our algorithm can reduce the time complexity of point-multiplication. The analysis of this parallel methodology is based on our previous work on the basic parallel algorithms used in the conic curves cryptosystem. Study in [7] proposed several parallel algorithms for the cryptosystem on conic curves over finite field Fp. In [8], original point-addition and point-double were parallelized for the cryptosystem on conic curves over ring Zn. Work in [9] introduced

Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 43–53, 2011.
© Springer-Verlag Berlin Heidelberg 2011
traditional parallel point-multiplication in the conic curves cryptosystem over ring Zn and finite field Fp. Parallel extended basic operations of point-addition and point-double were proposed in [10]. Study in [11] designed two high-performance algorithms of point-multiplication for the conic curves cryptosystem based on the standard NAF algorithm and the Chinese Remainder Theorem. The methodology presented in this paper partitions point-multiplication over ring Zn into two point-multiplications over finite field Fp and finite field Fq by applying the Chinese Remainder Theorem, first recorded in the ancient Chinese mathematical treatise attributed to Sun Tzu (the Sunzi Suanjing). The two point-multiplications are executed respectively, and then the temporary results are merged to get the final value. This method is similar to the one we proposed in [11]; the difference is that the preceding research adopted the standard NAF algorithm to compute the point-multiplication over finite field Fp. The time complexity and speedup ratio of this parallel algorithm are computed on the basis of our previous research on the time complexities of the fundamental algorithms on conic curves. We then evaluate the performance of this parallel algorithm and compare it with two older parallel algorithms. The parallel algorithm proposed in this paper not only accelerates the speed of point-multiplication, but also shows higher efficiency than the two older parallel algorithms we designed before. The rest of this paper is organized as follows. The next section introduces the definition of point-multiplication on conic curves over ring Zn. Section 3 depicts the time complexities of the basic operations in the conic curves cryptosystem. Section 4 presents the methodology of parallelizing point-multiplication on conic curves over ring Zn. The performance comparison of our techniques is presented in Section 5. The last section concludes the whole paper and briefly points out some future work.
2   Definition of Point-Multiplication
The definitions of point-addition and point-double must be introduced first. In the conic curves cryptosystem over ring Zn, Cn(a, b) denotes the conic curve. C1, C2 and C3 represent three different fields over ring Zn. For any points P(x1, y1) ∈ Cn(a, b) and Q(x2, y2) ∈ Cn(a, b), the operator ⊕ is defined as:

• If P ≠ Q, then the operation of point-addition is P ⊕ Q.
• If P = Q, then the operation of point-double is 2P.

The operator ⊕ is defined differently in the expressions of point-addition and point-double. Point-multiplication signifies the summation of many points on the conic curve. Parameter k and parameter P represent a coefficient and a point on the conic curve respectively. Point-multiplication kP is defined as:

kP = P ⊕ P ⊕ ··· ⊕ P   (count = k) ,   (1)

In the conic curves cryptosystem over ring Zn, we define n = pq (p and q are two different odd prime integers). For more details, please refer to the researches in [1-5].
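Equation (1) defines kP as k − 1 successive ⊕ operations, but in practice kP is evaluated with O(log k) operator applications by the standard double-and-add method (the NAF variant of [11] is a refinement of this idea). The following Python sketch is a generic illustration, not the paper's algorithm: point_add and point_double are opaque stand-ins for the two operators defined above.

```python
def point_multiply(k, P, point_add, point_double):
    """Compute kP = P + P + ... + P (k terms, k >= 1) by scanning the
    bits of k, doubling once per bit and adding where a bit is set."""
    assert k >= 1
    result = None          # no partial sum accumulated yet
    addend = P             # equals (2 ** bit_position) * P
    while k:
        if k & 1:          # this bit of k contributes 'addend' to kP
            result = addend if result is None else point_add(result, addend)
        k >>= 1
        if k:              # prepare the addend for the next bit
            addend = point_double(addend)
    return result
```

With point_add and point_double instantiated as integer addition and doubling, point_multiply(k, P, ...) reduces to k·P, which makes the sketch easy to check by hand.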
3   Time Complexities of Basic Operations
As depicted in [7-10], time complexities of several basic operations are listed in Table 1 and Table 2. We set the runtime of a single-precision multiplication as the basic measurement unit. The time complexities of multiple-precision addition and subtraction are O(1).

Table 1. Time complexities of parallel operations (computation + communication)

Original point-addition over ring Zn:
  O(3N²/sn + 2Np²/sp + 9lN + 3lA + 21) + O(3N²/sn + 2Np²/sp + 6N + 6lN + 4Np + 2lA + 5)
Original point-double over ring Zn:
  O(3N²/sn + 2Np²/sp + 9lN + 3lA + 2a + 20.5) + O(3N²/sn + 2Np²/sp + 6N + 6lN + 4Np + 2lA + 5)
Extended point-addition over ring Zn:
  O(2N²/sn + 6Np²/sp + 6lNp + 28) + O(2N²/sn + 6Np²/sp + 4N + 12Np + 4lNp + 10)
Extended point-double over ring Zn:
  O(2N²/sn + 6Np²/sp + 6lNp + 28) + O(2N²/sn + 6Np²/sp + 4N + 12Np + 4lNp + 4)
Point-multiplication over ring Zn:
  O(t(3N²/sn + 2Np²/sp + 9lN + 3lA + 24.5) − 3.5) + O(t(3N²/sn + 2Np²/sp + 6N + 6lN + 4Np + 2lA + 5) + 1)
Point-multiplication over finite field Fp:
  O(t(3n²/s + 10 + 3⌈lg X⌉ + 3⌈lg P⌉)) + O(t(3n²/s + 6n + 2⌈lg X⌉ + 2⌈lg P⌉) + 1)
Multiplication:
  O(n²/s + 2) + O(n²/s + 2n)
Barrett reduction:
  O(2n²/s + 8) + O(2n²/s + 4n)
The meanings of the variables in Table 1 and Table 2 are:

• N: multiple-precision of operand n over ring Zn.
• Np: multiple-precision of operand p over ring Zn.
• Nq: multiple-precision of operand q over ring Zn.
• a: a coefficient of the conic curves equation.
• b: a coefficient of the conic curves equation.
• A: a fixed coefficient over ring Zn.
• s: process number for computing multiplication.
• l: word length of the computer.
• X: numerator of inversion-multiplication.
• P: denominator of inversion-multiplication.
• n: multiple-precision of an operand over a finite field.
• t: the number of bits in coefficient k for computing point-multiplication kP.
• sn: the value of s over ring Zn.
• sp: the value of s over finite field Fp.
• sq: the value of s over finite field Fq.

Table 2. Time complexities of sequential operations
Original point-addition over ring Zn:       O(2N² + Np² + Nq² + 10N + 12lN + 4lA + 1.5b + 19.5)
Original point-double over ring Zn:         O(2N² + Np² + Nq² + 10N + 12lN + 4lA + 1.5b + 2a + 17.5)
Extended point-addition over ring Zn:       O(N² + 5Np² + 5Nq² + 23N + 8lN + 37)
Extended point-double over ring Zn:         O(N² + 4Np² + 4Nq² + 19N + 8lN + 27)
Point-multiplication over ring Zn:          O((3(t−1)/2)(2N² + Np² + Nq² + 10N + 12lN + 4lA + 1.5b + 17.5) + (2a+1)(t−1))
Point-multiplication over finite field Fp:  O((3(t−1)/2)(2n² + 9n + 5 + 4⌈lg X⌉ + 4⌈lg P⌉))
Multiplication:                             O(n² + 2n)
Barrett reduction:                          O(n² + 4n + 5)
The relationship of the variables N, Np and Nq in Table 1 and Table 2 is: N = Np + Nq, Np ≥ Nq. The value of coefficient A is

A = (C1·N + C2·Nq + C3·Np) / (C1 + C2 + C3) .

C1, C2 and C3 stand for the numbers of points in the fields C1, C2 and C3. In this paper, the symbol "Fq" has the same meaning as finite field Fp; the distinction is that its module is the prime integer "q".
4   Parallel Point-Multiplication
This section explains the methodology of parallelizing point-multiplication for the cryptosystem on conic curves over ring Zn. It uses the Chinese Remainder Theorem to partition point-multiplication over ring Zn into two point-multiplications over finite fields. As Fig. 1 shows, there are three steps in the parallel procedure of point-multiplication. First, two reductions are calculated to divide parameter t into tp and tq. Then two point-multiplications are computed over finite field Fp and finite field Fq respectively. Lastly, kP(tp) and kP(tq) are combined by applying the Chinese Remainder Theorem to get the value of kP(t).
Fig. 1. Parallel procedure using Chinese Remainder Theorem
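The integer arithmetic behind the three steps can be sketched in a few lines. The snippet below (ours, for illustration only) uses hypothetical small primes and plain integers in place of conic curve points; the precomputed constants w_p and w_q play the role of the paper's pp−1 and qq−1, and pow(x, -1, m) (Python 3.8+) computes the modular inverses.

```python
# Illustrative CRT split/merge over Zn with n = p*q. The primes here are
# hypothetical toy values; a real cryptosystem uses large primes.
p, q = 11, 13
n = p * q

# Precomputed CRT weights: w_p = 1 (mod p) and 0 (mod q); w_q vice versa.
w_p = q * pow(q, -1, p)
w_q = p * pow(p, -1, q)

def crt_split(t):
    """Step 1: two reductions map t over ring Zn to (tp, tq)."""
    return t % p, t % q

def crt_merge(tp, tq):
    """Step 3: merge the partial results back into a value over Zn."""
    return (tp * w_p + tq * w_q) % n
```

For every t in range(n), crt_merge(*crt_split(t)) returns t again; in the actual algorithm the two merged operands are the point-multiplication results computed independently over Fp and Fq in step 2.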
In the first step, two reductions are calculated to map parameter t over ring Zn into tp over finite field Fp and tq over finite field Fq. Then,

Tp−left = O(2Np²/sp + 8) + O(2Np²/sp + 4Np) ,   (2)

Tp−right = O(2Nq²/sq + 8) + O(2Nq²/sq + 4Nq) .   (3)

We can conclude that the value of Tp−left is bigger than Tp−right because Np ≥ Nq. This procedure costs two communication units (variable t and qq−1), so the parallel runtime of the first step is

Tp−first = O(2Np²/sp + 8) + O(2Np²/sp + 4Np + 2) .   (4)
In the second step, two operations of point-multiplication over finite field Fp and finite field Fq are executed simultaneously. Then the values of the two point-multiplications and the inversions of the two modules should be multiplied to get the final value of point-multiplication over ring Zn. Parameters pp−1 and qq−1 are two constants in the cryptosystem because p and q are fixed. The multiple-precision of kP(tp)·pp−1 is 2Np and the multiple-precision of kP(tq)·qq−1 is 2Nq. Consequently, it costs one point-multiplication and one multiple-precision multiplication over each finite field. We can get the parallel runtime of the second step:

Tp−left = O(t(3Np²/sp + 6lNp + 10) + Np²/sp + 2) + O(t(3Np²/sp + 6Np + 4lNp) + Np²/sp + 2Np + 1) ,   (5)

Tp−right = O(t(3Nq²/sq + 6lNq + 10) + Nq²/sq + 2) + O(t(3Nq²/sq + 6Nq + 4lNq) + Nq²/sq + 2Nq + 1) .   (6)

Then,

Tp−second = O(t(3Np²/sp + 6lNp + 10) + Np²/sp + 2) + O(t(3Np²/sp + 6Np + 4lNp) + Np²/sp + 2Np + 1) .   (7)
In the third step, the values of parameters kP(tp)·pp−1 and kP(tq)·qq−1 are merged to get the point-multiplication over ring Zn by computing (8):

kP(t) = (kP(tp)·pp−1 + kP(tq)·qq−1) mod n .   (8)

The sum of kP(tp)·pp−1 and kP(tq)·qq−1 is a 2Np multiple-precision integer. One multiple-precision reduction will be needed because this sum is a 2Np multiple-precision integer and 2Np ≥ N. Therefore, the third step costs one multiple-precision addition and one multiple-precision reduction. The parallel runtime of the third step is

Tp−third = O(2N²/sn + 8) + O(2N²/sn + 4N + 1) .   (9)
Consequently, the parallel runtime of point-multiplication is

Tp1 = O(t(3Np²/sp + 10 + 6lNp) + 2N²/sn + 3Np²/sp + 18) + O(t(3Np²/sp + 6Np + 4lNp) + 2N²/sn + 3Np²/sp + 6Np + 4N + 4) .   (10)

The sequential runtime of point-multiplication can be looked up in Table 2:

Ts = O((3(t−1)/2)(2N² + Np² + Nq² + 10N + 12lN + 4lA + 1.5b + 17.5) + (2a+1)(t−1)) .   (11)

Then we can get the speedup ratio:

S = O((3(t−1)/2)(2N² + Np² + Nq² + 10N + 12lN + 4lA + 1.5b + 17.5) + (2a+1)(t−1)) / [ O(t(3Np²/sp + 10 + 6lNp) + 2N²/sn + 3Np²/sp + 18) + O(t(3Np²/sp + 6Np + 4lNp) + 2N²/sn + 3Np²/sp + 6Np + 4N + 4) ] .   (12)

5   Performance Comparison
This section evaluates the performance of the parallel point-multiplication proposed in this paper, and a quantitative performance comparison is made between this parallel point-multiplication and the two other parallel point-multiplications we presented before. The parameters over ring Zn are assumed as: N = 2sn = 2Np = 2Nq = 4sp = 4sq, a = 2, b = 1, l = 32. As demonstrated in [8],

A = (C1·N + C2·Nq + C3·Np) / (C1 + C2 + C3) .   (13)

If variable n over ring Zn is big enough, the value of C1 is much bigger than C2 and C3. Then coefficient A will be approximately equal to N. Therefore, Tp1 in (10) is simplified as:
Tp1 = O(t(99N + 10) + 7N + 18) + O(70tN + 14N + 4) .   (14)

The time complexity of the parallel point-multiplication proposed in [11] is (15), and it can be simplified as (16):

Tp2 = O((t + 1/2)(3Np²/sp + 6lNp + 10) + 2N²/sn + 3Np²/sp + 18) + O((t + 1/2)(3Np²/sp + 6Np + 4lNp) + 2N²/sn + 3Np²/sp + 4N + 6Np + 4) .   (15)

Tp2 = O((t + 0.5)(99N + 10) + 7N + 18) + O(70N(t + 0.5) + 14N + 4) .   (16)

As demonstrated in Table 1 and Table 2, the runtimes of the traditional parallel point-multiplication and the sequential point-multiplication can be simplified as:

Tp3 = O(t(384N + 24.5) − 3.5) + O(t(272N + 5) + 1) ,   (17)

Ts = O((3(t−1)/2)(2.5N² + 522N + 19) + 5(t−1)) .   (18)
On the condition that the communication time unit is 20 percent of the computation time unit, the performance evaluation and comparison are shown in Table 3, Fig. 2, Fig. 3 and Fig. 4. It can be seen that the methodology of parallelizing point-multiplication accelerates the speed of point-multiplication and is more efficient than the two other methods. We also made other assumptions about the relationship between the communication time unit and the computation time unit; the same conclusion is derived by analyzing the performance comparison under those different conditions.

Table 3. Performance evaluation

 N   t        Tp1        Tp2        Tp3          Ts
 8  10     9237.2     9694.2    35323.7     58837.5
 8  20    18377.2    18834.2    70650.7    124212.5
 8  30    27517.2    27974.2   105977.7    189587.5
 8  40    36657.2    37114.2   141304.7    254962.5
 8  50    45797.2    46254.2   176631.7    320337.5
 8  60    54937.2    55394.2   211958.7    385712.5
 8  70    64077.2    64534.2   247285.7    451087.5
16  10    18355.6    19264.6    70395.7    121693.5
16  20    36535.6    37444.6   140794.7    256908.5
16  30    54715.6    55624.6   211193.7    392123.5
16  40    72895.6    73804.6   281592.7    527338.5
16  50    91075.6    91984.6   351991.7    662553.5
16  60   109255.6   110164.6   422390.7    797768.5
16  70   127435.6   128344.6   492789.7    932983.5
24  10    27474      28835    105467.7    188869.5
24  20    54694      56055    210938.7    398724.5
24  30    81914      83275    316409.7    608579.5
24  40   109134     110495    421880.7    818434.5
24  50   136354     137715    527351.7   1028289.5
24  60   163574     164935    632822.7   1238144.5
24  70   190794     192155    738293.7   1447999.5
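Since the table entries are fully determined by the simplified expressions (14) and (16)-(18), they can be re-derived mechanically. The following Python snippet (ours, not part of the paper) evaluates the four models with the communication unit weighted at 0.2 of a computation unit:

```python
# Re-derive Table 3 from the simplified runtime models, eqs. (14), (16),
# (17), (18), weighting one communication unit as 0.2 computation units.
COMM = 0.2  # assumed ratio of communication to computation time units

def tp1(N, t):
    """Eq. (14): CRT-based parallel point-multiplication (this paper)."""
    return (t * (99 * N + 10) + 7 * N + 18) + COMM * (70 * t * N + 14 * N + 4)

def tp2(N, t):
    """Eq. (16): NAF + CRT parallel point-multiplication of [11]."""
    return ((t + 0.5) * (99 * N + 10) + 7 * N + 18
            + COMM * (70 * N * (t + 0.5) + 14 * N + 4))

def tp3(N, t):
    """Eq. (17): traditional parallel point-multiplication."""
    return (t * (384 * N + 24.5) - 3.5) + COMM * (t * (272 * N + 5) + 1)

def ts(N, t):
    """Eq. (18): sequential point-multiplication."""
    return (3 * (t - 1) / 2) * (2.5 * N**2 + 522 * N + 19) + 5 * (t - 1)
```

For instance, tp1(8, 10) evaluates to 9237.2 and ts(24, 70) to 1447999.5 (up to floating-point rounding), matching the first and last rows of Table 3.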
Fig. 2. Performance comparison while N=8
Fig. 3. Performance comparison while N=16
Fig. 4. Performance comparison while N=24
6   Conclusions
In this paper, we presented a methodology for parallelizing point-multiplication on conic curves over ring Zn. The method is based on the Chinese Remainder Theorem. The performance comparison between this parallel methodology and the two other ones demonstrates that the technique introduced in this paper is the most efficient one. Our research, including the study in this paper, focuses on the basic parallel algorithms in the conic curves cryptosystem. In the future, we plan to design a parallel algorithm for ElGamal cryptography on conic curves over ring Zn based on these parallel algorithms.

Acknowledgments. This study is sponsored by the fund of the State Key Laboratory of Software Development Environment under Grant No. SKLSDE-2009ZX-01, the Fundamental Research Funds for the Central Universities under Grant No. YWF-1002-058 and the National Natural Science Foundation of China under Grant No. 60973007.
References

1. Cao, Z.: A public key cryptosystem based on a conic over finite fields Fp. In: Advances in Cryptology: Chinacrypt 1998, pp. 45–49. Science Press, Beijing (1998) (in Chinese)
2. Cao, Z.: Conic analog of RSA cryptosystem and some improved RSA cryptosystems. Natural Science Journal of Heilongjiang University 16(4), 5–18 (1999)
3. Chen, Z., Song, X.: A public-key cryptosystem scheme on conic curves over the ring Zn. In: 6th International Conference on Machine Learning and Cybernetics, vol. 4, pp. 2183–2187. IEEE Press, Hong Kong (2007)
4. Sun, Q., Zhu, W., Wang, B.: The conic curves over Zn and public key cryptosystem protocol. J. Sichuan Univ. (Nat. Sci. Ed.) 42(3), 471–478 (2005) (in Chinese)
5. Wang, B., Zhu, W., Sun, Q.: Public key cryptosystem based on the conic curves over Zn. J. Sichuan Univ. (Engin. Sci. Ed.) 37(5), 112–117 (2005) (in Chinese)
6. Li, Y.: Research of Conic Curve Cryptosystems and the Construction of CC-CSP. Thesis for the degree of master in computer application technology, Northeastern University, pp. 25–27 (2008) (in Chinese)
7. Li, Y., Xiao, L., Hu, Y., Liang, A., Tian, L.: Parallel algorithms for cryptosystem on conic curves over finite field Fp. In: 9th International Conference on Grid and Cloud Computing, pp. 163–167. IEEE Press, Nanjing (2010)
8. Li, Y., Xiao, L., Liang, A., Wang, Z.: Parallel point-addition and point-double for cryptosystem on conic curves over ring Zn. In: 11th International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 317–322. IEEE Press, Wuhan (2010)
9. Li, Y., Xiao, L.: Parallel point-multiplication for conic curves cryptosystem. In: 3rd International Symposium on Parallel Architectures, Algorithms and Programming, pp. 116–120. IEEE Press, Dalian (2010)
10. Li, Y., Xiao, L., Chen, S., Tian, H., Ruan, L., Yu, B.: Parallel Extended Basic Operations for Conic Curves Cryptography over Ring Zn. In: 9th IEEE International Symposium on Parallel and Distributed Processing with Applications Workshops, pp. 203–209. IEEE Press, Busan (2011)
11. Li, Y., Xiao, L., Wang, Z., Tian, H.: High Performance Point-Multiplication for Conic Curves Cryptosystem based on Standard NAF Algorithm and Chinese Remainder Theorem. In: 2011 International Conference on Information Science and Applications. IEEE Press, Jeju (2011)
Extending Synchronization Constructs in OpenMP to Exploit Pipeline Parallelism on Heterogeneous Multi-core

Shigang Li, Shucai Yao, Haohu He, Lili Sun, Yi Chen, and Yunfeng Peng

University of Science and Technology Beijing, 100083 Beijing, China
[email protected]

Abstract. The ability to express multiple levels of parallelism is one of the significant features of the OpenMP parallel programming model. However, pipeline parallelism is not well supported in OpenMP. This paper proposes extensions to OpenMP directives, aiming at expressing pipeline parallelism effectively. The extended directives are divided into two groups. One can define the precedence at the thread level while the other can define the precedence at the iteration level. Through these directives, programmers can establish the pipeline model more easily and exploit more parallelism to improve performance. To support these directives, a set of runtime interfaces for synchronization are implemented on the Cell heterogeneous multi-core architecture using the signal block communication mechanism. Experimental results indicate that good performance can be obtained from the pipeline scheme proposed in this paper compared to the naive parallel applications.

Keywords: Pipeline Parallelism, OpenMP, Cell architecture.
1   Introduction
Multi-core architectures are becoming the industry standard in the modern computer industry. There are two main categories: homogeneous multi-core and heterogeneous multi-core. The former includes only identical cores while the latter integrates a control core and several accelerator cores. The IBM/Toshiba/Sony Cell processor [10] is a typical heterogeneous multi-core. It comprises a conventional Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs). SPEs don't have hardware caches, but each possesses a 256 KB local store. Communication between the PPE and SPEs can be implemented through DMA, signals or mailboxes. For different memory architectures, there are different programming models, such as message passing for distributed memory and shared-memory inter-core communication methods. A well-known programming model for shared-memory parallel programming is OpenMP [1]. In the current definition of OpenMP, multiple levels of parallelism [9, 15, 16, 17] can be expressed. However, pipeline parallelism is not well supported in OpenMP. Due to the requirements of both programmers and applications, it is necessary to extend OpenMP directives to express pipelined executions. In this paper we extend the OpenMP programming model with two groups of synchronization directives (one is

Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 54–63, 2011.
© Springer-Verlag Berlin Heidelberg 2011
based on threads and the other is based on iterations), by which programmers can establish the pipeline model flexibly and simply. Furthermore, runtime interfaces are implemented. To evaluate the performance, we conduct experiments on a Cell Blade using the NAS IS, EP, LU [14] and SPEC2001 MOLDYN [20] benchmarks. In IS, EP and MOLDYN, speedup factors from 4.8 to 5.5 can be obtained with our pipeline model. In LU, the pipeline structure can be established easily using our extended directives rather than using complicated data structures. The remainder of the paper is structured as follows: Section 2 discusses the related work. Section 3 presents the extended directives for pipeline parallelism. In Section 4 we show the runtime support for pipeline parallelism on the Cell architecture. Experimental results are presented in Section 5 before we conclude in Section 6.
2   Related Work
Pipeline parallelism has been researched for various programming languages and on different architectures. Gonzalez et al. present research on extending OpenMP directives to exploit pipelined executions in OpenMP [2]. This pipeline model is oriented to work-sharing constructs, while our pipeline model is loop-oriented, which makes it possible to partition and pipeline the tasks in the critical region. Michailidis et al. propose a pipeline implementation of LU factorization in OpenMP using a queue data structure [5]. In this pipeline scheme, no extra directives are extended to support pipelining. In contrast, two functions, Put() and Get(), are implemented using the original syntax of OpenMP to exchange elements between threads of the pipeline. Baudisch et al. present an automatic synthesis procedure that translates synchronous programs to software pipelines [6]. The compiler [8] generates the dependency graph of the guarded actions. Based on this, the graph is split into horizontal slices [6] that form threads to implement pipeline stages. In addition, threads communicate through FIFO buffers to allow asynchronous execution. Decoupled Software Pipelining (DSWP) [12, 3] can exploit the fine-grained pipeline parallelism inherent in most applications. In order to ensure that critical resources are kept in one thread, DSWP partitions the loop body into a pipeline of threads, rather than partitioning a loop by putting different iterations into different threads. Low-overhead synchronization between the pipeline stages can be implemented with a special synchronization array [3]. Coarse-grained parallelism is suitable for stream programming, because streaming applications are naturally represented by independent filters that communicate over explicit data channels [13]. Thies et al. present exploiting coarse-grained pipeline parallelism in C programs [11] to improve the performance of streaming applications. On Parallel Processor Arrays (PPA), Syrivelis et al.
present runtime support to extract coarse-grained pipeline parallelism out of sequential code [4]. Kurzak et al. present solving systems of linear equations on the Cell processor [7], where single-instruction-multiple-data parallelism, instruction-level parallelism and thread-level parallelism are well exploited. Nevertheless, pipeline parallelism is less involved in that scheme.
3   Synchronization Constructs Extensions
In order to exploit pipeline parallelism at both the thread and iteration levels, we equip programmers with two corresponding sets of directives. One set defines the precedences at the thread level while the other defines the precedences at the iteration level. Synchronizations in the pipeline model are implemented by the extended directives. Each directive is described in detail as follows.

3.1   Synchronization Directives at Thread Level
The OpenMP API uses the fork-join execution model [1]. However, some constructs in the parallel region make the execution or the memory access sequential, such as the critical, atomic and ordered constructs. When the computation amount is great, synchronization constructs badly degrade the performance of parallel applications. The straightforward method to improve the performance is to pipeline, specifying the precedence relationships between different threads. The syntax of the ThrdPipe construct (C or C++ version) is as follows:

#pragma omp ThrdPipe [clause[ [,]clause] ...] new-line
for-loops

The clause is one of the following:

blck_num (integer-expression)
mry_syn (scalar-expression)

The ThrdPipe construct specifies that the iterations of the outermost loop will be executed in pipeline by threads in the current context. This construct is one of the extended synchronization constructs which help the programmer to pipeline loops with data dependences that can't be parallelized by the loop construct directly. The blck_num clause specifies the number of blocks that the outermost loop is partitioned into. Subsequently, the block size and the theoretical speedup factor can be determined as follows.

• Block Size: In our runtime implementation, we simply use a static partition algorithm to partition the outermost loop. For a specified number of blocks p and a loop of n iterations, let the integers q and r satisfy n = p*q − r, with 0 ≤ r < p.

– each task i has F(i) copies, with F(i) > 0,
– at any time, a processor executes at most one copy,
– for each (i, j) ∈ E, for any copy jl of j, there is one copy ik of i that is on the same processor or that sends its message on time to jl, i.e.

  if π(jl) = π(ik) then
    tc(jl) ≥ tc(ik) + pi
  else if π(jl) and π(ik) are in the same cluster then
    tc(jl) ≥ tc(ik) + pi + ci,j(1)
  else
    tc(jl) ≥ tc(ik) + pi + ci,j(2)
  end if

If, in a schedule S, ik and jl satisfy the above condition, we will say that the Generalized Precedence Constraint is true for the two copies (in short, that GPC(ik, jl) is true). Second, a feasible schedule S must additionally satisfy the condition that there is no message contention, i.e. in all channels used to transmit at least two messages m(ik, jl) and m(rt, sq) from a processor πik to a processor πjl, with message m(ik, jl) finishing before message m(rt, sq), we have

  if π(jl) and π(ik) are in the same cluster then
    tm(m(rt, sq)) ≥ tm(m(ik, jl)) + ci,j(1)
  else
    tm(m(rt, sq)) ≥ tm(m(ik, jl)) + ci,j(2)
  end if

Now, let C(ik) be the completion time of a copy ik of a task i, i.e. C(ik) = tc(ik) + pi. The maximum completion time, or makespan, Cmax of a solution S is the largest completion time of all copies of all tasks in this solution:
if π(jl ) = π(ik ) then tc (jl ) ≥ tc (ik ) + pi else if π(jl ) and π(ik ) are in the same cluster then tc (jl ) ≥ tc (ik ) + pi + ci,j (1) else tc (jl ) ≥ tc (ik ) + pi + ci,j (2) end if If, in a schedule S, ik and jl satisfy the above condition, we will say that the Generalized Precedence Constraint is true for the two copies (in short, that GP C(ik , jl ) is true). Second, a feasible schedule S must additionally satisfy the condition that there is no message contention, i.e. in all channels used to transmit at least two messages m(ik , jl ) and m(rt , sq ) from a processor πik to a processor πjl , with message m(ik , jl ) finishing before message m(rt , sq ), we have if if π(jl ) and π(ik ) are in the same cluster then tm (m(rt , sq )) ≥ tm (m(ik , jl )) + ci,j (1) else tm (m(rt , sq )) ≥ tm (m(ik , jl )) + ci,j (2) end if Now, let C(ik ) be the completion time of a copy ik of a task i, i.e. C(ik ) = tc (ik ) + pi . The maximum completion time, or makespan, Cmax of a solution S is the largest completion time of all copies of all tasks in this solution: Cmax =
max
i∈V,k≤F (i)
{tc (ik ) + pi } .
(1)
As usual for this kind of problem, we want to minimize Cmax, that is, find a feasible solution S* with the smallest makespan C*max. One can note that, if ci,j(1) = ci,j(2), this scheduling problem is actually equivalent to the classical DAG scheduling problem with communication delays
Scheduling Tasks and Communications on a Hierarchical System
93
which, in the general case, is an NP-hard problem, even if the number of processors is not limited [16]. For this reason, we will only consider a DAG satisfying the conditions in the following two equations. They guarantee that the DAG has small communication delays. We will denote PRED(i) (respectively SUCC(i)) the set of immediate predecessors (resp. successors) of task i in G.

∀i ∈ V,   min_{g∈PRED(i)} pg  ≥  max_{h∈PRED(i)−{g}} ch,i(1) .   (2)
Equation (2) means that the processing times are locally greater than or equal to the communication delays inside the clusters. It ensures that the earliest start date of any copy of each task may be computed in polynomial time.

∀i ∈ V,   min_{k∈SUCC(i)} pk  ≥  max_{j∈SUCC(i)−{k}} ci,j(2) .   (3)
Equation (3) is very similar to (2). It also means that the processing times are locally greater than or equal to the communication delays between the clusters. However, (2) deals with the predecessors of a task and with the intra-cluster communication delays, while (3) deals with the successors and with the inter-cluster communication delays. Also, (2) is true in most cases if (3) is true. One can note that there is already a trivial solution to the 2lVds problem: use one cluster only, and schedule all tasks on the processors of this cluster using the algorithm in [6]. This trivial solution, however, is not helpful at all, because real architectures have a limited number of processors in each cluster. For this reason, we propose the following new algorithm 2lVdsOpt. It schedules the tasks and communications in a 2lVds problem in polynomial time and spreads the tasks over as many clusters as possible to use fewer processors per cluster.

2.2   The 2lVdsOpt Algorithm

This algorithm has four steps. The first step 2lVdsLwb() computes the earliest start date of all copies of each task of the DAG. The second step 2lVdsCs() computes the critical sequences of the DAG according to the earliest start dates calculated during the first step. The third step 2lVdsCc() computes the graph of the critical sequences of the DAG, and its connected components according to the communication delays ci,j(1). The last step 2lVdsBuild() computes the solution, scheduling the tasks and communications on the 2lVds architecture.

Computing the Earliest Start Dates. The first step of 2lVdsOpt computes the earliest start date bi of all copies of each task i of the DAG. This is done in procedure 2lVdsLwb() (cf. Algorithm 1). Table 1 presents the earliest start dates of each task of the DAG of Fig. 2 computed by procedure 2lVdsLwb().

Computing the Critical Sequences. The second step of 2lVdsOpt computes the critical sequences resulting from the earliest start dates calculated during step 1. Let B be the set of the earliest start dates bi of all tasks of V.
94
J.-Y. Colin and M. Nakechbandi
Algorithm 1. procedure 2lVdsLwb(V, E, p, c)
  for all tasks i ∈ V such that PRED(i) = ∅ do
    let bi = 0  {assign 0 to i as its earliest start date bi}
  end for
  while there is a task i that has not been assigned an earliest start date bi and whose predecessors h ∈ PRED(i) all have an earliest start date bh assigned to them do
    let c = max_{h ∈ PRED(i)} (bh + ph + ch,i(1))
    find g ∈ PRED(i) such that bg + pg + cg,i(1) = c
    let bi = max(bg + pg, max_{h ∈ PRED(i)−{g}} (bh + ph + ch,i(1)))
  end while

Table 1. Earliest start dates bi of the tasks i of the DAG of Fig. 2 computed by procedure 2lVdsLwb() (cf. Algorithm 1)

task i:  1  2  3  4  5  6  7   8  9  10  11  12  13  14  15  16
bi:      0  2  4  6  4  7  9  11  0   9  11  13  11  14  16  18
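The flow of procedure 2lVdsLwb() can be sketched in plain Python. This is an illustrative sketch only: the toy DAG, `pred`, `p`, and `c1` dictionaries below are hypothetical stand-ins for V, PRED, the processing times, and the intra-cluster delays c_{h,i}(1); they are not the DAG of Fig. 2.

```python
# Sketch of procedure 2lVdsLwb(): earliest start dates b_i of a DAG.
# pred[i] lists the predecessors of task i; p[i] is its processing time;
# c1[(h, i)] is the intra-cluster communication delay c_{h,i}(1).
def lwb(tasks, pred, p, c1):
    b = {i: 0 for i in tasks if not pred[i]}   # source tasks start at 0
    while len(b) < len(tasks):
        for i in tasks:
            if i in b or any(h not in b for h in pred[i]):
                continue
            # g realizes the largest b_h + p_h + c_{h,i}(1)
            g = max(pred[i], key=lambda h: b[h] + p[h] + c1[(h, i)])
            others = [b[h] + p[h] + c1[(h, i)] for h in pred[i] if h != g]
            # duplicating g on i's processor lets i avoid the delay c_{g,i}(1)
            b[i] = max([b[g] + p[g]] + others)
    return b

# Toy 3-task DAG: 1 -> 2, 1 -> 3, 2 -> 3, unit intra-cluster delays
tasks = [1, 2, 3]
pred = {1: [], 2: [1], 3: [1, 2]}
p = {1: 2, 2: 1, 3: 1}
c1 = {(1, 2): 1, (1, 3): 1, (2, 3): 1}
# lwb(tasks, pred, p, c1) == {1: 0, 2: 2, 3: 3}
```

Note how task 3 starts at date 3 rather than 4: its "worst" predecessor (task 2) is assumed duplicated on the same processor, so only the second-largest candidate date constrains it.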
Let GC be the critical subgraph of G according to the earliest start dates in B. (i, j) is an arc of GC if (i, j) ∈ E and bj < bi + pi + ci,j(1). That is, an arc (i, j) in GC means that these two tasks must have copies on the same processor, because there is not enough delay to transmit the result of any copy ik to a copy jl from one processor to another processor of the same cluster. GC is always a forest [5]. A critical sequence scs of the DAG is a proper path of GC. The computation is done in procedure 2lVdsCs() (cf. Algorithm 2).

Computing the Graph of the Critical Sequences. The third step of 2lVdsOpt builds the undirected graph GSC of the critical sequences scs and computes its connected components [10]. Let CC be the set of all computed critical sequences scs.

Algorithm 2. procedure 2lVdsCs(V, E, p, c, B)
  GC = ∅
  for all arcs (i, j) ∈ E do
    if bj < bi + pi + ci,j(1) then
      GC = GC ∪ {(i, j)}
    end if
  end for
  s = 0
  for all tasks i ∈ V do
    if task i is a leaf of the critical subgraph GC then
      let critical sequence scs be the path from the root of the tree in GC that includes task i, to task i
      s = s + 1
    end if
  end for
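A compact sketch of this step follows. The arcs, dates, and delays below are illustrative inputs (not the DAG of Fig. 2); the code relies on GC being a forest, so each task has at most one parent.

```python
# Sketch of procedure 2lVdsCs(): keep only the "critical" arcs, then read
# off the root-to-leaf paths of the resulting forest GC.
def critical_sequences(arcs, b, p, c1):
    gc = [(i, j) for (i, j) in arcs if b[j] < b[i] + p[i] + c1[(i, j)]]
    parent = {j: i for (i, j) in gc}          # GC is a forest: one parent max
    has_child = {i for (i, _) in gc}
    leaves = [j for (_, j) in gc if j not in has_child]
    seqs = []
    for leaf in leaves:
        path, node = [leaf], leaf
        while node in parent:                 # climb up to the root
            node = parent[node]
            path.append(node)
        seqs.append(path[::-1])               # root ... leaf
    return gc, seqs

# Toy inputs: arc (1, 4) leaves enough slack, arcs (1, 2) and (2, 3) do not.
arcs = [(1, 2), (2, 3), (1, 4)]
b = {1: 0, 2: 1, 3: 2, 4: 5}
p = {1: 1, 2: 1}
c1 = {(1, 2): 1, (2, 3): 1, (1, 4): 1}
# -> GC = [(1, 2), (2, 3)] and one critical sequence [1, 2, 3]
```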
Scheduling Tasks and Communications on a Hierarchical System
95
GSC has one node ss for each critical sequence scs of CC computed during the previous step. Also, there is one edge (ss, st) or (st, ss) in GSC if ∃(i, j) ∈ E, with i ∈ scs, i ∉ sct, and j ∈ sct, such that bj < bi + pi + ci,j(2). This edge means that there is not enough time to transmit one message from at least one task i of scs to another task j of sct between two clusters. So scs and sct must be processed in the same cluster. The computation is done in procedure 2lVdsCc() (cf. Algorithm 3).

Algorithm 3. procedure 2lVdsCc(V, E, p, c, B, CC)
  GSC = ∅
  for all critical sequences scs ∈ CC do
    let ss be the new node related to scs
  end for
  for all nodes ss do
    GSC = GSC ∪ {ss}
    for all nodes st ∈ GSC − {ss} do
      if there is no edge between ss and st in GSC and there is at least one arc (i, j) of E with i ∈ scs, i ∉ sct and j ∈ sct, such that bj < bi + pi + ci,j(2) then
        add one edge between ss and st to GSC
      end if
    end for
  end for
  compute the connected components gs of GSC
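The grouping of critical sequences into clusters can be sketched with a small union-find. All inputs below are illustrative; `c2[(i, j)]` stands in for the inter-cluster delay c_{i,j}(2).

```python
# Sketch of procedure 2lVdsCc(): one node per critical sequence, an edge
# whenever an inter-sequence arc violates the inter-cluster bound (2),
# then connected components via union-find.
def sequence_components(seqs, arcs, b, p, c2):
    parent = list(range(len(seqs)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x
    for s, scs in enumerate(seqs):
        for t, sct in enumerate(seqs):
            if s == t:
                continue
            if any(i in scs and i not in sct and j in sct
                   and b[j] < b[i] + p[i] + c2[(i, j)]
                   for (i, j) in arcs):
                parent[find(s)] = find(t)     # same cluster for both
    comps = {}
    for s in range(len(seqs)):
        comps.setdefault(find(s), []).append(s)
    return list(comps.values())

seqs = [[1, 2], [3, 4], [5]]
arcs = [(2, 3)]                   # the only arc between sequences
b, p = {2: 0, 3: 1}, {2: 1}
c2 = {(2, 3): 2}                  # b_3 < b_2 + p_2 + c_{2,3}(2): edge
# -> sequences 0 and 1 share a cluster; sequence 2 gets its own
```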
Fig. 3 shows the six critical sequences sc1 to sc6 found for the DAG of Fig. 2 using the computed earliest start dates in Table 1. It also shows the graph of the critical sequences and its two connected components.

Computing the Solution. The last step of 2lVdsOpt builds a solution with minimal makespan using all the data computed in the preceding phases. One cluster is allocated to each connected component, and one processor of this cluster is allocated to each critical sequence of this connected component. One copy of each task of each critical sequence is executed at its earliest start date. All messages are sent as soon as the sending copy of the task finishes its execution. The computation is done in procedure 2lVdsBuild() (cf. Algorithm 4). Fig. 4 shows the Gantt chart of the final schedule found for the DAG of Fig. 2. Two clusters, each with three processors, are used. Tasks 1, 2, 9 and 10 have two copies each in this schedule.

2.3 Analysis of the Algorithm
Let n be the number of tasks and m be the number of arcs. The complexity of procedure 2lVdsLwb() is O(max(m, n)), and the complexity of procedure 2lVdsCs() is O(m). The complexity of building the graph of the critical sequences in 2lVdsCc() is O(n) [5], and of computing its connected components is O(n). Thus the complexity of 2lVdsCc() is O(n) too.
Fig. 3. The six critical sequences sc1 to sc6 in the critical graph GC of the DAG in Fig. 2 (left), and the graph GSC of these critical sequences with the two resulting connected components (right)
Algorithm 4. procedure 2lVdsBuild(V, E, p, c, B, CC, GSC)
  for all connected components gc ∈ GSC do
    allocate a new cluster Πc to gc
    for all nodes ss ∈ gc do
      let scs be the critical sequence related to node ss
      allocate a new processor πs in cluster Πc to this critical sequence scs
      for all tasks i ∈ scs do
        F(i) = F(i) + 1, tc(iF(i)) = bi, π(iF(i)) = πs
      end for
    end for
  end for
  for all copies jl of task j do
    let π(jl) be the processor that executes jl
    for all tasks i ∈ PRED(j) do
      if there is no copy of task i on π(jl) and π(jl) does not already receive one message from any copy of i on time for copy jl then
        remove any message m from any copy of i to processor π(jl)
        find one copy ik that can send its message on time to jl
        send one message m(ik, jl) from copy ik at date bi + pi to processor π(jl)
      end if
    end for
  end for
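The first half of procedure 2lVdsBuild() amounts to a straightforward mapping. This sketch uses hypothetical component/sequence nesting and b values for illustration; message routing (the second half of the procedure) is omitted.

```python
# One cluster per connected component, one processor per critical
# sequence, one copy of each task started at its earliest start date b_i.
def build_schedule(components, b):
    schedule = {}         # (cluster, processor) -> [(task, start date), ...]
    for cl, comp in enumerate(components):
        for proc, seq in enumerate(comp):
            schedule[(cl, proc)] = [(i, b[i]) for i in seq]
    return schedule

components = [[[1, 2]], [[3]]]   # two components, hence two clusters
b = {1: 0, 2: 2, 3: 1}
# build_schedule(components, b) ==
#   {(0, 0): [(1, 0), (2, 2)], (1, 0): [(3, 1)]}
```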
Fig. 4. Gantt chart of the solution of the DAG of Fig. 2, using two clusters Π1 and Π2 , with three processors per cluster (π1 , π2 and π3 in Π1 , and π4 , π5 and π6 in Π2 )
Using a graph-level approach, one can show that the complexity of the first part of 2lVdsBuild() is O(n²). Because the second part of 2lVdsBuild() tries, in the worst case, to find one suitable copy of each predecessor for each copy of each task, it is possible to establish that the complexity of this second part is O(m²n²). The complexity of procedure 2lVdsBuild() is then O(m²n²). So the complexity of the overall algorithm is O(m²n²). Also, we have the following theorems.

Theorem 1. The solution built by 2lVdsOpt has minimal makespan.

Theorem 2. At least one copy of each task is executed.

Theorem 3. The GPC are true for all copies of all tasks.

Theorem 4. In the solution computed, each copy of each task receives at least one message on time from at least one copy of each of its predecessors, if a message is needed.

Theorem 5. There is no message contention on any unidirectional channel.
3 Conclusion
A Directed Acyclic Graph of tasks with small communication delays had to be scheduled on the identical parallel processors of several clusters connected by a hierarchical network. The number of processors and of clusters was not limited. Message contention had to be avoided. Task duplication was allowed. We presented a new polynomial algorithm that computes the earliest start dates of tasks and spreads these tasks to use few processors per cluster. It also schedules the communications so that there is no message contention and messages are always delivered on time.
References

1. Bampis, E., Giroudeau, R., König, J.-C.: Using Duplication for Multiprocessor Scheduling Problem with Hierarchical Communications. Parallel Processing Letters 10(1), 133–140 (2000)
2. Beaumont, O., Boudet, V., Robert, Y.: A Realistic Model and an Efficient Heuristic for Scheduling with Heterogeneous Processors. In: 11th Heterogeneous Computing Workshop (HCW 2002). IEEE Computer Society Press, Los Alamitos (2002)
3. Bittencourt, L.F., Sakellariou, R., Madeira, E.R.M.: DAG Scheduling Using a Lookahead Variant of the Heterogeneous Earliest Finish Time Algorithm. In: 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing (PDP 2010), Pisa, Italy (2010)
4. Bozdag, D., Ozguner, F., Catalyurek, U.V.: Compaction of Schedules and a Two-Stage Approach for Duplication-Based DAG Scheduling. IEEE Transactions on Parallel and Distributed Systems 20(6), 857–871 (2009)
5. Colin, J.-Y., Chrétienne, P.: Scheduling with Small Communication Delays and Task Duplication. Operations Research 39(4), 680–684 (1991)
6. Colin, J.-Y., Colin, P.: Scheduling Tasks and Communications on a Virtual Distributed System. European Journal of Operational Research 94(2) (1996)
7. Colin, J.-Y., Nakechbandi, M.: Scheduling Tasks with Communication Delays on 2-Levels Virtual Distributed Systems. In: Proceedings of the 7th Euromicro Workshop on Parallel and Distributed Processing (PDP 1999), Funchal, Portugal (1999)
8. Garey, M., Johnson, D.: Computers and Intractability, a Guide to the Theory of NP-Completeness. Freeman, New York (1979)
9. Giroudeau, R., König, J.-C.: Scheduling with Communication Delay. In: Multiprocessor Scheduling: Theory and Applications, pp. 1–26. ARS Publishing (2007)
10. Hopcroft, J., Tarjan, R.: Efficient Algorithms for Graph Manipulation. Communications of the ACM 16, 372–378 (1973)
11. Kalinowski, T., Kort, I., Trystram, D.: List Scheduling of General Task Graphs under LogP. Parallel Computing 26, 1109–1128 (2000)
12. Kruatrachue, B., Lewis, T.G.: Grain Size Determination for Parallel Processing. IEEE Software 5(1), 23–32 (1988)
13. Kwok, Y.-K., Ahmad, I.: Static Scheduling Algorithms for Allocating Directed Task Graphs to Multi-Processors. ACM Computing Surveys (CSUR) 31(4), 406–471 (1999)
14. Marchal, L., Rehn, V., Robert, Y., Vivien, F.: Scheduling Algorithms for Data Redistribution and Load-Balancing on Master-Slave Platforms. Parallel Processing Letters 17(1), 61–77 (2007)
15. Norman, M.G., Pelagatti, S., Thanisch, P.: On the Complexity of Scheduling with Communication Delay and Contention. Parallel Processing Letters 5(3), 331–341 (1995)
16. Papadimitriou, C.H., Yannakakis, M.: Toward an Architecture-Independent Analysis of Parallel Algorithms. In: Proceedings of the 20th Annual ACM Symposium on Theory of Computing, Santa Clara, California, USA (1988)
17. Rayward-Smith, V.J.: Scheduling with Unit Interprocessor Communication Delays. Discrete Math. 18, 55–71 (1987)
18. Sarkar, V.: Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. MIT Press, Cambridge (1989)
19. Sinnen, O., Sousa, L.: Communication Contention in Task Scheduling. IEEE Transactions on Parallel and Distributed Systems 16(6), 503–515 (2005)
20. Sinnen, O., To, A., Kaur, M.: Contention-Aware Scheduling with Task Duplication. Journal of Parallel and Distributed Computing 71(1), 77–86 (2011)
21. Tam, A., Wang, C.L.: Contention-Aware Communication Schedule for High Speed Communication. Cluster Computing 6(4), 339–353 (2003)
Spiking Neural P System Simulations on a High Performance GPU Platform

Francis George Cabarle¹, Henry Adorna¹, Miguel A. Martínez-del-Amor², and Mario J. Pérez-Jiménez²

¹ Algorithms & Complexity Lab, Department of Computer Science, University of the Philippines Diliman, Diliman 1101 Quezon City, Philippines
[email protected], [email protected]
² Research Group on Natural Computing, Department of Computer Science and Artificial Intelligence, University of Seville, Avda. Reina Mercedes s/n, 41012 Sevilla, Spain
{mdelamor,marper}@us.es
Abstract. In this paper we present our results in adapting a Spiking Neural P system (SNP system) simulator to a high performance graphics processing unit (GPU) platform. In particular, we extend our simulations to larger and more complex SNP systems using an NVIDIA Tesla C1060 GPU. The C1060 is manufactured for high performance computing and massively parallel computations, matching the maximally parallel nature of SNP systems. Using our GPU accelerated simulations we present speedups of around 200× for some SNP systems, compared to CPU only simulations. Keywords: Membrane computing, Spiking Neural P systems, GPU computing, CUDA, parallel computing.
1 Introduction
P systems are by nature distributed, parallel, and non-deterministic computing models defined within Membrane computing, which is a research area initiated by Gheorghe Păun in 1998 [16]. The objective, as with other disciplines of Natural computing (e.g. DNA/molecular computing, quantum computing, etc.), is to obtain inspiration from the way nature computes to provide efficient solutions to the limitations of conventional models of computation, e.g. a Turing machine. Membrane computing can be thought of as an extension of DNA or molecular computing, zooming out from the individual molecules of the DNA and including other parts and sections of living cells in the computation, introducing the concept of distributed computing as well [16]. P systems are abstractions of the compartmentalized structure and parallel processing of biochemical information in biological cells. There are several P system variants defined in the literature, each one based on the abstraction of different aspects (or ingredients) of cells, and many of them have been proven to be computationally complete [5]. There are three general classifications of P systems considering the level of abstraction: cell-like (a rooted tree where the skin or outermost cell membrane is the root), tissue-like (a graph connecting the cell membranes) and neural-like (a directed graph, inspired by neurons interconnected by their axons and synapses). The last type refers to Spiking Neural P systems (in short, SNP systems), where the time difference (when neurons fire and/or spike) plays an essential role in the computations [11].

Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 99–108, 2011. © Springer-Verlag Berlin Heidelberg 2011

An interesting result of P systems is that they are able to solve computationally hard problems (e.g. NP-complete problems) usually in polynomial, often linear time, but usually requiring exponential space as trade-off [16]. Due to the nature of P systems, they are yet to be fully implemented in vivo, in vitro, or even in silico. Thus, practical computations of P systems are driven by silicon-based simulators. There are several simulators for P systems implemented over different software and hardware technologies [7]. In practice, P system simulations are limited by the physical laws of silicon architectures, which are often inefficient or not suitable when dealing with P system features, such as massive parallelism. However, in order to improve the efficiency of the simulators, it is necessary to exploit current technologies, leading to solutions in the area of High Performance Computing (HPC), such as accelerators or many-core processors. In this respect, Graphics Processing Units (GPUs) have been consolidated as accelerators thanks to their throughput-oriented and highly parallel architecture [9]. Several simulators for P systems have been developed over highly parallel platforms, including reconfigurable hardware as in FPGAs [14], CPU-based clusters [6], as well as in NVIDIA Corporation's Compute Unified Device Architecture (CUDA) enabled GPUs [4,3].
These efforts show that parallel devices are very suitable for accelerating the simulation of P systems, at least for transition and active-membrane P systems [3,4]. Efficiently simulating a Spiking Neural P (SNP) system, the P system variant of interest in this work, thus requires new efforts in parallel computing. Since SNP systems have already been represented as matrices due to their graph-like properties [18], simulating them on parallel devices such as GPUs is the next natural step. Matrix algorithms are well known in the parallel computing literature, including on GPUs [8], due to the highly parallelizable nature of linear algebra computations, which map directly to the data-parallel GPU architecture. An SNP system simulator using CUDA was presented in [1] and [2]. These previous works, however, were executed on the GPUs of workstations only, hence we intend to do better. We adapt and analyse the performance of this simulator on a high-end NVIDIA Tesla C1060 GPU, designed from the ground up for parallel computing and HPC, by simulating SNP systems of different sizes. A final simulator for SNP systems using CUDA would allow designers to check their models, and to perform other complex computations such as computing backwards. This paper is organized as follows: Sections 2 and 3 provide backgrounds on CUDA and SNP systems, respectively. The design of the simulator and the simulation results are given in Sections 4 and 5, respectively.
2 GPU Computing and NVIDIA CUDA
As many-core based platforms, GPUs are massively data-parallel processors which have high chip scalability in terms of processing units (cores, threads), and high bandwidth to internal GPU memories. The architectural difference between CPUs and GPUs is the reason why the latter offer a larger performance increase over CPU-only implementations of parallel code working on large amounts of input data [12]. The main advantages of using GPUs are their low cost, low maintenance and low power consumption relative to conventional parallel clusters and setups, while providing comparable or improved computational power [10]. For example, the latest GPUs of NVIDIA with 512 cores are readily available at consumer electronics stores for around $500. GPUs can be programmed using a framework introduced by NVIDIA in 2007 called CUDA [12]. CUDA is a programming model and hardware architecture for general purpose computations on NVIDIA's GPUs [12]. The programmer can use CUDA free of charge (including the compiler, driver, SDK, libraries, etc.), and it is easy to learn because it is an extension of the C language. CUDA implements a heterogeneous computing architecture, where two different parts are often considered: the host (CPU side) and the device (GPU side). The host part of the code is responsible for controlling the program execution flow, transferring data to and from the device memory, and executing specific codes, called kernel functions, on the device. The device acts as a parallel coprocessor to the host. The host outsources the parallel part of the program as well as the data to the device, since it is more suited to parallel computations than the host. The kernel code is executed in the device by a set of threads.
They are organized into a three-level hierarchy, from highest to lowest: a grid of thread blocks, blocks of threads, and threads. Threads within a block can share data through shared memory and can perform simple barrier synchronization [12,15]. Using kernel functions, the programmer can specify the GPU resources: up to 65,535 blocks and up to 512 threads per block.
3 Spiking Neural P Systems
Now we first formally define SNP systems as computing models. An SNP system without delay, of degree m ≥ 1, is of the form Π = (O, σ1, ..., σm, syn, in, out), where:

1. O = {a} is the alphabet made up of only one object a, called spike;
2. σ1, ..., σm are m neurons of the form σi = (ni, Ri), 1 ≤ i ≤ m, where:
   (a) ni ≥ 0 gives the initial number of spikes (a) contained in neuron σi;
   (b) Ri is a finite set of rules of the following forms:
102
F.G. Cabarle et al.
   (b-1) E/a^c → a^p are Spiking rules, where E is a regular expression over a, c ≥ 1, and p ≥ 1 spikes are produced (with the restriction c ≥ p) and transmitted to each neuron adjacent to the originating neuron σi, provided a^c ∈ L(E); a^k → a^p is a special case of (b-1) where L(E) = {a^c}, k = c, p = 1;
   (b-2) a^s → λ are Forgetting rules, for s ≥ 1, such that for each rule E/a^c → a^p of type (b-1) from Ri, a^s ∉ L(E);
3. syn = {(i, j) | 1 ≤ i, j ≤ m, i ≠ j} are the synapses, i.e. the connections between neurons;
4. in, out ∈ {1, 2, ..., m} are the input and output neurons, respectively.

The system works as follows. At any given time, a neuron σi uses exactly one rule, if and only if the condition a^c ∈ L(E) is met. This condition means that as long as the multiplicity of spikes is in the language generated by the regular expression E, a rule (or several of them) is (are) applicable. The rule to be used or applied is chosen non-deterministically. If a spiking rule is used, c spikes are consumed in σi after rule application, producing p spikes sent to every σj such that (i, j) ∈ syn. If a forgetting rule is applied, s spikes a are removed from σi and no spike is produced. A global clock is followed by the system. Parallelism is at the system level, although each neuron works sequentially.
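The applicability condition a^c ∈ L(E) can be illustrated with an ordinary regular-expression check. This is a sketch only: the expression `a(aa)*` below is a hypothetical E covering odd numbers of spikes, not a rule of any system in this paper.

```python
import re

# A spiking rule E/a^c -> a^p is applicable to a neuron holding n spikes
# when a^n belongs to L(E) (and at least c spikes are available to consume).
def applicable(E, n, c):
    return bool(re.fullmatch(E, "a" * n)) and n >= c

# E = a(aa)*: any odd number of spikes.
# applicable("a(aa)*", 3, 1) is True; applicable("a(aa)*", 4, 1) is False
```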
Fig. 1. Π1 generates the set N - {1}. Π1 outputs are the time differences between the first spike of σ3 and its succeeding spikes. A total ordering of the neurons is seen (σ1 to σ3 ) including a total ordering of the rules (1 to 5).
We designate the SNP system shown in Figure 1 as Π1 [18]. For our simulations we use two additional systems: Figure 8 in [11] and Figure 14 in [11], which we designate as Π2 and Π3 respectively. Next we present the matrix representation of an SNP system and its computations. This representation makes use of the following vector and matrix definitions:

Configuration vector. Ck is the vector containing the number of spikes in every σ at the kth computation step/time. C0 is the initial Ck of the system.

Spiking vector. Sk shows, at a given Ck, whether a rule is applicable (having value 1) or not (having value 0 instead).

Spiking transition matrix. MSNP is a matrix comprised of elements aij, where aij is given as: −c if rule ri is in σj and is applied, consuming c spikes; p if rule
ri is in σs (s ≠ j and (s, j) ∈ syn) and is applied, producing p spikes in total; 0 if rule ri is in σs (s ≠ j and (s, j) ∉ syn). The spiking transition matrix MΠ1 is shown in equation (1).

        ⎛ −1  1  1 ⎞
        ⎜ −2  1  1 ⎟
MΠ1 =   ⎜  1 −1  1 ⎟        (1)
        ⎜  0  0 −1 ⎟
        ⎝  0  0 −2 ⎠

Equation (2) provides the configuration vector at the (k + 1)th step:

Ck+1 = Ck + Sk · MΠ        (2)
For Π1, C0 = <2, 1, 1>, and we have S0 = <1, 0, 1, 1, 0> given this C0. Note that a second alternative, S0 = <0, 1, 1, 1, 0>, is possible if we use rule (2) instead of rule (1) (but not both at the same time). Validity in this case means that only one among several applicable rules is used and is thus represented in the Sk. The C0, S0 for Π2 and Π3 can be similarly shown.
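One computation step of equation (2) is easy to reproduce by hand, using the Π1 matrix of (1) together with C0 and the first S0 above:

```python
# C_{k+1} = C_k + S_k · M for the Π1 example.
M = [[-1,  1,  1],   # rule 1 (in σ1)
     [-2,  1,  1],   # rule 2 (in σ1)
     [ 1, -1,  1],   # rule 3 (in σ2)
     [ 0,  0, -1],   # rule 4 (in σ3)
     [ 0,  0, -2]]   # rule 5 (in σ3)
C0 = [2, 1, 1]
S0 = [1, 0, 1, 1, 0]

C1 = [C0[j] + sum(S0[i] * M[i][j] for i in range(len(S0)))
      for j in range(len(C0))]
# C1 == [2, 1, 2]
```

Applying rules 1, 3 and 4 leaves σ1 and σ2 unchanged on balance, while σ3 consumes one spike and receives one from each of σ1 and σ2, ending with 2.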
4 Parallel SNP System Simulation on GPU
We designate the improved SNP system simulator in this paper as snpgpu-sim4, an update to snpgpu-sim3 produced in [2]. Among the improvements of snpgpu-sim4 over snpgpu-sim3 are the use of multiple thread blocks to accommodate matrices with more than 512 elements, and a more streamlined part of the simulation code for handling the relationships between Ri, Ck, and Sk. This section will further expound on these, among other things. The simulator takes in 3 inputs: Mf, C0f, and Rf, which are the file counterparts of M, C0, and Ri, respectively. Skf is the file counterpart of Sk, which is produced by the simulator itself once it is run. PyCUDA was used in addition to conventional Python and CUDA C languages. PyCUDA is a Python wrapper for NVIDIA CUDA C and C++, enabling programmers to create GPU software using Python, and has been used for high performance computing [13]. The inputs are text files with delimiters between one rule and another in a σ, and between the σs themselves. The elements of M are entered into the file in row-major order, and are mapped onto each thread of a thread block within the block grid, as shown in Figure 3. Figure 2 shows an instance of host-device interaction. The host functions sequentially and calls the kernel function/s. The device is split up into a grid of thread blocks, each with their own threads, which operate on the data in a single program, multiple data (SPMD) programming style [12]. The simulation algorithm is shown in Algorithm 1, which also indicates where a specific part of the simulation runs (either host or device). Part I loads the 3 initial inputs and the succeeding inputs from their file counterparts, checking for formatting and pre-processing them for Part II. Part II, from Part I's outputs and from
Fig. 2. Diagram showing a single run of the simulation flow. The host runs sequentially while the device is made up of a grid of thread blocks, each with their own threads operating in parallel.

Algorithm 1. Overview of the SNP system simulation algorithm
Require: Input files: Ck, M, r.
I. (HOST) Load input files. Mf and Rf are loaded once only. C0f is also loaded once, then the Ckf s afterwards.
II. (HOST) Determine if a rule in Rf is applicable based on the number of spikes present in each σ seen in Ckf. Then generate all valid and possible spiking vectors in a list of lists Skf.
III. (DEVICE) Run the kernel function on all valid and possible Skf s from the current Ckf. Produce the file counterparts of the next configurations Ck+1 and their corresponding Skf s.
IV. (HOST+DEVICE) Repeat steps I to IV until at least one of the two stopping criteria is encountered.
Ckf and Rf, produces all the valid and possible Skf s. Part II produces all valid and possible Skf files as follows. For each ni of σi, the {1,0} strings are produced on a per-neuron level. For example, for Π1 we have n1 = 2 for σ1. We thus have for σ1 the strings '10' (choose to use R1 instead of R2) and '01' (choose to use R2 over R1). We only have one string for σ2, the string '1', since σ2 has only one rule and it is readily applicable. Neuron σ3 also produces only one string, '10', since only one rule is applicable given its n3 = 1. Only R4 is used in σ3 and not R5. Once all the neuron-level {1,0} strings are produced, the strings are exhaustively paired up with the strings of the other σs from left to right, as the ordering is important. The output of Part II in this example, given Ck = <2, 1, 1>, is therefore (1,0,1,1,0) and (0,1,1,1,0). Elements of the input files are treated as strings up to this point, because of the concatenation and regular expression checking processes, among others. Part III now treats the input elements as integral values. Equation (2) is performed in parallel such that each thread is either adding or multiplying a (matrix or vector) element. Once the Ck+1 are produced by the device, the results are
moved back to the host. Part IV then checks whether to proceed or to stop based on 2 stopping criteria for the simulation: (I) if a zero vector (vector of zeros) is encountered; (II) if the succeeding Ck s have all been produced in previous computations. Both (I) and (II) make sure that the simulation halts and does not enter an infinite loop.
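The exhaustive pairing of Part II is essentially a Cartesian product over the per-neuron choices. A sketch, where `per_neuron` encodes the Π1 example at Ck = <2, 1, 1> described above:

```python
from itertools import product

per_neuron = [["10", "01"],  # σ1: use rule 1 or rule 2, never both
              ["1"],         # σ2: its single rule is applicable
              ["10"]]        # σ3: only rule 4 applies, not rule 5

# Left-to-right concatenation of one choice per neuron; ordering matters.
spiking_vectors = ["".join(choice) for choice in product(*per_neuron)]
# spiking_vectors == ["10110", "01110"], i.e. (1,0,1,1,0) and (0,1,1,1,0)
```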
Fig. 3. Different representations of a given matrix X: (a) original matrix form (b) linear array in row-major order form, (c) using CUDA thread blocks in a single thread block grid. The linear array shows how the array’s elements are laid out: a (4 × 4) grid made up of (2 × 2) thread blocks. Each thread in a thread block computes a unique element of the array in parallel, and all of them execute the same kernel function.
5 Simulation Results and Observations
The simulations in this paper were executed using an Intel Xeon E5504 quad-core CPU running at 2 GHz per core (there are two of these CPUs, so there are effectively 8 cores). Each core has a 4MB cache. The GPU is an NVIDIA Tesla C1060 high performance GPU with 240 streaming-processor (SP) cores organized as 30 streaming multiprocessors (SMs), and has 4GB of memory for storing data used by the kernel functions. A 64-bit Ubuntu 10.04 Linux operating system was used to host the simulations. A sequential, i.e. CPU-only, version of snpgpu-sim4 was created and compared to snpgpu-sim4. We designate this CPU-only simulator as snpcpu-sim4. snpcpu-sim4 is identical to snpgpu-sim4 except for the computation of equation (2). Figure 4 shows the running times of the simulators with Π1 as the SNP system. The run times per SNP system are shown using three different time measurements: the real time, user time, and sys time, taken using the Ubuntu Linux command time, based on the Unix command of the same name. The real time is the time that has elapsed during the run of the program (a 'wall clock' time measurement). The user time is the time spent by the program running in the CPU while in user mode. The sys time is the total CPU time used by the OS on behalf of the program that is being measured, while the process is in kernel mode. A program or process in kernel mode can use system calls or services such as allocating memory for itself, including hardware access (a more privileged
execution mode), while being in user mode means the program is usually restricted to its initial resources only (a less privileged execution mode) [17]. In Figure 4 we see the large improvement of snpgpu-sim4 over snpcpu-sim4, as expected. Also as expected, snpcpu-sim4 used up more time from the CPU, as seen in the real and sys times. It is worth mentioning that snpgpu-sim4 used a bit more of the CPU in the user times (though still significantly less than snpcpu-sim4) because snpgpu-sim4 still needed some work from the CPU to process the inputs. Another noteworthy point is that in all three runtime figures (Figures 4 to 6) the user run time is far less than the other two time measurements, because it only measures the time used by the program alone in the CPU, and no other programs are involved in the time measurement. Table 1 summarizes the averages of the kernel function runtimes and of the CPU counterparts of the kernel functions, as well as the average speedups. The maximum size, in terms of the number of neurons (Cknum) and rules (Rnum) of a system, that the current setup can simulate is given by Cknum = 4 GBytes/(16 Bytes + 4 Bytes × Rnum).
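The capacity bound above is a one-liner. A sketch, assuming integer division since a fractional neuron count is meaningless:

```python
# Ck_num = device memory / (16 bytes per neuron + 4 bytes per rule).
def max_neurons(mem_bytes, num_rules):
    return mem_bytes // (16 + 4 * num_rules)

# e.g. a toy device with 100 bytes and 1 rule: 100 // 20 == 5 neurons;
# the Tesla C1060's 4 GB would be max_neurons(4 * 1024**3, num_rules).
```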
Fig. 4. Runtime graph of snpgpu-sim4 versus snpcpu-sim4 for Π1 showing (a) real, (b) user, and (c) sys times usage
Fig. 5. Runtime graph of snpgpu-sim4 versus snpcpu-sim4 for Π2 showing (a) real, (b) user, and (c) sys times usage
Fig. 6. Runtime graph of snpgpu-sim4 versus snpcpu-sim4 for Π3 showing (a) real, (b) user, and (c) sys times usage

Table 1. Summary of averages: kernel and CPU times, and speedup. All time measurements are in seconds, except for KRTA which is in microseconds. RTSA is Real Time Speedup Average, UTSA is User Time Speedup Average, STSA is System Time Speedup Average. KRTA is the Kernel Runtime Average, the amount of time the kernel function spent running inside the GPU/device. CRTA is the CPU Runtime Average, the amount of CPU time used by the CPU-only (i.e. sequential) counterpart of the kernel function.

     RTSA            UTSA           STSA            KRTA               CRTA
Π1   156.1439811343  3.5999180999   178.3754195194  107.33688871 μs    3.8535563
Π2   3.2014649226    0.9619771863   4.3513513514    216.442000587 μs   3.938559
Π3   67.0445847755   8.4018691589   192.8963174046  153.418998544 μs   3.9137748
Acknowledgments. Francis Cabarle is supported by the DOST-ERDT program. Henry Adorna is funded by the DOST-ERDT research grant and the Alexan professorial chair of the UP Diliman Department of Computer Science. M.A. Martínez-del-Amor and M.J. Pérez-Jiménez are supported by "Proyecto de Excelencia con Investigador de Reconocida Valía" of the "Junta de Andalucía" under grant P08-TIC04200, and by the project TIN2009–13192 of the "Ministerio de Educación y Ciencia" of Spain, both co-financed by FEDER funds.
References

1. Cabarle, F., Adorna, H., Martínez-del-Amor, M.A.: An Improved GPU Simulator For Spiking Neural P Systems. Accepted in the IEEE Sixth International Conference on Bio-Inspired Computing: Theories and Applications, Penang, Malaysia (September 2011)
2. Cabarle, F., Adorna, H., Martínez-del-Amor, M.A.: A Spiking Neural P system simulator based on CUDA. Accepted in the Twelfth International Conference on Membrane Computing, Paris, France (August 2011)
3. Cecilia, J.M., García, J.M., Guerrero, G.D., Martínez-del-Amor, M.A., Pérez-Hurtado, I., Pérez-Jiménez, M.J.: Simulating a P system based efficient solution to SAT by using GPUs. Journal of Logic and Algebraic Programming 79(6), 317–325 (2010)
4. Cecilia, J.M., García, J.M., Guerrero, G.D., Martínez-del-Amor, M.A., Pérez-Hurtado, I., Pérez-Jiménez, M.J.: Simulation of P systems with active membranes on CUDA. Briefings in Bioinformatics 11(3), 313–322 (2010)
5. Chen, H., Ionescu, M., Ishdorj, T.-O., Păun, A., Păun, G., Pérez-Jiménez, M.: Spiking neural P systems with extended rules: universality and languages. Natural Computing: an International Journal 7(2), 147–166 (2008)
6. Ciobanu, G., Wenyuan, G.: P Systems Running on a Cluster of Computers. In: Martín-Vide, C., Mauri, G., Păun, G., Rozenberg, G., Salomaa, A. (eds.) WMC 2003. LNCS, vol. 2933, pp. 123–139. Springer, Heidelberg (2004)
7. Díaz, D., Graciani, C., Gutiérrez, M.A., Pérez-Hurtado, I., Pérez-Jiménez, M.J.: Software for P systems. In: Păun, G., Rozenberg, G., Salomaa, A. (eds.) The Oxford Handbook of Membrane Computing, ch. 17, pp. 437–454. Oxford University Press, Oxford (2009)
8. Fatahalian, K., Sugerman, J., Hanrahan, P.: Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware (HWWS 2004), pp. 133–137. ACM, NY (2004)
9. Garland, M., Kirk, D.B.: Understanding throughput-oriented architectures. Communications of the ACM 53(11), 58–66 (2010)
10. Harris, M.: Mapping computational concepts to GPUs. In: ACM SIGGRAPH 2005 Courses, NY, USA (2005)
11. Ionescu, M., Păun, G., Yokomori, T.: Spiking Neural P Systems. Journal Fundamenta Informaticae 71(2,3), 279–308 (2006)
12. Kirk, D., Hwu, W.: Programming Massively Parallel Processors: A Hands On Approach, 1st edn. Morgan Kaufmann, MA (2010)
13. Klöckner, A., Pinto, N., Lee, Y., Catanzaro, B., Ivanov, P., Fasih, A.: PyCUDA: GPU Run-Time Code Generation for High-Performance Computing. Scientific Computing Group, Brown University, RI, USA (2009)
14. Nguyen, V., Kearney, D., Gioiosa, G.: A Region-Oriented Hardware Implementation for Membrane Computing Applications and Its Integration into Reconfig-P. In: Păun, G., Pérez-Jiménez, M.J., Riscos-Núñez, A., Rozenberg, G., Salomaa, A. (eds.) WMC 2009. LNCS, vol. 5957, pp. 385–409. Springer, Heidelberg (2010)
15. NVIDIA Corporation: NVIDIA CUDA C programming guide, version 3.0. NVIDIA, CA, USA (2010)
16. Păun, G., Ciobanu, G., Pérez-Jiménez, M. (eds.): Applications of Membrane Computing. Natural Computing Series. Springer, Heidelberg (2006)
17. Stallings, W.: Operating systems: internals and design principles, 6th edn. Pearson/Prentice Hall, NJ, USA (2009)
18. Zeng, X., Adorna, H., Martínez-del-Amor, M.A., Pan, L., Pérez-Jiménez, M.: Matrix Representation of Spiking Neural P Systems. In: Gheorghe, M., Hinze, T., Păun, G., Rozenberg, G., Salomaa, A. (eds.) CMC 2010. LNCS, vol. 6501, pp. 377–391. Springer, Heidelberg (2010)
SpotMPI: A Framework for Auction-Based HPC Computing Using Amazon Spot Instances

Moussa Taifi, Justin Y. Shi, and Abdallah Khreishah

Temple University, Computer Science Department, Philadelphia, PA, USA {moussa.taifi,shi,akhreish}@temple.edu
Abstract. The economy of scale offers cloud computing virtually unlimited, cost-effective processing potential. Theoretically, prices under fair market conditions should reflect the most reasonable costs of computation. Fairness is ensured by the mutual agreement between sellers and buyers, and resource-use efficiency is automatically optimized in the process. While cloud providers have ample incentive to offer auction-based computing platforms, using these volatile platforms for practical computing is a challenge for existing programming paradigms. This paper reports a methodology and a toolkit designed to tame these challenges for MPI applications. Unlike existing MPI fault-tolerance tools, we emphasize dynamically adjusted optimal checkpoint-restart (CPR) intervals. We introduce a formal model, then an HPC application toolkit, named SpotMPI, to facilitate the practical execution of real MPI applications on volatile auction-based cloud platforms. Our models capture the intrinsic dependencies between critical time-consuming elements by leveraging instrumented performance parameters and publicly available resource bidding histories. We study algorithms with different computing vs. communication complexities. Our results show non-trivial insights into optimal bidding and application scaling strategies.
1 Introduction
The economy of scale affords cloud computing extreme cost-effectiveness potential. While it is in general difficult to assess the real cost of a computing task, the auction-based provisioning scheme offers a fair pricing structure. Theoretically, prices under fair market conditions reflect the most reasonable costs of computation. Fairness is ensured by the mutual agreement between sellers and buyers. From the consumer's perspective, high performance computing (HPC) applications are the biggest potential beneficiaries, since their infrastructure costs are the most expensive. From the seller's perspective, HPC applications represent the most reliable income stream, since they are the most resource-intensive users. Theoretically, resource usage efficiency is automatically maximized under auction-based provisioning schemes. Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 109–120, 2011. © Springer-Verlag Berlin Heidelberg 2011
110
M. Taifi, J.Y. Shi, and A. Khreishah
Traditional HPC applications are typically optimized for hardware features to obtain processing efficiency. Since transient component errors can halt an entire application, it has become increasingly important to create autonomic applications that automate checkpointing and restarting with little loss of useful work. Although existing HPC applications are not suitable for volatile computing environments, with an automated checkpoint-restart (CPR) HPC toolkit it is plausible that practical HPC applications could gain additional cost advantages from auction-based resources through dynamically minimized CPR overheads. We first establish models for estimating the running time of an HPC application using auction-based cloud resources. The proposed models take into account the time complexities of the HPC application, the overheads of checkpoint-restart, and the publicly available resource bidding history. We seek to unravel the interdependencies between an application's computing/communication complexities, the number of required processors, bidding prices, and the eventual processing costs. We then introduce the SpotMPI toolkit and show how it automates MPI application processing using volatile resources under the guidance of the formal models. Applying our models to recent bidding histories of Amazon EC2 HPC resources, we report preliminary results for two HPC application types with different computing and communication complexities.
2 Background

2.1 Auction-Based Computing: Spot Instances
Amazon is one of the first cloud computing vendors to provide at least two types of cloud instances: on-demand instances and spot instances. An on-demand instance has a fixed price. Once ordered, it provides services according to Amazon's Service Level Agreement (SLA). A spot instance is a type of resource whose availability is controlled by the current bidding price and the auction market. There are three special features of Amazon's spot instance pricing policy:
– A successful bid does not guarantee exclusive resource access for the entire requested duration. The Amazon engine can terminate access at any time if a higher bid is received.
– Amazon does not charge a partial hour (a job terminated before reaching the hour boundary) if the termination was caused by out-bidding. Otherwise, the partial hour is charged in full if the user terminates the job.
– Amazon only charges the user the highest market price that is less than the user's successful bid.
We have chosen two types of Amazon EC2 HPC resources for this study. The cc1.4xlarge and the cg1.4xlarge are cluster HPC instances that provide cluster-level performance (23 GB of memory, 8 cores, 10 Gigabit Ethernet). The main difference is the presence of GPUs (2 x NVIDIA Tesla “Fermi” M2050) in the cg1.4xlarge, which provides more power for compute-intensive applications.
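The three pricing rules above determine what a spot session actually costs. A minimal sketch of the billing arithmetic (illustrative only — not Amazon's billing code; the function name and inputs are assumptions):

```python
# Illustrative restatement of the three spot-pricing rules (not Amazon's
# code; all names are assumptions). market_prices[i] is the market price
# observed at the start of hour i of the session.

def spot_charge(hours_used, market_prices, out_bid_terminated):
    full_hours = int(hours_used)
    # Rule 3: each full hour is billed at the market price, not the bid.
    charge = sum(market_prices[:full_hours])
    partial = hours_used - full_hours
    if partial > 0 and not out_bid_terminated:
        # Rule 2: a trailing partial hour is charged in full only when the
        # user terminates; an out-of-bid termination makes it free.
        charge += market_prices[full_hours]
    return charge

# Same 2.5 h session: user-terminated pays 3 hourly prices, out-bid pays 2.
assert abs(spot_charge(2.5, [0.52, 0.54, 0.53], False) - 1.59) < 1e-9
assert abs(spot_charge(2.5, [0.52, 0.54, 0.53], True) - 1.06) < 1e-9
```

The out-of-bid partial-hour rule is what makes aggressive bidding less risky than it first appears: a reclaimed instance never charges for its unfinished hour.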
SpotMPI: Auction-Based HPC Computing
111
Figure 1 records a sample market price history for the cc1.4xlarge instance type from May 5 to May 11, 2011. This instance type shows typical user behavior for legacy HPC applications. The cg1.4xlarge instance type illustrates resources for HPC applications that can benefit from GPU processing. Since many legacy HPC applications are not suitable for GPU processing, the cg1.4xlarge history is less interesting.
Fig. 1. Market Prices of cc1.4 Instance in May 2011
2.2 HPC in the Cloud
Although HPC applications are the biggest potential beneficiaries of cloud computing, except for a few simple applications there are still many practical concerns.
– Most mathematical libraries rely on optimized numerical codes that exploit common hardware features for extreme efficiency. Some hardware features, such as the hardware cache, are not mapped in virtual machines. Consequently, HPC applications suffer additional performance drawbacks on top of the normal virtualization overhead.
– Many HPC applications have high inter-processor communication demands. Current virtualized networks have difficulty meeting these demands.
– All existing HPC applications handle only two communication states: success and failure. While success is a reliable state, failure is not. Existing applications treat a timeout as identical to a failure. Consequently, any transient component failure can halt the entire application. Using volatile spot instances for these applications is therefore a serious challenge.
Initially, low-end cloud services provided little guarantee on deliverable performance for HPC applications. Recently, high-end cloud resources have been developed specifically for HPC applications, and these improvements have demonstrated promising features ([28], [25] and [17]). They also show the diminishing overhead of virtual machine monitors such as Xen [6]. Due to the severity of declining MTBFs, fault tolerance for MPI applications has also progressed. These developments inspired the design and development of SpotMPI.

2.3 Checkpoint-Restart (CPR) MPI Applications
Much research has been done to provide fault tolerance for MPI applications. FT-MPI [12] uses interactive process fault tolerance. Starfish [3] supports a number
of CPR protocols, and LA-MPI [13] provides message-level failure tolerance. Egida [21] experimented with a message-logging grammar specification as a means for fault tolerance. Cocheck [24] extends the Condor [18] scheduler to provide a set of fault tolerance methods that can be used by MPI applications. We chose OpenMPI's coordinated CPR because of its cluster-wide checkpoint advantage, since more fine-grained strategies will not work in this highly volatile environment ([16], [20] and [30]). The challenges of using volatile spot instances are not much different from those of regular clusters with crash failures. In fact, for performance analysis, spot instance out-of-bid failures can be modeled as random crash failures, since the Amazon cloud engine terminates out-bid instances without prior notice [2]. These out-of-bid failures force applications to adopt technologies that prevent excessive work loss due to frequent interruptions. Map-reduce applications [11] can easily adopt single-task CPR using spot instances [8]. A map-reduce application does not require inter-task communications, and parallel processing can be controlled externally to the individual tasks. Therefore, spot instances can be used as “accelerators” via a simple job monitor that tracks and restarts dead jobs automatically [9]. Other noticeable efforts using simulation to study spot instances include [5] and [26]. By simulating the behavior of a single instance under different bids, these works outlined the inherent tradeoff between completion time and budget. In [5], a decision model is proposed that describes a simulator able to determine, under a set of conditions, the total expected time of a single application. Another study [26] discussed a set of checkpoint strategies that can maximize the use of spot instances while minimizing costs. To the best of the authors' knowledge, there has been no direct evaluation of practical MPI applications on spot instances.
The volatile auction-based computing platform challenges the established HPC programming practices.
3 Evaluating MPI Applications Using an Auction-Based Platform
For HPC applications using a large number of processors, the CPR overhead is the biggest cost factor. Without CPR optimization, MPI applications are unlikely to gain practical acceptance on volatile auction-based platforms. We report a theoretical model based on application resource time complexities [22] and optimal CPR models ([10], [27]). In addition, we describe a toolkit named SpotMPI that supports autonomic MPI applications on spot instance clusters. This toolkit can monitor spot instances and bidding prices, automate checkpointing at bidding-price (and history) adjusted optimal intervals, and automatically restart applications after out-of-bid failures.
4 Theoretical Model
Auction prices vary dynamically depending on supply and demand in the Amazon marketplace. There are no guidelines from Amazon as to how the prices are set.
SpotMPI: Auction-Based HPC Computing
113
Table 1. Definition of Symbols and Variables for Modeling the Runtime

t0          Interval of application-wide checkpoint
α           Expected rate of out-of-bid failures
K0          Time needed to create and store a checkpoint
K1          Time needed to read and recover a checkpoint
K2          Average out-of-bid downtime
T           Estimated time needed to run the application with no checkpoints and no failures
E           Expected running time between checkpoints
ET          Expected total running time
noobbid_i   Total number of out-of-bid failures corresponding to bid_i over t_observed
t_observed  Total observed time
P           Number of processing units
W           Instrumented processor capacity in number of computational steps per second
u           Instrumented network capacity in bytes per second
N           Problem size
ns          Number of iterations
Tpar        Parallel processing time
Unlike other projects [29] that use autoregressive models to maximize the profit of a fictitious cloud provider, we focus on the intrinsic characteristics of users' applications and the bidding history. We are interested in the inherent dependencies between these characteristics and their impact on the optimal CPR interval – the largest cost factor for MPI applications running on a volatile platform.

4.1 Bid-Aware Optimal CPR Interval
We assume that the time between consecutive out-of-bid failures is exponentially distributed with rate α. This allows out-of-bid failures to be modeled the same way as component failures, but at different rates. Thus we can extend the previous works on optimal CPR intervals for distributed-memory applications. In this paper, we refer to the original CPR interval work by [27], which was extended by [10] and later adapted to MPI by [19]. We start our discussion using the same symbols. As in [19], we obtain the expected application running time with checkpoints and failures. Important assumptions are that an out-of-bid failure occurs at most once per checkpoint interval and that all failures are independent:

$E_T = \frac{T}{t_0}\left(K_0 + t_0 + \alpha\left(t_0 K_1 + \frac{t_0^2}{2}\right)\right)$

This leads to the optimal CPR interval ([19], [27] and [10]):

$t_0 = \sqrt{\frac{2K_0}{\alpha}}$  (1)

A crucial difference between stable clusters and spot instance clusters is that an out-of-bid failure forces an application downtime that is absent for component
failures. This means that the restart (taking time K1) cannot begin until the average downtime per out-of-bid failure, K2, has elapsed. The expected running time using spot instances becomes:

$E_T = \frac{T}{t_0}\left(K_0 + t_0 + \alpha\left(t_0 (K_1 + K_2) + \frac{t_0^2}{2}\right)\right)$
K2 can be obtained using the price history and the current bid. Given the pricing history and a current bid, we can also calculate the new density α of out-of-bid failures. Thus the optimal bid-aware CPR interval can be calculated as:

$t_0 = \sqrt{\frac{2K_0}{noobbid_i / t_{observed}}}$  (2)
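Equations (1) and (2) are straightforward to compute. A hedged Python sketch, where the failure rate is estimated by counting how often the historical market price exceeded the chosen bid (a simplification of the paper's noobbid_i event count; function names are assumptions):

```python
import math

# Bid-aware optimal CPR interval, equations (1)/(2) (sketch).

def out_of_bid_rate(price_history, bid, t_observed):
    """alpha = noobbid_i / t_observed: out-of-bid failures per second,
    estimated here by counting price samples that exceed the bid."""
    noobbid = sum(1 for price in price_history if price > bid)
    return noobbid / t_observed

def optimal_interval(k0, alpha):
    """Equation (2): t0 = sqrt(2*K0 / alpha). k0: checkpoint cost in seconds."""
    return math.sqrt(2.0 * k0 / alpha)

# Example: 60 s to write a checkpoint, one week of hourly-ish samples.
week = 7 * 24 * 3600.0
alpha = out_of_bid_rate([0.52, 0.55, 0.51, 0.58, 0.53, 0.56],
                        bid=0.54, t_observed=week)   # 3 out-of-bid events
t0 = optimal_interval(60.0, alpha)                   # roughly 82 minutes
```

A higher bid lowers α and therefore lengthens t0, which is exactly the tradeoff between checkpoint overhead and lost work that Figure 4 explores.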
5 SpotMPI Toolkit
SpotMPI is an HPC toolkit constructed using the OpenMPI coordinated CPR library [15], Starcluster [1], and the BLCR project [14]. The OpenMPI and BLCR libraries facilitate the execution of automatic CPR at optimal intervals. Starcluster facilitates the creation and management of HPC clusters using Amazon EC2 resources. The latest Starcluster also supports spot instances and allows Python plugins during the launch of the cluster. As shown in Figure 2, the SpotMPI toolkit integrates multiple complementary tools for auction-based HPC computing.
Fig. 2. SpotMPI environment
SpotMPI has four components: cluster monitor, CPR calculator, checkpoint executor, and restarter. These modules are initiated by the Starcluster script at spot instance cluster creation time. The cluster monitor pulls the status and
bidding prices of all instances continuously. The interactive bidding price and dynamic price history are used by the CPR calculator to generate the next optimal CPR interval. A composite timing model (next section) is responsible for estimating the total processing times. The checkpoint executor saves the state of the MPI application in the user's EBS volume at dynamically adjusted intervals. Any out-of-bid failure causes the application to halt. Upon a winning bid, the application automatically restarts from the last checkpoint using the OpenMPI restart library.
Fig. 3. SpotMPI architecture design
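The interplay of the four components can be sketched as a single polling step. This is purely illustrative: the function and key names are assumptions, and the real toolkit drives Starcluster, the EC2 price feed, and OpenMPI's checkpoint/restart facilities rather than an in-process state dict.

```python
import math

# One polling tick of the control loop implied by the four components
# (cluster monitor, CPR calculator, checkpoint executor, restarter).

def spotmpi_step(state, market_price, bid, k0, alpha):
    """Return the action for this tick: 'restart', 'checkpoint' or 'compute'."""
    if market_price > bid:
        # Monitor sees an out-of-bid event: the restarter waits for a
        # winning bid, then resumes from the last checkpoint on EBS.
        state["since_ckpt"] = 0.0
        return "restart"
    t0 = math.sqrt(2.0 * k0 / alpha)         # CPR calculator, equation (2)
    state["since_ckpt"] += state["poll"]
    if state["since_ckpt"] >= t0:
        state["since_ckpt"] = 0.0
        return "checkpoint"                  # executor writes to the EBS volume
    return "compute"

state = {"since_ckpt": 0.0, "poll": 60.0}    # poll the market every 60 s
alpha = 3 / (7 * 24 * 3600.0)                # e.g. three out-of-bid events/week
assert spotmpi_step(state, 0.60, 0.55, 60.0, alpha) == "restart"
```

With K0 = 60 s and that failure rate, t0 is about 4,900 s, so under a stable price the checkpoint executor fires roughly every 82 ticks.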
6 Computational Results
Steady-State Timing Model. To evaluate T, we need an estimate of the failure-free processing time. We use the steady-state timing model [22] to determine the required running time based on major component usage complexities. Table 1 shows the symbols used in the timing models. The general problem of assessing the processing time of a parallel application is difficult: there are too many hard-to-quantify factors. However, a steady-state timing model can capture the intrinsic dependencies between major time-consuming elements, such as computing, communication and input/output, by using instrumented capabilities like W and u. The idea is to eliminate the nonessential constant factors. Contrasting timing models can thus reveal non-trivial parallel processing insights [7]. In this paper, we chose to study two typical algorithm classes (Table 2) for spot instance computing. Timing models can in general be applied to all deterministic algorithm classes [22]. Evaluation of the Bid-Aware Optimal CPR Interval. We validate the bid-aware CPR interval against non-optimal intervals. Figure 4 visualizes the speedup behavior under different CPR intervals.
Table 2. Algorithm Classes A1 and A2

Class   (Compute, Communication) complexities   Timing model                 Sample application
A1      (O(n^2), O(n))                          Tpar = N^2/(P·W) + 16N/u     Molecular force simulation
A2      (O(n^3), O(n^2))                        Tpar = N^3/(P·W) + 16N^2/u   Linear solvers
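The two timing models of Table 2 translate directly into code. A sketch under the paper's constants (16 bytes communicated per element; W in steps/s, u in bytes/s; capacity values taken from Table 3):

```python
# Timing models of Table 2 (sketch): failure-free time per iteration.

def t_par_a1(n, p, w, u):
    """A1: O(n^2) compute, O(n) communication."""
    return n**2 / (p * w) + 16.0 * n / u

def t_par_a2(n, p, w, u):
    """A2: O(n^3) compute, O(n^2) communication."""
    return n**3 / (p * w) + 16.0 * n**2 / u

# Instrumented capacities of Table 3: W = 1.5e9 steps/s, u = 250 MB/s.
# Doubling P halves only the compute term, so the fixed communication
# term caps the speedup -- much sooner for A2 than for A1.
W, U = 1.5e9, 250e6
assert t_par_a1(1e5, 400, W, U) > 0.5 * t_par_a1(1e5, 200, W, U)
```

This is the Amdahl-style effect discussed with Figures 5 and 6: past a certain P, the communication term dominates and additional instances buy little speedup.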
This figure shows the clear advantage of bid-aware optimal CPR intervals, which avoid longer completion times and higher total costs. We also notice that as the bid increases, the advantage of the optimal CPR interval decreases: at higher bids, frequent checkpointing is not needed as much.
Fig. 4. A1 Speedup Using 100 Spot Instances and Different CPR Intervals
Bidding Price and Application Processing Time. We are also interested in understanding, given the price history, how a new bid affects the total processing time. From this we can derive a number of other important metrics, such as speedup, efficiency, total cost, speedup per dollar, and efficiency per dollar when deploying different numbers of processing units. In the following calculations, we assume:
– The application uses the bid-aware optimal CPR intervals.
– The HPC application runs at the optimal granularity (synchronization overhead is zero).
– The Amazon resources deliver the advertised capabilities.
We can then plug the steady-state timing models of Table 2 directly into the expected running time equation of Section 4.1:

$E_T(A_1) = \frac{\left(\frac{N^2}{PW} + \frac{16N}{u}\right) n_s}{t_0}\left(K_0 + t_0 + \alpha\left(t_0(K_1 + K_2) + \frac{t_0^2}{2}\right)\right)$  (3)

$E_T(A_2) = \frac{\left(\frac{N^3}{PW} + \frac{16N^2}{u}\right) n_s}{t_0}\left(K_0 + t_0 + \alpha\left(t_0(K_1 + K_2) + \frac{t_0^2}{2}\right)\right)$  (4)
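Equations (3) and (4) share the same structure: the failure-free time T = Tpar · ns, inflated by the per-interval CPR overhead. A sketch that restates the formulas with Table 1 symbols (this is not the authors' implementation):

```python
# Expected total running time, equations (3)/(4) (sketch).

def expected_total_time(t_par, ns, k0, k1, k2, alpha, t0):
    """E_T = (T/t0) * (K0 + t0 + alpha*(t0*(K1+K2) + t0^2/2)), T = t_par*ns."""
    t = t_par * ns                     # failure-free running time T
    per_interval = k0 + t0 + alpha * (t0 * (k1 + k2) + t0**2 / 2.0)
    return (t / t0) * per_interval

# Sanity check: with no failures (alpha = 0) and free checkpoints (K0 = 0),
# the overhead factor collapses and E_T equals the failure-free time T.
assert abs(expected_total_time(1.0, 1000, 0.0, 30.0, 1800.0, 0.0, 300.0) - 1000.0) < 1e-6
```

Feeding the Table 2 Tpar models and a bid-derived α into this function reproduces the processing-time, speedup and cost-per-dollar surfaces plotted in Figures 5 and 6.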
Table 3. Critical Parameters

P    200 to 1,000 instances
W    1.5×10^9 measured algorithmic steps per second (cc1.4xlarge)
u    Network speed: 250 MB per second (measured)
N    Problem size: 10^4 to 10^5
ns   Number of iterations: 10^3 to 10^6
Equations 3 and 4 capture the intrinsic dependencies between critical factors such as the bidding price, the price history, the number of spot instances (P), and the overall processing time. To minimize errors, we instrumented our programs to obtain the ranges of W and u. Table 3 shows the value ranges used in our calculations. We report the results in Figures 5 and 6.
Total Expected Speedup vs Bid prices vs Number of Instances
400
#PU=200 #PU=400 #PU=600 #PU=800 #PU=1000
0.8
200
0.6 0.4
0.54
0.56 Bid prices
0.58
0.2 0.52
0.6
2000
#PU=200 #PU=400 #PU=600 #PU=800 #PU=1000
1000
0 0.52
0.54
0.56 Bid prices
0.58
0.6
0.54
0.56 Bid prices
0.58
0.6
30
0.8 #PU=200 #PU=400 #PU=600 #PU=800 #PU=1000
0.6 0.4 0.2 0 0.52
0.54
0.56 Bid prices
0.58
0.6
#PU=200 #PU=400 #PU=600 #PU=800 #PU=1000
20
10
0 0.52
Total Speed up per dollar vs Bid prices vs Number of Instances
Total Expected Maximum Cost vs Bid prices vs Number of Instances 3000
Total Speed up per dollar
Total Expected Maximum Cost
0 0.52
Total Efficiency per dollar
#PU=200 #PU=400 #PU=600 #PU=800 #PU=1000
Total Expected time vs Bid prices vs Number of Instances Total Expected time in hours
1
Efficiency
Speed up
600
1.5
0.54
0.56 Bid prices
0.58
0.6
Total Efficiency per dollar −3 xvs10Bid prices vs Number of Instances #PU=200 #PU=400 #PU=600 #PU=800 #PU=1000
1
0.5
0 0.52
0.54
0.56 Bid prices
0.58
0.6
Fig. 5. A1 using 200 to 1,000 spot instances with n=100,000 to 1,000,000 iterations
First, we observe that HPC applications can indeed gain practical feasibility using spot instances under optimized CPR intervals. As indicated by Amdahl's law [4], diminishing returns are also clearly visible as the number of spot instances increases for the same algorithms. For A1 (with linear communication complexity), speedup and efficiency drop significantly when the number of spot instances exceeds 200. For A2 (with O(n^2) communication complexity), speedup and efficiency drop much earlier. We also notice that for A2, the bidding prices have a much bigger impact on speedup than for A1. The added dimension of bidding price reveals the cost effectiveness of different configurations. Although higher bids can deliver better performance, the cost effectiveness actually decreases (see the Speedup per Dollar charts). Therefore, the
users should use these figures to optimize for budget, processing deadline, or anything in between. The non-trivial insight is the high price sensitivity of algorithms with high communication complexities. The cost effectiveness is also difficult to visualize without the proposed tools. These results provide the basis for selecting the best number of processors (spot instances) and the most promising bidding price for a given objective.
Fig. 6. A2 Using 200 to 1,000 Spot Instances For N=10,000 and 1,000 Iterations
7 Conclusion
Finding the optimal bidding strategy for an arbitrary application is a difficult problem. For specific applications, our proposed approach gives reasonable predictions that can guide the choice of a promising bidding strategy based on the intrinsic dependencies of critical factors. The timing model, together with the bid-aware CPR model, provides an effective tool for determining the optimal bid as well as the optimal number of processing units needed to complete a specific application. This research paves the way for more specialized pricing models for cloud providers by giving more insight into the return on investment. For example, since the speedup gain slows down when the number of processors reaches a certain level, it makes sense to give lower prices as “volume discounts” sensitive to the communication complexities. The new pricing models may change users' behavior, which in turn would also affect the providers, eventually reaching an equilibrium. Meanwhile, resource utilization is maximized. Other innovative ideas are also possible. For example, self-healing applications [23] could enjoy much better cost advantages by setting bidding ranges to organize defensive rings that protect the users' core interests while maintaining the lowest cost structures.
Spot instances give the provider much freedom in dispatching resources to meet dynamic user needs. This freedom allows for ultimate computational efficiency and fair revenue/cost generation. It also challenges the HPC community to develop highly efficient and more flexible programming means that can automatically exploit cheaper resources on the fly. Acknowledgment. The authors would like to thank Professor Slobodan Vucetic and his Ph.D. student Vladimir Coric for the initial discussions of Amazon bidding histories. This research is supported in part by the National Science Foundation grant CNS 0958854 and educational resource grants from Amazon.com.
References

1. Starcluster (2010), http://web.mit.edu/stardev/cluster/
2. Amazon hpc cluster instances (2011), http://aws.amazon.com/ec2/hpc-applications/
3. Agbaria, A.M., Friedman, R.: Starfish: fault-tolerant dynamic mpi programs on clusters of workstations. In: Proceedings of the Eighth International Symposium on High Performance Distributed Computing, 1999, pp. 167–176 (1999)
4. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the Spring Joint Computer Conference, April 18-20, pp. 483–485. ACM, New York (1967)
5. Andrzejak, A., Kondo, D., Yi, S.: Decision model for cloud computing under sla constraints. In: Proc. IEEE Int. Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS) Symp., pp. 257–266 (2010)
6. Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., Warfield, A.: Xen and the art of virtualization. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pp. 164–177. ACM, New York (2003)
7. Blathras, K., Szyld, D.B., Shi, Y.: Timing models and local stopping criteria for asynchronous iterative algorithms. Journal of Parallel and Distributed Computing 58(3), 446–465 (1999)
8. Borthakur, D.: The hadoop distributed file system: Architecture and design (2007), http://developer.yahoo.com/hadoop/tutorial/
9. Chohan, N., Castillo, C., Spreitzer, M., Steinder, M., Tantawi, A., Krintz, C.: See spot run: using spot instances for mapreduce workflows. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, p. 7. USENIX Association (2010)
10. Daly, J.T.: A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems 22(3), 303–312 (2006)
11. Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
12.
Fagg, G., Dongarra, J.: Ft-mpi: Fault tolerant mpi, supporting dynamic applications in a dynamic world. In: Dongarra, J., Kacsuk, P., Podhorszki, N. (eds.) PVM/MPI 2000. LNCS, vol. 1908, pp. 346–353. Springer, Heidelberg (2000)
13. Graham, R.L., Choi, S.E., Daniel, D.J., Desai, N.N., Minnich, R.G., Rasmussen, C.E., Risinger, L.D., Sukalski, M.W.: A network-failure-tolerant message-passing system for terascale clusters. International Journal of Parallel Programming 31(4), 285–303 (2003)
14. Hargrove, P.H., Duell, J.C.: Berkeley lab checkpoint/restart (blcr) for linux clusters. In: Journal of Physics: Conference Series, vol. 46, p. 494. IOP Publishing (2006)
15. Hursey, J.: Coordinated Checkpoint/Restart Process Fault Tolerance for MPI Applications on HPC Systems. PhD thesis, Indiana University, Bloomington, IN, USA (July 2010)
16. Hursey, J., Squyres, J.M., Mattox, T.I., Lumsdaine, A.: The design and implementation of checkpoint/restart process fault tolerance for open mpi. In: Proc. IEEE Int. Parallel and Distributed Processing Symp. IPDPS 2007, pp. 1–8 (2007)
17. Iosup, A., Ostermann, S., Yigitbasi, N., Prodan, R., Fahringer, T., Epema, D.: Performance analysis of cloud computing services for many-tasks scientific computing. IEEE Transactions on Parallel and Distributed Systems 22(6), 931–945 (2011)
18. Litzkow, M., Tannenbaum, T., Basney, J., Livny, M.: Checkpoint and migration of unix processes in the condor distributed processing system. Technical Report (1997)
19. Lusk, E.: Fault tolerance in mpi programs. Special issue of the Journal High Performance Computing Applications, IJHPCA (2002)
20. Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.R.: Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: Proc. Int. Conf. on High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–11 (2010)
21. Rao, S., Alvisi, L., Vin, H.M.: Egida: An extensible toolkit for low-overhead fault-tolerance. In: Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, 1999. Digest of Papers, pp. 48–55. IEEE, Los Alamitos (1999)
22. Shi, J.Y.: Program scalability analysis. In: International Conference on Distributed and Parallel Processing. Georgetown University, Washington D.C. (1997)
23. Shi, J.Y., Taifi, M., Khreishah, A., Wu, J.: Sustainable gpu computing at scale. In: 14th IEEE International Conference on Computational Science and Engineering 2011 (2011)
24.
Stellner, G.: Cocheck: Checkpointing and process migration for mpi. In: Proceedings of the 10th International Parallel Processing Symposium, IPPS 1996, pp. 526–531. IEEE Computer Society, Washington, DC, USA (1996)
25. Vecchiola, C., Pandey, S., Buyya, R.: High-performance cloud computing: A view of scientific applications. In: Proc. 10th Int. Pervasive Systems, Algorithms, and Networks (ISPAN) Symp., pp. 4–16 (2009)
26. Yi, S., Kondo, D., Andrzejak, A.: Reducing costs of spot instances via checkpointing in the amazon elastic compute cloud. In: 2010 IEEE 3rd International Conference on Cloud Computing, pp. 236–243. IEEE, Los Alamitos (2010)
27. Young, J.W.: A first order approximation to the optimum checkpoint interval. Communications of the ACM 17(9), 530–531 (1974)
28. Youseff, L., Wolski, R., Gorda, B., Krintz, C.: Evaluating the performance impact of xen on mpi and process execution for hpc systems. In: Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed Computing, p. 1. IEEE Computer Society, Los Alamitos (2006)
29. Zhang, Q., Gürses, E., Boutaba, R., Xiao, J.: Dynamic resource allocation for spot markets in clouds. In: Proceedings of the 11th USENIX Conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services (2011)
30. Zheng, G., Shi, L., Kalé, L.V.: FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for charm++ and mpi. In: 2004 IEEE International Conference on Cluster Computing, pp. 93–103. IEEE, Los Alamitos (2004)
Investigating the Scalability of OpenFOAM for the Solution of Transport Equations and Large Eddy Simulations

Orlando Rivera1, Karl Fürlinger2, and Dieter Kranzlmüller1,2

1
Leibniz Supercomputing Centre (LRZ), Munich, Germany {Orlando.Rivera,Dieter.Kranzlmueller}@lrz.de 2 MNM-Team Ludwig-Maximilians-Universit¨at (LMU), Munich, Germany
[email protected]

Abstract. OpenFOAM is a mainstream open-source framework for flexible simulation in several areas of CFD and engineering, whose syntax is a high-level representation of the mathematical notation of physical models. We use the backward-facing step geometry with Large Eddy Simulations (LES) and semi-implicit methods to investigate the scalability and important MPI characteristics of OpenFOAM. We find that the master-slave strategy introduces an unexpected bottleneck in the communication of scalar values when more than a hundred MPI tasks are employed. An extensive analysis reveals that this anomaly is present in only a few MPI tasks but results in a severe overall performance reduction. The analysis work in this paper is performed with the tool IPM, a portable profiling and workload characterization tool for MPI programs.
1 Introduction

OpenFOAM (Open Field Operation and Manipulation) is an extensive framework for the solution of Partial Differential Equations (PDEs) using the Finite Volume Method (FVM). It is one of the most popular open-source tools used in continuum mechanics and Computational Fluid Dynamics (CFD). Written in C++, it makes use of advanced features of OOP (Object-Oriented Programming) and modern programming techniques to mimic the mathematical notation of tensor algebra and PDE solutions [8]. OpenFOAM is not a monolithic program. Instead, it consists of many libraries grouped by functionality, on top of which solvers are built. A solver is a problem-specific glue-like program, which is linked with the appropriate libraries for a specific problem. Some libraries are common to all solvers: basic mesh manipulation, parallelization, the finite volume method, etc. Other libraries are used only if needed, e.g., for turbulence models, for compressible or incompressible flows, for dynamic mesh handling, and so on. The parallelization of OpenFOAM is performed using MPI (Message Passing Interface). Applications and solvers in OpenFOAM are the same for serial or parallel execution; a master-slave model is used for parallel runs. Non-blocking and blocking send/receive functions and reduction operations are at the core of each solver.
Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 121–130, 2011. © Springer-Verlag Berlin Heidelberg 2011
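The notation mimicry mentioned above rests on C++ operator overloading. The following self-contained toy (our own illustration, not OpenFOAM's actual fvm/fvc classes, which assemble sparse matrix systems instead) shows the idea for a 1D periodic field, so that an expression such as `ddt(T, Tprev, dt) + grad(T, dx) * a` reads like the underlying mathematics:

```cpp
#include <vector>
#include <cassert>

// Toy 1D field supporting arithmetic, so that discretized equations can be
// written term by term. Illustrative only; not OpenFOAM code.
struct Field {
    std::vector<double> v;
    Field operator+(const Field& o) const {
        Field r{v};
        for (std::size_t i = 0; i < v.size(); ++i) r.v[i] += o.v[i];
        return r;
    }
    Field operator*(double s) const {
        Field r{v};
        for (double& x : r.v) x *= s;
        return r;
    }
};

// Forward-difference "time derivative" between two snapshots.
Field ddt(const Field& now, const Field& prev, double dt) {
    Field r{now.v};
    for (std::size_t i = 0; i < r.v.size(); ++i)
        r.v[i] = (now.v[i] - prev.v[i]) / dt;
    return r;
}

// Central-difference gradient on a periodic mesh with spacing dx.
Field grad(const Field& f, double dx) {
    const std::size_t n = f.v.size();
    Field r{f.v};
    for (std::size_t i = 0; i < n; ++i)
        r.v[i] = (f.v[(i + 1) % n] - f.v[(i + n - 1) % n]) / (2.0 * dx);
    return r;
}
```

With these pieces, the residual of a transport equation is one statement per physical term, which is the style OpenFOAM generalizes to full tensor fields and implicit matrix assembly.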
In this paper we analyze the scalability and performance characteristics of OpenFOAM for a Large Eddy Simulation test case. We perform a strong scaling study up to 256 MPI tasks and a weak scaling study up to 512 MPI tasks. We use performance tools, in particular the Integrated Performance Monitor (IPM), in order to gain an understanding of the main performance-influencing factors, and we identify imbalanced time in point-to-point operations as a cause of limited scalability. The rest of this paper is organized as follows: In Sect. 2 we give an overview of related performance studies that have been conducted using OpenFOAM. In Sect. 3 we describe our experimental setup and in Sect. 4 we describe the results of our performance study. We conclude and discuss directions for future work in Sect. 5.
2 Related Work

The parallel behavior of OpenFOAM is not very well understood when executed on massively parallel systems. The scalability and efficiency of OpenFOAM are an area of open debate. It has been reported that OpenFOAM scales well for small core counts, up to 64 MPI tasks, using the lid-driven cavity test by Calegari et al. [12] and in a study for the HPC Advisory Council [9]. Pringle reported his efforts to port and test OpenFOAM on a Cray XT5 system (the HECToR system at EPCC), also with the laminar lid-driven cavity test [4]. He successfully ran with up to 2048 MPI tasks. Scalability results were measured using meshes with 100³ and 200³ cells (one and eight million cells, respectively). In this setup OpenFOAM experiences a scalability limit at around 512 MPI tasks. The CSC - IT Center for Science in Finland has also reported benchmarking results running the lid-driven cavity test with up to 22 million cells on two systems, on one of them reaching super-linear scalability with 1024 cores [1].
3 Test Cases and Experimental Setup

An important type of simulation in turbulent regimes is the Large Eddy Simulation (LES), which is becoming very popular because it resolves the most relevant turbulent features within the fluid with a high degree of accuracy. LES is based on the concept that smaller-scale turbulence is isotropic, while larger turbulent and energetic eddies are the result of the current geometrical configuration. Larger scales are simulated, while small scales are filtered out and their effects are modeled; this process is known as the closure problem [10]. The building blocks of many LES solvers are the transport equations, which are the main mechanism to solve many problems in Fluid Dynamics. For a first set of experiments we solved a simplified form of the transport equation for a scalar quantity θ:

    dθ/dt + a∇θ = 0.    (1)
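To give the discretized form of Eq. (1) some concreteness, the following is a minimal 1D first-order upwind finite-volume step with periodic boundaries (our own sketch, not the solver used in the paper). For a > 0 and Courant number cfl = a·Δt/Δx, the scheme is stable for cfl ≤ 1, and at exactly cfl = 1 it transports the profile one cell per step while conserving the total of θ:

```cpp
#include <vector>
#include <cmath>
#include <cassert>

// One explicit upwind step for d(theta)/dt + a * d(theta)/dx = 0 with a > 0
// on a periodic 1D domain. cfl = a*dt/dx must satisfy cfl <= 1 for stability.
std::vector<double> upwindStep(const std::vector<double>& theta, double cfl) {
    const std::size_t n = theta.size();
    std::vector<double> next(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double upstream = theta[(i + n - 1) % n]; // periodic left neighbour
        next[i] = theta[i] - cfl * (theta[i] - upstream);
    }
    return next;
}
```

At cfl = 1 the update reduces to next[i] = theta[i-1], the exact shift; at smaller cfl the first-order scheme smears the profile, which is why production LES solvers use higher-order schemes and implicit time integration.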
For a second set of experiments we performed a full large eddy simulation (LES). The chosen geometry for our tests was the backward-facing step. This test is sufficiently simple in its numerical and physical properties, and there are enough experimental and numerical results for proper validation [7]. Most importantly, it allows us to discover features and characteristics of the underlying methodology, such as
those regarding the numerical stability, domain decomposition strategies, MPI execution characteristics, and the performance of iterative solvers and their scalability. The test case is described in detail in [6] and depicted in Fig. 1. The backward-facing step has a Reynolds number Re of 4800 with respect to the step height. The dimensionless inlet bulk velocity is 4.8, normed with respect to the geometrical units. The domain is 25 times larger than the step height in the x direction and three times the step height in the z direction.
Fig. 1. The backward facing step geometry. The step size is H, the total length of the domain is 25H, height and depth are 3H. Periodic boundary conditions on the near and far walls are used.
The solver used in our LES experiments is pisoFoam for LES (OpenFOAM version 1.7.x), which uses the Pressure Implicit solution by Split Operator (PISO) method [5]. The turbulence model is the k-equation eddy-viscosity model with the cube root of the cell volume as the LES delta. At the sides of the domain periodic boundary conditions are used, while the top and bottom were set as no-slip walls. Finally, inlet boundary conditions with artificial noise of 2% of the velocity value and a pressure-driven outlet were specified. The pressure equations were solved using the geometric algebraic multigrid (GAMG) and the Bi-Conjugate Gradient (BiCG) iterative solvers with 3 pressure corrections. Other fields were solved with the BiCG solver. The tests were run over 100 time steps with output every 50 time steps. In all cases a CFL condition number less than 1.0 was specified. Three meshes with different resolutions were used. The first mesh has 250, 96, and 64 cells in the x, y, and z direction, respectively. It contains 2.15 million hexahedral cells with an expansion ratio of 2 in the perpendicular direction, resulting in better resolution near the top and bottom walls; cells close to the wall are half the size of cells along the center line. Also, to better capture flux features, cells were stretched by 50% in the x direction close to the step. The maximal aspect ratio is 5.62, resulting in a mesh of good quality. Runs on 16, 32, 64, 128, and 256 cores were performed on this mesh for a strong scaling study. The second and third meshes contain 4 and 8 times more cells than the first mesh, i.e., 8.62 and 17.20 million cells, respectively. 256 MPI tasks were used on the 8.62 million cell mesh, while 512 MPI tasks were used on the finer mesh.
The number of cells in the x, y, and z directions, as well as the expansion ratios, were adjusted to conform to meshes of characteristics equivalent to the first mesh, resulting in maximum aspect ratios of 3.90 for the 8.62 million cell mesh and 4.01 for the 17.20 million cell mesh. The second and third meshes were used in the weak scaling studies.
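The linear-solver settings described above (GAMG for pressure with 3 PISO corrections, BiCG for the other fields) would typically be expressed in the case's fvSolution dictionary along the following lines. This is an illustrative fragment of our own with plausible tolerances; the paper does not publish its actual case files:

```
solvers
{
    p
    {
        solver                  GAMG;        // geometric algebraic multigrid
        smoother                GaussSeidel;
        cacheAgglomeration      true;
        nCellsInCoarsestLevel   10;
        agglomerator            faceAreaPair;
        mergeLevels             1;
        tolerance               1e-06;
        relTol                  0.01;
    }
    U
    {
        solver                  PBiCG;       // bi-conjugate gradient
        preconditioner          DILU;
        tolerance               1e-05;
        relTol                  0;
    }
}

PISO
{
    nCorrectors                 3;           // 3 pressure corrections, as in the text
    nNonOrthogonalCorrectors    0;
}
```

Swapping the `p` entry between GAMG and PBiCG (with a DIC preconditioner) is the one-line change behind the solver comparison reported later.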
To understand the scalability characteristics of OpenFOAM, we employed IPM (Integrated Performance Monitor)1, a portable profiling and workload characterization tool for MPI applications. IPM drastically reduces the overhead caused by application instrumentation while displaying very detailed information [2]. Studying an application with IPM reveals important parameters beyond mere studies of total wallclock scalability. It identifies bottlenecks, detects hot spots, and collects statistics that help to optimize representative sections of the code. Despite some challenges posed by the advanced software engineering techniques used by OpenFOAM, IPM was able to instrument OpenFOAM and derive useful data. All our experiments were conducted on a massively parallel general-purpose computer, an SGI Altix 4700 with 9728 Intel Itanium 2 cores, a peak performance of 62.3 TFlops, and 19 partitions connected by a high-performance NUMA-link interconnect [11], installed and operated by the Leibniz Supercomputing Centre (LRZ). All runs, when possible, were restrained to a single shared-memory partition. In case a partition was not large enough, two partitions with the same number of cores were specified.
4 Experimental Results

In this section we describe the results of our experiments. We start with the strong scaling study from 16 to 256 tasks (512 tasks for the transport equation), followed by the weak scaling study up to 512 tasks.

4.1 Strong Scaling Study

The solution of the simplified transport equation is done by means of the preconditioned conjugate gradient method with a diagonal incomplete Cholesky decomposition. We used 16, 32, 64, 128, 256, and 512 MPI tasks for the strong scaling test on the 2.15 million cell mesh. Fig. 2 (left) shows the scalability results of this experiment. Evidently, this setup scales up to 128 tasks, after which a slowdown of the execution occurs. Looking at the contribution from various MPI calls, a large contribution of the MPI_Recv routine becomes evident. Fig. 2 (right) shows the time spent in various MPI routines for each rank for the 128 task run. Evidently MPI_Recv is the dominant contributor, and the time in MPI_Recv is highly variable across ranks. MPI_Recv is used by pseudo-reduction operations in OpenFOAM, and our results suggest that paying more attention to how synchronization values are distributed by OpenFOAM's internal solvers is warranted. In the rest of this section we focus on the LES test case. For our strong scaling LES study we used 16, 32, 64, 128, and 256 MPI tasks on the 2.15 million cell mesh. All runs had the same setup, and the domain decomposition was done with the Metis partitioner in order to have approximately the same number of cells per sub-domain and to minimize the maximum connectivity of the sub-domains [3]. The resulting sub-domains are well balanced and vary at most 3% in their number of cells. Table 1 shows the average number of cells per task for each run.
1 http://www.ipm2.org
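The structure of the preconditioned conjugate gradient loop used for the transport equation can be sketched as follows. This is our minimal serial version with a plain diagonal (Jacobi) preconditioner standing in for the diagonal incomplete Cholesky decomposition, and dense matrix storage for brevity; it is not OpenFOAM's implementation:

```cpp
#include <vector>
#include <cmath>
#include <cassert>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

Vec matVec(const Mat& A, const Vec& x) {
    Vec y(x.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;  // in a parallel solver each dot product is an MPI_Allreduce
}

// Preconditioned CG for a symmetric positive definite A, starting from x = 0.
// The preconditioner is the inverse diagonal of A (Jacobi).
Vec pcgSolve(const Mat& A, const Vec& b, double tol = 1e-10, int maxIter = 1000) {
    const std::size_t n = b.size();
    Vec x(n, 0.0), r = b, z(n), p(n);
    for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / A[i][i];
    p = z;
    double rz = dot(r, z);
    for (int it = 0; it < maxIter && std::sqrt(dot(r, r)) > tol; ++it) {
        Vec Ap = matVec(A, p);
        const double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / A[i][i];
        const double rzNew = dot(r, z);
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + (rzNew / rz) * p[i];
        rz = rzNew;
    }
    return x;
}
```

The dot products in the loop are what generate the frequent small global reductions (MPI_Allreduce on a single scalar) observed in the measurements below, one or more per solver iteration.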
Fig. 2. Scaling study (left), and distribution of time in various MPI routines (right) for the transport equations test case
In Table 1 we also summarize the wallclock execution time from 16 to 256 MPI tasks as measured by IPM. The table also shows the percentage of execution time spent in MPI and I/O routines. For both MPI and I/O the minimum, maximum, and average percentages over all MPI tasks are listed. Note that the minimum and maximum values for MPI and I/O routines do not generally refer to the same MPI rank.

Table 1. Wallclock execution time, minimum, maximum, and average fraction of time spent in MPI and I/O routines, respectively, for 16, 32, 64, 128, and 256 MPI tasks

MPI Tasks  #cells (avg)  Wall Time (sec)  MPI % (min/max/avg)    I/O % (min/max/avg)
16         134 400        574.23           6.40 / 14.43 /  9.96  0.08 / 0.51 / 0.13
32          67 200        303.82          14.24 / 26.02 / 14.24  0.11 / 0.74 / 0.14
64          33 600        197.47          25.98 / 40.71 / 31.91  0.23 / 0.31 / 0.71
128         16 800        213.07          27.81 / 65.07 / 33.42  0.37 / 0.91 / 0.55
256          8 400        270.00          25.65 / 69.57 / 31.70  0.63 / 1.20 / 0.90
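The min/max/avg spread in Table 1 already hints at load imbalance. A crude way to quantify it (our own sketch, not an IPM feature) is the ratio of the slowest rank's MPI time to the mean over all ranks:

```cpp
#include <vector>
#include <algorithm>
#include <numeric>
#include <cmath>
#include <cassert>

// Max-to-average imbalance of per-rank MPI times: ~1.0 means balanced,
// clearly above 1.0 means a few ranks dominate the MPI time.
double imbalance(const std::vector<double>& mpiTimePerRank) {
    const double maxT = *std::max_element(mpiTimePerRank.begin(),
                                          mpiTimePerRank.end());
    const double avgT = std::accumulate(mpiTimePerRank.begin(),
                                        mpiTimePerRank.end(), 0.0)
                        / mpiTimePerRank.size();
    return maxT / avgT;
}
```

Applied to the percentages in Table 1, this ratio grows from roughly 1.4 at 16 tasks to around 2 at 128 and 256 tasks, matching the visual impression of a few outlier ranks.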
From Table 1 we see that the pisoFoam solver has acceptable scalability up to 64 MPI tasks. With 128 and 256 MPI tasks the execution actually slows down. This could be due to several factors, such as a data set that is too small, a suboptimal use of MPI communication facilities, or a slow interconnection network. If we plot the minimum, maximum, and average percentages of the MPI wall time, we see that for 16, 32, and 64 MPI tasks the average lies midway between the maximum and minimum. However, for 128 and 256 tasks the average follows the minimum, while the maximum is much larger (Fig. 3, left). This is a clear indication that one or a few tasks experience significant overheads.

Fig. 3. Time in MPI as a fraction of wallclock time, plotting minimum, maximum, and average among ranks (left). Number of calls by MPI routine (log scale, right).

IPM does not only measure the overall time spent in MPI; it also gives statistics for individual MPI function calls and data transfer sizes. A comparison of the number of calls (Fig. 3, right) with the aggregate time spent in each function (not shown in Fig. 3) reveals that there is no correlation between these numbers. OpenFOAM's main communication mechanism for interchanging field values among sub-domains is MPI_Isend/MPI_Irecv/MPI_Waitall. As shown in Fig. 3 (right), the number of calls of MPI_Irecv/MPI_Isend is one or two orders of magnitude larger than that of any other MPI function. The number of MPI_Waitall calls remains almost constant, or even decreases slightly, with the number of tasks. Calls to MPI_Allreduce increase, but not at the same rate as sends and receives. From Fig. 4 (left) we see a reduction in the relative time used by MPI_Isend/MPI_Irecv and MPI_Waitall when 256 MPI tasks are used, compared to the 64 MPI task run, even though the number of calls increases (Fig. 3, right). Smaller communication areas, local communication patterns, and the low-latency interconnect explain this behavior. In Fig. 4 (right) we can see that Metis has decomposed the domain with some sub-domains divided into more than two parts. Increasing the total number of neighboring sub-domains, and, subsequently, the calls to MPI_Isend/MPI_Irecv, can thus result in load imbalance. As shown in Fig. 3 (right) and Fig. 4 (left), the time in MPI_Allreduce is comparatively high considering that only a single scalar value is aggregated. How many times this function is called depends on the type of iterative solver and the number of
Fig. 4. Time spent in various MPI routines as the core count increases (left). Example of the graph-based Metis domain decomposition (right). Rank 0 is shown in dark grey, rank 5 in light grey.
iterations this solver requires to converge. For example, the 16 MPI task run, using the BiCG solver for the pressure, results in 5179 calls for the first time step. By specifying the GAMG solver, the number of calls is 7839. There is, however, no advantage in using BiCG over GAMG, since the number of iterations needed by BiCG is several hundred (500–700) compared to the 20–40 used by the GAMG solver, and the GAMG solver is three to ten times faster. Hence, trying to reduce the number of MPI_Allreduce calls by changing the pressure-correction solver would be a mistake. The same can also be said for MPI_Recv, which is not used to interchange values at the inter-domain boundary, but to transport single scalar values. Both MPI_Allreduce and MPI_Recv are called very infrequently but constitute an important fraction of the MPI time; for 256 MPI tasks more than 9% of the time is spent in these routines. MPI_Recv is a limiting factor not only because its contribution grows with an increasing number of tasks, but also because a cluster of a few MPI tasks is responsible for this increment. If we recall Fig. 3 (left), for 128 and 256 MPI tasks the maximum value of the MPI time (as a percentage of the wall time) was far away from the average. This is because a few ranks were slowing down the whole execution. Since there is a synchronization at the end of each iteration, equation solution, and time step, it is enough for one MPI rank, for whatever reason, to slow down to have a very negative impact on the whole application. Fig. 5 can help to understand this behavior. Here the absolute times for MPI_Recv, MPI_Allreduce, and the sum of all MPI routines (ordered by rank for 128 and 256 MPI tasks) are plotted. MPI_Allreduce uniformly constitutes about 25% and 32% of the MPI time for 128 and 256 MPI tasks, respectively.
Fig. 5. Total MPI time, and time in MPI_Recv and MPI_Allreduce, per rank, for 128 MPI tasks (top) and 256 MPI tasks (bottom).
MPI_Recv behaves differently. In the 256 task case, for example, MPI ranks 0 to 7 have MPI overheads that are double the average, primarily caused by the peak in MPI_Recv time. For these ranks (0–7), between 61% and 69.57% of the wall time (270 sec) is spent solely in MPI routines. For 128 tasks the situation is similar: ranks 12 to 15 (4 ranks) present the MPI_Recv peak, while the rest have a smoother distribution. As mentioned before, MPI_Recv/MPI_Send pairs are mainly used to communicate scalar values, so interconnect bandwidth is not an issue here. MPI_Recv is also not called more frequently than other functions; in fact, compared to MPI_Isend/MPI_Irecv these calls are too few to argue that latency is a main factor. Whether 128 MPI tasks is an important limit, and under which conditions, is a question that cannot be answered at this point without more sophisticated tracing tools and a deeper understanding of how and when MPI_Recv is used by OpenFOAM.

4.2 Weak Scaling

Several questions remained unanswered in the previous section, mainly because the mesh with 2.15 million cells was not large enough for runs with more than 256 MPI tasks. Two further meshes were prepared: one of 8.62 million cells to be run with 256 cores, and one of 17.20 million cells to be used with 512 cores.

Table 2. Wall time, ranks with the minimum and maximum MPI and I/O time, and averages (percentages) for 256 and 512 tasks. Both the GAMG and BiCG solvers were used for comparison for the 512 task runs.

MPI Tasks  #cells (avg)  Solver  Wall Time (sec)  MPI % (Min/Max/Avg)    I/O % (Min/Max/Avg)
256        33 688        GAMG     486.64          31.20 / 54.39 / 36.77  0.39 / 1.29 / 0.74
512        33 600        GAMG     600.95          41.56 / 68.77 / 47.29  0.32 / 0.56 / 0.88
512        33 600        BiCG    1893.24          20.01 / 27.16 / 23.04  0.11 / 0.18 / 0.28
The domain decomposition was done with Metis using the same parameters as described in the strong scaling section, in order to have approximately 33 000 cells per sub-domain. This number proved sufficiently large for a reasonable speedup at 64 MPI tasks with the smaller mesh. In Table 2 we list the percentage of time in MPI and I/O for the weak scaling study. Both the GAMG and BiCG solvers were used for the 512 task runs for comparison. For increased scalability one can specify BiCG instead of GAMG, thus reducing the MPI overhead. However, the GAMG solver, with all its MPI issues, is 3 times faster than the BiCG solver (performance, not scalability, is the proper value to be measured). This analysis indicates that one should isolate, and pay more attention to, the GAMG solver and its specific use of MPI functionality. MPI_Recv and MPI_Allreduce are important overheads for GAMG (with 256 and 512 MPI tasks). For the BiCG solver, MPI_Allreduce represents an important overhead at 11%, which is half of the MPI time. This result agrees with the ones reported in [9], [12], and [4].
In previous work [13] we have shown that MPI_Recv time is concentrated in a few ranks, which slow down the simulation. However, in order to generalize these findings it will be necessary to run tests with more complicated geometries and domains, higher counts of cells per rank with larger domains, and new strategies for domain decomposition.
5 Summary and Outlook

Understanding the parallel behavior of OpenFOAM is a complex task due to the internal structure and flexibility that OpenFOAM offers. However, some patterns and specific characteristics have been discovered. We have seen from the results presented that some configurations are more suitable for scaling, and under which circumstances bottlenecks may occur. One of the choices for this problem and type of mesh is to use the GAMG solver. This solver converges rapidly to a solution, faster than BiCG, but its MPI footprint increases and becomes more problematic with a larger number of MPI tasks. We have learned that the scalability of the GAMG solver is limited up to a certain point (in our study, 64 MPI tasks). An inflexion point can be seen, however, beyond which the good scalability properties of BiCG could be exploited on massively parallel systems. Moreover, scalability alone is not the relevant parameter; overall performance is. At this point it is difficult to generalize; much more data and many more test cases with different solvers need to be investigated. Compressible, density-based, or large-Mach-number solvers might have a different MPI signature. In our experiments the major performance issues did not occur for the MPI routines with the largest visit counts or data transfers. Instead, MPI routines that appeared harmless at first, such as MPI_Recv and MPI_Allreduce, were identified as problematic. We have also recognized that these problems were concentrated in only a few MPI tasks, and further research is needed to consolidate these observations. In terms of performance, I/O was not an important issue in our experiments. That does not mean there is no opportunity to improve it. In OpenFOAM, each MPI instance reads its own input and writes its own output. If OpenFOAM breaks the barrier of thousands of cores, we will consequently have thousands of files, which is not suitable for any available file system.
Lastly, we have to emphasize that all these findings would not have been possible at all if we did not have profiling and tracing tools. Without these tools we could be optimizing or paying attention to the wrong places. IPM was flexible enough to produce the required information (where other tools have failed) while at the same time being concise and informative (where some other tools could have produced huge amounts of redundant information). We hope these tools keep pace with the development in software and hardware, and specifically with the expected scale of next-generation machines.

Acknowledgements. This project was in part financially supported by the EU reintegration grant MADAME under agreement no. PIRG07-GA2010-268351.
References
1. CSC - IT Center for Science: OpenFOAM - CSC, http://www.csc.fi/english/research/sciences/CFD/CFDsoftware/openfoam/ofpage
2. Fürlinger, K., Wright, N.J., Skinner, D.: Effective performance measurement at petascale using IPM. In: Proceedings of the Sixteenth IEEE International Conference on Parallel and Distributed Systems (ICPADS 2010), Shanghai, China (December 2010)
3. Karypis, G., Kumar, V.: MeTis: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 4.0 (2009), http://www.cs.umn.edu/~metis
4. Pringle, G.: Porting OpenFOAM to HECToR. A dCSE Project (2010), http://www.hector.ac.uk/cse/distributedcse/reports/openfoam/openfoam/index.html
5. Jasak, H.: Error Analysis and Estimation for the Finite Volume Method with Applications to Fluid Flow. PhD thesis, Department of Mechanical Engineering, Imperial College of Science, Technology and Medicine (1996)
6. Kobayashi, H., Wu, X.: Application of a local subgrid model based on coherent structures to complex geometries. Center for Turbulence Research, Stanford University and NASA. Annual research brief, pp. 69–77 (2006)
7. Le, H., Moin, P., Kim, J.: Direct numerical simulation of turbulent flow over a backward-facing step. Journal of Fluid Mechanics 330(1), 349–374 (1997)
8. Weller, H.G., Tabor, G., Jasak, H., Fureby, C.: A tensorial approach to computational continuum mechanics using object orientated techniques. Computers in Physics 12(6), 620–631 (1998)
9. HPC Advisory Council: OpenFOAM Performance Benchmark and Profiling (2010), http://www.hpcadvisorycouncil.com/pdf/OpenFOAM_Analysis_and_Profiling_Intel.pdf
10. Berselli, L.C., Iliescu, T., Layton, W.J.: Mathematics of Large Eddy Simulations of Turbulent Flows, 1st edn., pp. 18–25. Springer, Heidelberg (2005)
11. Leibniz-Rechenzentrum (LRZ): Hardware Description of HLRB II (2009), http://www.lrz.de/services/compute/hlrb/hardware/
12. Calegari, P., Gardner, K., Loewe, B.: Performance Study of OpenFOAM v1.6 on a Bullx HPC Cluster with a Panasas Parallel File System. In: Open Source CFD Conference, Barcelona, Spain (November 2009)
13. Rivera, O., Fürlinger, K.: Parallel aspects of OpenFOAM with large eddy simulations. In: Proceedings of the 2011 International Conference on High Performance Computing and Communications (HPCC 2011), Banff, Canada (September 2011)
Shibboleth and Community Authorization Services: Enabling Role-Based Grid Access

Fan Gao and Jefferson Tan

Faculty of Information Technology, Monash University, VIC 3800, Australia
[email protected],
[email protected]

Abstract. Classical authentication and authorization in grid environments can become a user management issue due to the flat nature of credentials based on X.509 certificates. While such credentials are able to identify user affiliations, such systems typically leave out a crucial aspect of user management and resource allocation: privilege levels. Shibboleth-based authentication mechanisms facilitate the secure communication of such user attributes within a trust federation. This paper describes a role-based access control framework that exploits Shibboleth attribute handling and CAS (Community Authorization Services) within a grid environment. Users are able to obtain appropriate access levels to resources outside of their domain on the basis of their native privileges and resource policies. This paper describes our framework and discusses issues of security and manageability.
Keywords: grids, resource allocation, user management, single sign-on.
1 Introduction

Grids are used to "solve unique research problems and to collaborate between different researchers across the globe" [1]. They provide substantial support for research in a variety of applications, but a number of challenges remain, including security. We focus on two types of security services within grids: access control and communication security. Currently, Public Key Infrastructure (PKI) [2] provides the primary authentication mechanism, as in the case of the Grid Security Infrastructure (GSI) [3], where PKI-based X.509 identity certificates are used for authentication. Apart from the ability to employ asymmetric encryption, there is the advantage of using proxy certificates to support Single Sign-On (SSO) [4], allowing for security delegation via the more limited proxy certificates. The notion of SSO is not unique to X.509 certificates, as it is also used in Shibboleth SSO [5], which works within the context of a trust federation. A participating entity entrusts user authentication to another participating entity from which a service is requested. Because its secure transactions are designed to facilitate the release of user data in support of such trust relationships, we have found an apt application for Shibboleth SSO: role-based access control, translating to role-based authorization. This paper focuses on authentication, authorization, and resource access control with Shibboleth and the Community Authorization Service (CAS). The goal is to provide access control in large-scale grids where service providers need not rely solely on their own basis for the privileges that they grant to remote users. Our scheme supports the determination and application of a remote user's appropriate authorization level based on local policies. In Section 2, related works are reviewed in light of our work. Section 3 gives an overview of the proposed framework's design. Section 4 discusses the details of this framework, and we evaluate the framework design in Section 5. Finally, conclusions are presented in Section 6.
Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 131–140, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Background and Related Works

The Grid Security Infrastructure (GSI) [3] relies on X.509 and SSL/TLS mechanisms through which a trusted third party can sign, and thus vouch for, either user or service credentials consisting of a public and private key pair and a certificate. This trusted third party is the Certificate Authority (CA), which bears the mutual trust of all participating entities. The four steps in GSI authentication to access the Grid [1] are:
1. Obtain the CA certificate that contains its public key.
2. The key pair is generated at the local host together with a certificate request. The latter is signed using the user's private key, which is used by the CA to verify the request's authenticity.
3. The CA verifies the user's information, typically with the assistance of an authorized registration authority operator (RAO). The CA then signs the request using its private key and releases the signed certificate to the user.
4. Store the CA's certificate and the signed certificate, which are all used for subsequent authentication transactions.
These certificates (and associated keys) can form the basis of access control. Among the three access control models, Mandatory Access Control (MAC), Discretionary Access Control (DAC), and Role-Based Access Control (RBAC), the MAC model is often used in secure grids [6]. Security labels describe some aspect of an entity that can be used to describe access conditions with another entity. Two types of security labels are a security classification on an object and a security clearance on a user. In the DAC model, users control access to files and other resources that they create, use, or own, and can pass access rights on as preferred. However, the RBAC model seems more appropriate in incorporating the user's role in deciding whether to allow access to resources, and roles reflect a user's privileges. RBAC is considered to be scalable and flexible ([1], [7]). It also provides a mechanism to articulate policies instead of being locked into a particular security policy. Shibboleth-based grid resource access control [5] integrates Shibboleth and authorization based on the X.812 Access Control Framework standard [8]. Shibboleth pushes trust out towards local security infrastructures where users are accountable. Each participating organization relies on its own Identity Provider (IdP) to vouch for its users when accessing remote services. An optional Where Are You From (WAYF) service allows users to identify their home organizations prior to being redirected to their own IdP. A Service Provider (SP) integrates pre-existing or new web services, enforcing IdP-based authentication before making access control decisions.
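The RBAC model discussed above reduces an access decision to a role lookup. The following is a generic, self-contained illustration of that idea (our own sketch, not the framework proposed in this paper): users hold roles, roles hold permissions, and a request is granted if any of the user's roles carries the required permission.

```cpp
#include <map>
#include <set>
#include <string>
#include <cassert>

// Minimal RBAC core: user -> roles and role -> permissions mappings.
// Illustrative only; a real deployment would source roles from attributes.
struct Rbac {
    std::map<std::string, std::set<std::string>> userRoles;
    std::map<std::string, std::set<std::string>> rolePerms;

    bool allowed(const std::string& user, const std::string& perm) const {
        auto ur = userRoles.find(user);
        if (ur == userRoles.end()) return false;
        for (const auto& role : ur->second) {
            auto rp = rolePerms.find(role);
            if (rp != rolePerms.end() && rp->second.count(perm)) return true;
        }
        return false;
    }
};
```

In the setting of this paper, the user-to-role mapping would arrive as Shibboleth attributes released by the user's IdP, while the role-to-permission mapping would be the local policy of the resource provider or CAS.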
Shibboleth and Community Authorization Services: Enabling Role-Based Grid Access
133
Consequently, access to resources can be decided on the basis of the local security policies of the resource provider. Cacheable Decentralized Groups Access Control [9], based on decentralized group membership, relies on a global namespace; end users can create and manage decentralized groups through access control lists (ACLs) on shared resources. In Trust Degree Based Access Control [10], a centralized security server is used to manage and compute a trust value for every access request. The authorization model based on the eXtensible Access Control Markup Language (XACML) resulted in the Community Authorization Service (CAS) [11]. It promises to improve access control in terms of performance and ease of maintenance [9]. Some issues are crucial in the context of this study. Users are typically not experts in grid computing, so there should be a way to simplify authentication and authorization procedures. Moreover, as grids become increasingly complex, managing them becomes dramatically more difficult as well. Confidentiality is the centre of attention for Resource Providers (RPs), since data resources are commonly confidential in different communities and domains. Scalability affects the performance of GSI: additional mechanisms normally increase overheads, as does the storage requirement for user information such as attributes, keys, and roles. There are human overheads as well, such as obtaining a signed certificate from the CA via a registration authority operator. Authentication management, such as key production, revocation, and certificate storage, is also becoming a challenge in large-scale environments. There is also the problem of trust: if trust relationships are broken, the whole system could be affected. Finally, there are interoperability issues among several security mechanisms that are individually effective but may not have been designed to fit together.
These mechanisms only focus on particular issues or provide only part of the solution. There is no adherence to one complete, end-to-end standard among them.
3 RBAC with Shibboleth and CAS

Our work emphasizes that utilizing a user's role attribute, which Shibboleth-style authentication can supply at sign-on, enhances access control. As per Shibboleth SSO, users authenticate themselves through the IdP of their home organization, which releases one or more role attributes. The same sign-on allows them to obtain credentials from an on-demand CA, which can incorporate the role data into a newly signed certificate. This certificate is used to obtain role-based capabilities from the CAS, which gives the users access to authorized resources at the appropriate level of access. As Figure 1 shows, the proposed RBAC framework includes six major components. The Authentication System is based on Shibboleth. The Grid Service Web Portal is the gateway to the target Grid resource. The Credential Proxy Server generates proxy certificates as needed by an authenticated user. The Certificate Pool System is a novel method of restricting access by using a limited pool of certificates that can be active at any given time. The Authorization Server enforces policies that take user credentials, including role attributes, to make authorization decisions. Grid Resources include the Globus gatekeeper service for launching jobs or the GridFTP service for file transfers. The simple workflow of our framework is as follows:
134
F. Gao and J. Tan
Fig. 1. The proposed RBAC framework with Shibboleth and CAS
1. The user sends an access request to the Grid Service Web Portal via browser.
2. The portal sends an authentication and attribute request to the user.
3. The Authentication System authenticates the user and releases his role attribute.
4. The user forwards authentication data and the role attribute to the Grid Service Web Portal, which forwards attributes to the Credential Proxy Server.
5. The authenticated user sends an access request to the Credential Proxy Server.
6. The Credential Proxy Server obtains a local certificate for the user from the Certificate Pool System, and generates a proxy credential based on the user's role attribute, reflecting the appropriate privilege level for the user.
7. The Proxy Server sends the proxy credential to the Authorization Server.
8. The Authorization Server provides/activates the appropriate access capabilities for the user based on the role-based proxy credential.
9. The user may now access Grid resources at the appropriate level of access.
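The workflow above can be compressed into a call sequence like the following sketch, in which each component of Figure 1 is reduced to an illustrative stub. The function names and returned values are our own assumptions, not the framework's API.

```python
# The nine workflow steps as a call sequence (all bodies are stubs).
def authenticate(user):                      # steps 1-4: Shibboleth SSO
    return {"user": user, "role": "staff"}   # IdP releases the role attribute

def get_pool_certificate(attrs):             # step 6: Certificate Pool System
    return f"cert-for-{attrs['role']}"

def make_proxy_credential(attrs, cert):      # step 6: Credential Proxy Server
    return {"cert": cert, "role": attrs["role"]}

def authorize(proxy):                        # steps 7-8: Authorization Server
    return {"staff": ["submit_job", "gridftp"]}.get(proxy["role"], [])

attrs = authenticate("alice")
proxy = make_proxy_credential(attrs, get_pool_certificate(attrs))
capabilities = authorize(proxy)              # step 9: access at this level
assert "gridftp" in capabilities
```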
The Community Authorization Service (CAS), once a component of the Globus Toolkit (GT) for fine-grained access control, is part of our proposed framework. The CAS Server verifies the user's credentials and generates proxy credentials for the user with specific capabilities based on the user's role and the security policies stored in the CAS Database. The latter also stores information about users, groups, actions, resources and policies. CAS acts as a broker between the user and the resource provider. The resource provider still makes the final decision on user access to grid resources, i.e., the resource provider could temporarily reject the user even if the latter holds authorization for the resources.
Fig. 2. Overview of CAS
Shibboleth (Figure 3) is open-source software developed by the Middleware Architecture Committee for Education - Internet2 (MACE-Internet2). For authentication purposes, Shibboleth can be considered an extension of the typical local authentication and authorization approach, in which a user can access resources of any participant of a federation using their local authentication information.
Fig. 3. Shibboleth architecture and workflow, redrawn from Jie et al. [4]
The Certificate Pool System (CPS) is a novel system to manage certificates in a local system for remote users. Figure 4 illustrates the architecture of CPS, which has two major components: the Certificate Pool Manager (CPM) and Certificate Pool Database (CPD). CPS stores a certain number of certificates in the CPD, based on the policies set by the owning organization. The certificates are generated and signed by the CA for remote users who want to use local Grid resources. These certificates have shorter lifespans than certificates for local users. The main functions of the CPM are:
Fig. 4. Architecture of the Certificate Pool Manager (CPM)
• To check the validity of certificates stored in the CPD.
• To generate certificate status reports for users or administrators.
• To allocate a certain number of certificates to particular roles. For instance, assume 100 certificates stored in the CPD. The CPM can allocate 10 for administrators, 70 for staff and 20 for students. The ratio of certificates for different roles is variable.
• To control the maximum number of concurrent remote user accesses to local resources.
• To respond to certificate requests from legitimate users or entities.
• To maintain certificates in the CPD, i.e., to destroy a certificate if it is revoked or expires, and to obtain or renew certificates via the CA when required.
• To control the maximum lifetime of certificates, based on some policy, when the CPM sends a request to the local CA to sign a certificate.
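The per-role allocation in the example above (10/70/20 out of 100 certificates) might be sketched as follows. The class and method names are our own, not part of the CPM implementation.

```python
# Sketch of the CPM's per-role certificate allocation and the cap on
# concurrent remote access (quotas from the example in the text).
class CertificatePool:
    def __init__(self, quotas):
        self.quotas = dict(quotas)            # max concurrent certs per role
        self.in_use = {r: 0 for r in quotas}

    def request(self, role):
        # Refuse once the role's allocation is exhausted -- this is how
        # the CPM caps concurrent remote access per role.
        if self.in_use.get(role, 0) >= self.quotas.get(role, 0):
            return None
        self.in_use[role] += 1
        return f"{role}-cert-{self.in_use[role]}"

    def release(self, role):
        self.in_use[role] = max(0, self.in_use[role] - 1)

pool = CertificatePool({"administrator": 10, "staff": 70, "student": 20})
assert pool.request("staff") == "staff-cert-1"
assert all(pool.request("administrator") for _ in range(10))
assert pool.request("administrator") is None   # 11th admin request refused
```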
The CPS can improve authentication and authorization since users do not need personal certificates from the local CA. It increases security through centralized control over concurrent remote users and certificate lifespans. It also simplifies the management of certificates for remote users, and the workload for the local system administrator decreases.
4 Framework Design

The framework consists of one part for remote users and one for local users. The following are assumed:

• In our prototype, three roles are defined: administrator, staff and student. More sophisticated classifications may well suit other situations.
• The framework design does not reflect the influence of a firewall.
• HTTPS is used in all interactions in this framework, to provide SSL/TLS security.
4.1 Framework Design for Remote Users

The framework, based on the typical Single Sign-On model [12], is redesigned for our own purposes. Figure 5 illustrates seven major components, which we describe below.
Fig. 5. System architecture and workflow for a remote user
The Identity Provider (IdP) provides authentication and authorization services via the user’s home institution. Note that it relies on some underlying security infrastructure, typically an authentication, authorization and accounting (AAA) system for local user accounts and accounting data. The Where Are You From (WAYF) Service is used to redirect a remote user to their home institution’s IdP to begin the authentication process. A Portal is considered a gatekeeper to local system services. It must verify the legitimacy of a remote user and obtain the user’s attributes. The portal’s Assertion Requester generates and presents an authentication request to a user to authenticate through his/her home institution. It is assumed that the local system and the user’s home institution are in the same federation. The Attribute Requester is used to obtain the user’s attributes from his/her home institution based on the local system requirements and delivers these attributes to the local system for subsequent authorization purposes. The
GridShib-CA is used to generate local proxy credentials based on the user's role attributes and certificate, stored in the Certificate Pool System (CPS). A specific Subject Name (SN) (e.g. remote_administrator/01, remote_staff/02, remote_student/05, etc.) is composed in the proxy credential according to the user's role. In effect, the GridShib-CA makes it possible for roles to enter the picture. It sends an attribute request to obtain a user's attributes from the user's home institution via the portal, and it requests local certificates for remote users from the CPS. The Certificate Pool System (CPS) manages and stores local certificates in the certificate pool for remote users. The Community Authorization Service (CAS) provides access control for remote users, based on a user's Distinguished Name (DN) and the Subject Name (SN) in the proxy credentials, and on local policies. CAS generates specific capabilities for a remote user, used to access authorized local resources. A "capability valid time" notice generated by the CAS indicates the maximum validity period of those capabilities. The Resource Provider (RP) provides resources, e.g., computing resources, storage resources, applications and services. Figure 5 shows 19 steps of authentication and authorization for a remote user. Steps 1-8 are the typical Shibboleth authentication steps via the portal for the remote user. Steps 9 through 15 present the procedure to generate role-based local proxy credentials for the remote user. Finally, steps 16 through 19 describe the local resource capability request procedure and access to local Grid resources.

4.2 Framework Design for a Local User

In comparison with that for a remote user, the architecture and workflow for a local user is much simpler. For example, the IdP is part of the local system, so there is no need to use the Portal and WAYF service to redirect the user to the IdP, though doing so will still work.
In addition, the Certificate Pool System (CPS) is unlikely to apply to local users since they each have local certificates, and local policies may not be restrictive to local users. They may also authenticate directly via the local Authentication System behind the IdP, rather than through the IdP itself.
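To illustrate how a role-specific Subject Name could be composed and mapped to capabilities, following the SN examples in Section 4.1: the capability table and helper names below are our own assumptions, not the GridShib-CA or CAS APIs.

```python
# Sketch: compose a role-specific Subject Name (e.g. remote_staff/01)
# and map it to role-based capabilities, CAS-style.
import itertools

_counters = {r: itertools.count(1)
             for r in ("administrator", "staff", "student")}

def compose_subject_name(role):
    # Follows the format of the examples in the text, e.g. remote_staff/02.
    return f"remote_{role}/{next(_counters[role]):02d}"

CAS_POLICY = {                      # illustrative policy table
    "administrator": {"submit_job", "gridftp", "manage"},
    "staff":         {"submit_job", "gridftp"},
    "student":       {"submit_job"},
}

def cas_capabilities(subject_name):
    role = subject_name.split("/")[0][len("remote_"):]
    return CAS_POLICY.get(role, set())

sn = compose_subject_name("staff")
assert sn == "remote_staff/01"
assert cas_capabilities(sn) == {"submit_job", "gridftp"}
```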
5 Pros and Cons

Single Sign-On allows users to be authenticated by their home institution's authentication infrastructure. Once authenticated, they obtain local proxy credentials and capabilities to transact repeatedly with resources without subsequent sign-on steps. Usability improves for users, who are easily authenticated across domains based on the trust relationship within a federation. "Role-grained" access control comes from the user's role attribute, which allows the local system to provide appropriate privileges without maintaining individual accounts for remote users. In a typical grid, resources are allocated via one-size-fits-all queues, or via groups to which individual users are added manually. In our model, CAS policies apply to roles, i.e., "role-grained" allocation. User authentication is effectively delegated in hierarchical fashion via home institutions, providing scalability. Apart from the above benefits, further points about manageability can be made. Each institution needs only to manage its own IdP for its own users. New users or new institutions imply no major changes to local
account management, assuming that generic accounts are already available for remote users in general. Moreover, resource providers define role-based authorization policies for their resources, based on roles rather than on individual users. The CPS avoids having users approach a registration authority operator prior to the approval of a certificate request. Each institution can autonomously use its own authentication system and still be federated. Different mechanisms such as VOMS [1] and PERMIS [13] could be used locally without impairing interoperability. Privacy is protected in Shibboleth, as local systems request minimal user attributes from the user's home IdP, and our framework only adds role data to what is normally required. Of course, there are disadvantages as well. First, Shibboleth is a static-trust-based infrastructure in which the institutions must trust one another. If the trust relationship is broken, the whole federation is affected: an unauthorized user who masquerades as a valid remote user can access other institutions in the federation. Moreover, if the IdP releases the wrong role data for a given user, CAS will effectively authorize higher privileges than are appropriate. Second, attributes defined in the remote user's IdP may not meet the authorization requirements of other institutions; if these definitions change at some point, mismatches may arise. Third, CAS does not support users from multiple VOs, as it only supports basic access control policies. Finally, CAS may become a performance bottleneck when there is only one CAS server. There are other issues to consider. The framework is not particularly designed to withstand attacks. A masquerade attack can let an adversary gain privileged roles. A Denial-of-Service (DoS) attack can bring the CAS server down, making the authorization service unavailable to all. Also, owing to the characteristics of Shibboleth, a user who belongs to a specific institution can only possess one role.
What if the user belongs to multiple institutions, having different roles? This can be a minor accounting problem, or a bigger one if there is reason to revoke the user’s privileges completely, but the user’s privileges exist across multiple identities with individual roles.
6 Conclusions and Future Work

We discussed a proposed role-based access control system that addresses authentication, authorization, and decentralized resource allocation and account management in a grid. It integrates Shibboleth and CAS to improve the usability, scalability, manageability, privacy and performance of existing authentication and authorization systems using a Single Sign-On, role-based and "role-grained" access control mechanism. Our implementation supports workable workflows for remote and local users, and supports interoperability across different mechanisms. A novel Certificate Pool System was employed to improve security, manageability and performance. Possible extensions to our work include:

• a distributed CAS system to improve robustness and scalability;
• implementing this framework with VOMS [1], PERMIS [13] and other mechanisms;
• adding support for users with multiple roles in different institutions.
References

1. Chakrabarti, A.: Grid Computing Security. Springer, New York (2007)
2. Gutmann, P.: PKI: It's not dead, just resting. Computer 35(8), 41–49 (2002)
3. Foster, I., Kesselman, C., Tsudik, G., Tuecke, S.: A security architecture for computational grids. In: 5th ACM Conf. on Computer and Communications Security (CCS 1998), pp. 83–92. ACM, New York (1998)
4. Jie, W., Arshad, J., Ekin, P.: Authentication and authorization infrastructure for Grids—issues, technologies, trends and experiences. J. Supercomput. 52(1), 82–96 (2010)
5. Sinnott, R.O., Jiang, J., Watt, J., Ajayi, O.: Shibboleth-based access to and usage of grid resources. In: Proc. 7th IEEE/ACM Int. Conf. Grid Computing (Grid 2006), pp. 136–143. IEEE Computer Society, Washington, DC (2006)
6. Daswani, N., Kern, C., Kesavan, A.: Foundations of Security: What Every Programmer Needs to Know. Apress, New York (2007)
7. Pereira, A.L., Muppavarapu, V., Chung, S.M.: Role-based access control for grid database services using the community authorization service. IEEE Trans. Dependable and Secure Computing 3(2), 156–166 (2006)
8. ITU-T Recommendation X.812 | ISO/IEC 10181-3:1996: Security frameworks for open systems: Access control framework (1996)
9. Hemmes, J., Thain, D.: Cacheable decentralized groups for grid resource access control. In: 7th IEEE/ACM Int. Conf. Grid Computing (Grid 2006), pp. 192–199. IEEE Computer Society, Washington, DC (2006)
10. Ni, X., Luo, J., Song, A.: A trust degree based access control for multi-domains in grid environment. In: 11th Int. Conf. Computer Supported Cooperative Work in Design (CSCWD 2007), pp. 864–869. IEEE, Piscataway (2007)
11. Lang, B., Foster, I., Siebenlist, F., Ananthakrishnan, R., Freeman, T.: A multipolicy authorization framework for grid security. In: 5th IEEE Int. Symp. Network Computing and Applications (NCA 2006), pp. 269–272. IEEE, Los Alamitos (2006)
12. Jensen, J., Spence, D., Viljoen, M.: Grid single sign-on in CCLRC. In: Proc. UK eScience All Hands Meeting 2006, Nottingham, UK. National e-Science Centre, Edinburgh (2006)
13. Chadwick, D., Otenko, A.: The PERMIS X.509 role based privilege management infrastructure. Future Generation Computer Systems 19(2), 277–289 (2003)
A Secure Internet Voting Scheme

Md. Abdul Based and Stig Fr. Mjølsnes

Department of Telematics, Norwegian University of Science and Technology (NTNU)
{based,sfm}@item.ntnu.no

Abstract. We describe information security requirements for a secure and functional Internet voting scheme. We then present a voting scheme with multiple parties that satisfies all these security requirements. In this scheme, the voter obtains a signed key from the registrar, where the registrar signs the key in blinded form. The voter uses this signed key during the voting period. All other parties can verify this signature without knowing the identity of the voter; hence the scheme provides voter privacy. The voting scheme also satisfies voter verifiability and public verifiability. In addition, the scheme is receipt-free.
1 Introduction
Internet voting has been a very active research topic in recent years. Many countries have started deploying Internet voting, and designing a secure Internet voting scheme remains a challenging task. In Internet voting, we want privacy for the voter, and at the same time we want to verify that the ballot cast by a voter has been counted properly. In addition, we do not want to reveal to a vote buyer or coercer how a ballot was cast. These requirements make Internet voting a tough topic for researchers. For a physically coercion-free election, it is recommended that the voter first go to an election booth controlled by the election officials, show some identity to get access to the booth, and then vote over the Internet. We can make the voting scheme remote so that people can vote from anywhere, but in that case we must give up protection against physical coercion: the voter may be coerced by a vote buyer to vote for a particular candidate in remote Internet voting. To prevent the voter from selling the vote, the voting scheme must be receipt-free. We assume that there is a voter computer [1] inside the election booth to perform the cryptographic tasks on behalf of the voter. The voter simply chooses a candidate, and the voter computer constructs the ballot from that vote on behalf of the voter. The voter is then unable to sell the vote to a vote buyer or coercer. We use a bulletin board in the voting scheme to make it verifiable. There are two kinds of verification: the voter can verify that the ballot is counted properly, and anyone can verify that the counting is correct. The first is called voter verifiability, and the second public or universal verifiability. Our voting scheme is both voter verifiable and publicly verifiable.

Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 141–152, 2011.
© Springer-Verlag Berlin Heidelberg 2011
142
M.A. Based and S.F. Mjølsnes
Outline of the paper: Related work is described in Section 2. The security requirements for a secure and functional Internet voting scheme are presented in Section 3. In Section 4, the voting scheme is presented. The corresponding message sequence diagrams are shown in Section 5. The security analysis is given in Section 6, and the conclusions and future plans are presented in Section 7.
2 Related Work
An Internet voting scheme with multiple parties is presented in [2], where a registrar was proposed to sign a key for the voter that is used during the voting period. However, the registrar could easily link the key with the identity of the voter in that scheme. We improve this part in our voting scheme: the registrar signs a blinded copy of the key K, so the registrar has no way to link the key with the identity of the voter. The inductive approach to verifying cryptographic protocols is published by Lawrence et al. in [3]; they describe protocols inductively as sets of traces, where a trace is a list of communication events. Kremer et al. present an analysis of an electronic voting protocol in the applied pi calculus in [4]. Groth investigates existing voting schemes based on homomorphic threshold encryption, evaluates their security in the UC framework in [5], and proposes some modifications to make the voting schemes secure against active adversaries. Zero-knowledge protocols for voting systems and homomorphic cryptosystems are published in [5,6,7]. A non-interactive zero-knowledge (NIZK) protocol is described in [1]; the authors propose multiple counting servers that count the ballots jointly and can individually verify the validity of a ballot without interacting with the voter. They also propose a voter computer to construct the ballot from a vote. We adopt their voter computer, and their NIZK protocol can also be used in our voting scheme.
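The blinding step mentioned above can be illustrated with a textbook RSA blind signature, in which the registrar signs the voter's key K without ever seeing it. The primes are tiny and fixed for illustration only; this is a sketch, not the paper's actual construction.

```python
# Minimal RSA blind-signature round: the registrar signs K blindly.
import math
import secrets

p, q, e = 1000000007, 998244353, 65537
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))   # registrar's private exponent

K = 123456789123                     # the voter's public key, as an integer

# Voter blinds K with a random r coprime to n ...
while True:
    r = secrets.randbelow(n - 2) + 2
    if math.gcd(r, n) == 1:
        break
blinded = (K * pow(r, e, n)) % n

# ... the registrar signs the blinded value (it never sees K) ...
blind_sig = pow(blinded, d, n)       # (K * r^e)^d = K^d * r (mod n)

# ... and the voter unblinds to get an ordinary signature on K.
sig = (blind_sig * pow(r, -1, n)) % n
assert pow(sig, e, n) == K % n       # anyone can verify with (n, e)
```

A production scheme would use full-size keys and sign a padded hash of K rather than K itself.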
3 Security Requirements
We classify the information security requirements of an Internet voting scheme as basic security requirements and enhanced security requirements [8]. These requirements are mostly related to each other.

3.1 The Basic Requirements
The basic security requirements are:

Eligibility of the Voter. To allow only authenticated voters to cast a ballot, eligibility of the voter is important. This requires authentication and identification of the voter, so that only eligible voters can cast a ballot.
Confidentiality of the Ballot. The ballot cast by a voter must be kept confidential throughout the transmission from the voter to the servers. The servers represent the election officials, authentication servers or counting servers. In most cases, public-key cryptography (PKC) is used for this purpose through public-key encryption: the voter encrypts the ballot using the public key of the servers, so only the servers can decrypt and see the contents of the ballot.

Integrity of the Ballot. Integrity of the ballot guarantees that the ballot cast by an eligible voter cannot be altered or modified by any unauthorized party. To provide integrity, digital signature or group signature [9] techniques are mostly used [10].

Privacy and Secrecy. Privacy refers to both the privacy of the voter and the privacy of the ballot, and should not be compromised in a voting scheme. Secrecy means that the way in which a voter casts the ballot should not be revealed to any party [11].

Robustness. The voting scheme should ensure that the components (both the servers and the voter computers) remain functional during the voting period, so that if some components fail, the other components still work without stopping the voting process.

Fairness. The voting scheme should not publish any intermediate results, so that no participant gains knowledge of partial results before the final tally. This is important because knowledge of partial results may influence voters who have not yet cast their ballots. That is why a voting scheme should be fair.

Soundness. The counting process of a voting scheme should discard any ballot that is not valid. Ballot verification is therefore needed before counting the ballots.

Completeness. The counting process of a voting scheme is complete if it counts all the valid ballots.
In this case, the final tally equals the number of valid ballots cast by the voters. Verifiability of the counting process is important to provide completeness in a voting scheme.

Unreuseability of the Ballot. The counting process of a voting scheme should count only one ballot for each eligible voter. The authentication process in some schemes does not allow a voter to vote twice or more, though other schemes allow a voter to vote multiple times; in that case, the counting process includes only the final ballot cast by the voter.

3.2 The Enhanced Requirements
The enhanced security requirements for Internet voting schemes are:

Unlinkability and Untraceability. Unlinkability between the voter and the ballot is important for the privacy of the voter and the ballot. Untraceability means that neither the voter nor the ballot acquirer can add identifiable information to the ballot. This property supports the unlinkability property.
Validity of the Ballot. The counting process of a voting scheme should validate the ballots before counting them, to satisfy the soundness, completeness, and unreuseability requirements of the counting process. In an Internet voting scheme, we define a ballot as valid if it is cast by an eligible voter and the content of the ballot is also valid. In general, Zero-Knowledge Proof (ZKP) protocols [14] are mainly used for this purpose. ZKP protocols are classified into two types, based on the interaction between the prover and the verifier:
• Interactive Zero-Knowledge Proof protocols
• Non-Interactive Zero-Knowledge Proof protocols
Interactive zero-knowledge protocols [7] allow the prover to interact with the verifier and prove the validity of the ballot. In voting schemes, the voters play the role of the prover and the counting servers play the role of the verifier. First, the voter commits to a value of the ballot and sends the ballot to the servers. The servers then send a random challenge string to the voter, and the voter responds to the challenge. The verifiers then verify the ballot by verifying the responses. Unlike an interactive zero-knowledge protocol, there is no interaction between the prover and the verifier in a Non-Interactive Zero-Knowledge (NIZK) protocol [10]: the verifier verifies the ballot without interacting with the prover. A NIZK protocol requires no online communication between the prover and the verifier, and hence is usually faster and more efficient. On the other hand, a NIZK protocol requires that the prover and verifier share a common random string (as the challenge string), usually provided by a trusted third party, and a pre-arranged use of this random string is required.

Verifiability. Verifiability refers to the verifiability of the counting process. It is divided into two categories:

• Universal Verifiability or Public Verifiability
• Voter Verifiability

In universal verifiability or public verifiability, any observer can verify that the counting process contains all the valid ballots and that the final tally is correct [6]. On the other hand, if only the voters can verify the counting process and the tally, then the voting scheme is called voter verifiable [1,12]. In the scheme of [1], the counting servers publish the final tally and the nonce values that were added to the ballots; the scheme is voter verifiable, since every voter can verify the nonce values.

Receipt-freeness. In Internet voting, receipt-freeness [13] is another challenging requirement, introduced by Benaloh and Tuinstra [15] to provide coercion-resistance. Receipt-freeness means that, in a voting scheme, voters should not be able to produce a receipt of the ballot proving for which candidate the ballot was cast. This is very important to prevent vote-buying by a coercer or vote buyer.
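The interactive/non-interactive distinction can be illustrated with a Schnorr-style proof of knowledge made non-interactive via the Fiat-Shamir transform, where the challenge is derived from a hash instead of being sent by the verifier. The group parameters and names below are illustrative; the cited papers use different constructions.

```python
# Schnorr-style NIZK sketch: prove knowledge of x with y = g^x mod p,
# deriving the challenge c from a hash (Fiat-Shamir) instead of the
# verifier, so no interaction is needed.
import hashlib
import secrets

p = 2**127 - 1                     # prime modulus (Mersenne prime)
g = 3                              # base (illustrative choice)
x = secrets.randbelow(p - 2) + 1   # prover's secret
y = pow(g, x, p)                   # public value committing to x

def _challenge(t):
    return int.from_bytes(hashlib.sha256(str(t).encode()).digest(),
                          "big") % (p - 1)

def nizk_prove(x):
    k = secrets.randbelow(p - 2) + 1
    t = pow(g, k, p)               # commitment
    c = _challenge(t)              # hash replaces the verifier's challenge
    s = (k + c * x) % (p - 1)      # response
    return t, s

def nizk_verify(y, t, s):
    c = _challenge(t)
    # g^s == t * y^c  iff  s == k + c*x  (exponents mod p-1, by Fermat)
    return pow(g, s, p) == (t * pow(y, c, p)) % p

t, s = nizk_prove(x)
assert nizk_verify(y, t, s)
```

An interactive version would instead have the verifier pick c at random after receiving t; the algebraic check is identical.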
Fig. 1. The Voting Scheme
Coercion-resistance. Coercion-resistance is achieved through receipt-freeness if the voting scheme does not allow a vote buyer or coercer to force a voter to vote for a particular candidate. Re-voting or multiple voting [1] by a voter is allowed in some schemes to make them coercion-resistant. To protect the voter from physical coercion, election-booth-based voting schemes are recommended. In the case of remote Internet voting, there is no cryptographic way to make it free from physical coercion.
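The re-voting rule, under which only a voter's final ballot is counted (applied by the ballot acquirer in our scheme), amounts to simple deduplication; the voter names and ballot labels below are made up.

```python
# Keep only the final ballot per voter: later submissions overwrite
# earlier ones, so a coerced earlier vote can be replaced.
received = [("alice", "ballot-1"), ("bob", "ballot-1"),
            ("alice", "ballot-2")]

latest = {}
for voter, ballot in received:
    latest[voter] = ballot

assert latest == {"alice": "ballot-2", "bob": "ballot-1"}
```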
4 The Voting Scheme
Fig. 1 shows the voting scheme. In this scheme, the voter interacts with a registrar to get a signed key and uses that key for voting. The voter then sends the ballot to the ballot acquirer together with this signed key. The ballot acquirer verifies the registrar's signature on the key and sends the ballot to the counting servers after the voting period is over. The ballot acquirer also signs the key and sends it back to the voter, who publishes it on the bulletin board. The counting servers likewise sign the key and publish the signed key on the bulletin board. The voter and any observer can thus verify that the number of keys signed by the counting servers is equal to or less than the number of keys signed by the ballot acquirer.

4.1 The Parties
The various parties in the voting scheme are as follows:
. Voter - A voter is an eligible one to vote. We assume that the voter holds a smartcard containing some private information that identifies the eligibility
146
.
. .
.
.
4.2
M.A. Based and S.F. Mjølsnes
of the voter to the registrar. The voter creates a private/public key pair, where the public key is to be signed by the registrar so that the voter can use this key during the voting period. After the voter has chosen the candidate to vote for, the choice is encoded, and the ballot is split into shares and sent to the ballot acquirer. We do not describe the ballot construction here, rather refer to [1]. Voter Computer - Voter Computer (VC) is used to perform the cryptographic task on behalf of the voter to construct the ballot from a vote. We assume that VC is trusted and operates inside an election booth. The voter cannot see the cryptographic detail of the construction of a ballot, so cannot sell the ballot to the vote buyer or coercer. Registrar - The role of the registrar is to validate the eligibility of the voter and to sign the blinded copy of the public key supplied by the voter. Ballot Acquirer - The role of the ballot acquirer is to acquire the ballot from the voter and to verify the signature of the registrar on the key included with the ballot. After the ballot acquisition stage, the ballot acquirer mixes all the ballots and sends these to the counting servers. To defend coercion, a voter may vote multiple times. In that case, the ballot acquirer keeps a record and sends the latest ballot of the voter to the counting servers. Bulletin Board - The content of the bulletin board is public. This board is used to provide voter verifiability and universal verifiability. After voting, the voter sends the signed key (signed by the ballot acquirer) to the bulletin board. After counting, the same key is published by the counting servers. All counting servers will publish the same key, so no single counting server can alter or delete a key without detection. The voter can verify that the key signed by the ballot acquirer is the same key published by the counting servers. 
Any observer can observe that the number of keys signed by the ballot acquirer must be equal to or larger than the number of keys published by the counting servers, because the counting servers may discard some invalid ballots after ballot verification; in that case the corresponding key is not published. The counting servers can, however, maintain a list of the keys of the discarded ballots (if needed) for future claims by the voters.

Counting Servers - A group of counting servers verifies that each individual ballot is correctly constructed and computes the result of the election using multiparty computations. No single counting server gets complete information about an individual ballot, but together they reach a result that all counting servers agree upon.

4.2 Communication Model
We assume that the bulletin board is implemented as a centralized server and that its contents are public. The voter and the counting servers communicate with the bulletin board over the Internet. The counting servers share the same role but are run by different parties; the communication between them is done over authenticated and encrypted channels.
A Secure Internet Voting Scheme
Fig. 2. The Protocol between the Voter and the Registrar
The communication between the voter and the registrar is interactive over the Internet. The implementations of the Registrar, Ballot Acquirer, Counting Servers, and Bulletin Board are owned, developed and deployed by different trusted parties, and all Internet connections are authenticated and confidential based on standard technology such as TLS/SSL. Each party is assumed to have a public key known to all roles and a private key to sign messages with.
5 The Protocol
The protocol is shown as message sequence diagrams in this section. The voter shows some valid identity to enter the election booth and generates the key pair (K, K−1), where K−1 is the private key and K is the corresponding public key.

5.1 The Protocol between the Voter and the Registrar
The protocol between the voter and the registrar is shown in Fig. 2. The voter sends a signed and encrypted blinded copy of K (the blinding scheme is not shown in detail here) to the registrar. That is, the voter sends {[K]V−1}R to the registrar: the blinded copy of K, signed with the private key of the voter V−1, and encrypted with the public key of the registrar R. The registrar verifies the signature of the voter, and if the voter is valid and has not voted before, the registrar signs the blinded K with its private key R−1, encrypts it with the public key of the voter V, and sends this signed and encrypted blinded key to the voter. The voter can then unblind the message and obtain a signed public key.
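The blind-then-sign-then-unblind exchange can be illustrated with a textbook RSA blind signature. The tiny primes, exponents and message encoding below are illustrative assumptions, not the paper's actual blinding scheme.

```python
# Toy RSA blind signature over tiny primes (textbook RSA, illustrative only;
# the paper's actual blinding scheme is not specified in detail).
p, q = 61, 53
n = p * q                            # registrar's RSA modulus
e = 17                               # registrar's public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # registrar's private exponent

def blind(m, r):                     # voter blinds the key m with factor r
    return (m * pow(r, e, n)) % n

def sign_blinded(mb):                # registrar signs without seeing m
    return pow(mb, d, n)

def unblind(sb, r):                  # voter strips the blinding factor
    return (sb * pow(r, -1, n)) % n

m = 99                               # stand-in for the voter's public key K
r = 101                              # blinding factor, coprime to n
s = unblind(sign_blinded(blind(m, r)), r)
assert s == pow(m, d, n)             # a valid registrar signature on m itself
assert pow(s, e, n) == m             # anyone can verify it with (e, n)
```

The registrar never sees m in the clear, which is the unlinkability property used later in the security analysis.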
5.2 The Protocol between the Voter and the Ballot Acquirer
The protocol between the voter and the Ballot Acquirer is shown in Fig. 3. The voter generates the ballot (we can assume that a voter computer [1] will perform the cryptographic task to construct a ballot from a vote so that the voter does not know the cryptographic details of the vote), signs it with the private key K −1 , adds the signed KR−1 to the ballot, and encrypts it with the
Fig. 3. The Protocol between the Voter and the Ballot Acquirer
Fig. 4. The Protocol between the Voter and the Bulletin Board
public key of the counting server (CS). The voter then adds the signed key (KR−1) again to this message and encrypts the message with the public key of the ballot acquirer (BA). The voter sends this signed and encrypted ballot to the ballot acquirer. The ballot acquirer can verify the signature on the key after decrypting the message. The ballot acquirer signs the key and sends the signed key KBA−1 to the voter.
5.3 The Protocol between the Voter and the Bulletin Board
The protocol between the voter and the bulletin board is shown in Fig. 4. The voter sends the signed key KBA−1 to the bulletin board.
5.4 The Protocol between the Ballot Acquirer and the Counting Servers
The protocol between the Ballot Acquirer and the Counting Servers is shown in Fig. 5. After verifying the signature on the key K, the ballot acquirer signs the encrypted message received from the voter with its private key BA−1 and sends it to the counting servers. The counting servers only accept ballots signed by the ballot acquirer.
Fig. 5. The Protocol between the Ballot Acquirer and the Counting Servers
Fig. 6. The Protocol between the Counting Servers and the Bulletin Board
5.5 The Protocol between the Counting Servers and the Bulletin Board
The protocol between the Counting Servers and the Bulletin Board is shown in Fig. 6. The counting servers (we assume that there are multiple counting servers to count the ballots [1]) decrypt the messages, count the ballots, and publish the tally. The counting servers also sign and publish the signed key KCS−1 on the bulletin board (BB).
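The multiparty count can be sketched with the additive shares: each counting server sums the shares it holds and publishes only a subtotal, yet the subtotals combine to the correct tally. The vote encoding (1 = yes) and the modulus are illustrative assumptions.

```python
import random

# Toy multiparty tally: each ballot is additively shared among the counting
# servers; each server publishes only a subtotal, yet the subtotals combine
# to the correct tally. Encoding (1 = yes) and the modulus are illustrative.
p = 2003
ballots = [1, 0, 1, 1]

def split(v, n):
    shares = [random.randrange(p) for _ in range(n - 1)]
    return shares + [(v - sum(shares)) % p]

n_servers = 3
# server_shares[j] is the list of shares held by server j, one per ballot
server_shares = list(zip(*(split(b, n_servers) for b in ballots)))
subtotals = [sum(col) % p for col in server_shares]
tally = sum(subtotals) % p
assert tally == sum(ballots)   # three "yes" votes, no ballot ever revealed
```

No single subtotal reveals how any voter voted, which matches the claim that no single counting server gets complete information about an individual ballot.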
6 Security Analysis
This section informally analyzes the security properties of the voting scheme presented in this paper. The voting scheme satisfies the following:

Eligibility. The voter shows an identity to the election officials to enter the election booth and then sends the signed key (in blinded form) to the registrar. The registrar verifies the signature of the voter. So, only an eligible voter can vote.

Confidentiality and Integrity of the Ballot. In our scheme, the ballot is encrypted with the public key of the counting servers and signed using the private key K−1 of the voter. Only the counting servers can decrypt and see the content of the ballot. The counting servers can also verify the key KR−1 by verifying the signature of the registrar on this key (because the key must be signed by the registrar). This provides confidentiality and integrity of the ballot.
Privacy and Secrecy. In our scheme, the voter first enters an election booth controlled by the election officials by showing some identity and credentials. Then the voter casts the vote inside the booth. No one learns how the voter casts the ballot inside the election booth. Using the blind signature and ballot encryption (encryption with the public key of the counting servers) techniques, our scheme provides privacy of the voter and the ballot.

Robustness and Fairness. There are multiple counting servers in our voting scheme. As long as a threshold number of counting servers is functional, the voting scheme will work. No intermediate results should be published before the final tally. Since multiple counting servers compute the final tally together, it is not possible for a single counting server to publish any intermediate result. The counting servers work together to count and publish the result of the voting.

Soundness, Completeness, and Unreusability of the Ballot. These properties are achieved through ballot verification, voter verifiability and universal verifiability.

Unlinkability and Untraceability. In our scheme, the voter sends a blinded copy of the key K, signed by the voter, to the registrar. The registrar verifies the signature and signs the blinded copy of the key. Then the registrar sends this blinded key, now signed by the registrar, back to the voter. The voter unblinds it, obtains the signed key, and uses this key during the voting period. The registrar has no knowledge of the key, so it cannot link the voter with this key. Hence, no one can add identifiable information to the ballot in our scheme.

Validity of the Ballot. Regarding the validation of the ballot by a set of counting servers, our scheme is similar to the scheme presented in [10]. By using the non-interactive zero-knowledge protocol [10], the counting servers can individually verify the validity of the ballot.

Voter Verifiability.
After voting, the voter receives the signed key from the ballot acquirer (as shown in Fig. 4). This confirms that the ballot acquirer has received the ballot. The counting servers also publish the same key, signed by these servers, on the bulletin board. The voter can easily verify that the key included by the voter in the ballot is published by the counting servers.

Universal Verifiability. Any observer can inspect the contents of the bulletin board. The counting servers publish the signed key (signed by the counting servers) after ballot verification and counting. The voter publishes the signed key (signed by the ballot acquirer) after the ballot has been received by the ballot acquirer. The number of keys published by the counting servers must be equal to or less than the number of keys signed by the ballot acquirer (the counting servers may discard some invalid ballots after ballot verification). Since all the counting servers will publish the same key, no single counting server can alter or delete a key without detection.
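The observer's universal-verifiability check reduces to a simple comparison of the two key lists on the bulletin board. The key strings below are placeholders for illustration.

```python
# Sketch of the observer's universal-verifiability check: every key the
# counting servers publish must have been signed by the ballot acquirer,
# and the published count can only be equal or smaller (invalid ballots
# are discarded). The key strings are placeholders.
def observer_check(ba_signed_keys, cs_published_keys):
    return (set(cs_published_keys) <= set(ba_signed_keys)
            and len(cs_published_keys) <= len(ba_signed_keys))

ba = ["k1", "k2", "k3"]          # keys signed by the ballot acquirer
cs = ["k1", "k3"]                # k2's ballot was invalid and discarded
assert observer_check(ba, cs)
assert not observer_check(ba, ["k9"])  # an unauthorized key is detected
```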
Receipt-freeness and Coercion-resistance. We assume a voter computer inside the election booth that performs the cryptographic task on behalf of the voter to construct the ballot from a vote. The voter does not know the detailed construction of the ballot, so the voter cannot prove to anyone how the voting was done. We regard this voter computer as a trusted ballot generator. Thus, the voting scheme is receipt-free and hence coercion-resistant.
7 Conclusions and Future Plans
In this paper, we first present the security requirements for a secure Internet voting scheme. We then present a new scheme that satisfies all these security requirements. The scheme involves multiple parties for voter authentication (by the registrar) and ballot generation (by the voter computer), and multiple counting servers to count the ballots. We do not need to trust all the counting servers: as long as a single counting server is honest, the ballot verification and counting will be correct in our voting scheme. This essentially increases public trust in ballot counting. The voter and any observer can verify that the counting is correct by observing the contents of the bulletin board. Thus, voter verifiability and public verifiability are also satisfied. In summary, we have presented a voting scheme that satisfies all the basic and enhanced security requirements for Internet voting schemes. We have analyzed the security properties of our voting scheme informally. Ongoing work involves verification of these security properties using a formal verification tool, for example Isabelle [3]. After implementation of some parts of the scheme (counting), performance evaluation is an important item of future work.

Acknowledgement. The authors are thankful to Peter Ryan, University of Luxembourg, and to David Gray and Denis Butin, Dublin City University, for their valuable comments and suggestions regarding the voting scheme presented in this paper.
References

1. Based, M.A., Mjølsnes, S.F.: Universally Composable NIZK Protocol in an Internet Voting Scheme. In: Cuellar, J., et al. (eds.) STM 2010. LNCS, vol. 6710, pp. 147–162. Springer, Heidelberg (2011)
2. Based, M.A., Reistad, T.I., Mjølsnes, S.F.: Internet Voting using Multiparty Computations. In: Proceedings of the 2nd Norwegian Security Conference (NISK 2009), pp. 136–147. Tapir Akademisk Forlag (2009) ISBN 978-82-519-2492-4
3. Paulson, L.C.: The Inductive Approach to Verifying Cryptographic Protocols. Journal of Computer Security (2000)
4. Kremer, S., Ryan, M.: Analysis of an Electronic Voting Protocol in the Applied Pi Calculus. In: Sagiv, M. (ed.) ESOP 2005. LNCS, vol. 3444, pp. 186–200. Springer, Heidelberg (2005)
5. Groth, J.: Evaluating Security of Voting Schemes in the Universal Composability Framework. Springer, Heidelberg (2004) ISBN 978-3-540-22217-0
6. Schoenmakers, B.: A Simple Publicly Verifiable Secret Sharing Scheme and its Application to Electronic Voting. In: Wiener, M. (ed.) CRYPTO 1999. LNCS, vol. 1666, pp. 148–164. Springer, Heidelberg (1999)
7. Iversen, K.R.: The Application of Cryptographic Zero-Knowledge Techniques in Computerized Secret Ballot Election Schemes. Ph.D. dissertation, IDT-report 1991:3, Norwegian Institute of Technology (February 1991)
8. Meng, B.: Analyzing and Improving Internet Voting Protocol. In: Proceedings of the IEEE International Conference on e-Business Engineering, pp. 351–354. IEEE Computer Society, Los Alamitos (2007) ISBN 0-7695-3003-6
9. Boyen, X., Waters, B.: Compact Group Signatures Without Random Oracles. In: Vaudenay, S. (ed.) EUROCRYPT 2006. LNCS, vol. 4004, pp. 427–444. Springer, Heidelberg (2006)
10. Based, M.A., Mjølsnes, S.F.: A Non-interactive Zero Knowledge Proof Protocol in an Internet Voting Scheme. In: Proceedings of the 2nd Norwegian Security Conference (NISK 2009), pp. 148–160. Tapir Akademisk Forlag (2009) ISBN 978-82-519-2492-4
11. Gray, D., Sheedy, C.: E-Voting: a new approach using Double-Blind Identity-Based Encryption. Presented at STM 2010: 6th International Workshop on Security and Trust Management, Athens, Greece (September 23–24, 2010)
12. Chaum, D.: Secret-ballot receipts: True voter-verifiable elections. IEEE Security and Privacy 2(1), 38–47 (2004)
13. Hirt, M., Sako, K.: Efficient Receipt-Free Voting Based on Homomorphic Encryption. In: Preneel, B. (ed.) EUROCRYPT 2000. LNCS, vol. 1807, pp. 539–556. Springer, Heidelberg (2000)
14. Krantz, S.G.: Zero Knowledge Proofs. AIM Preprint Series, 2007-46 (July 25, 2007)
15. Benaloh, J., Tuinstra, D.: Receipt-free secret-ballot elections. In: Proceedings of STOC 1994, pp. 544–553 (1994)
A Hybrid Graphical Password Based System

Wazir Zada Khan1, Yang Xiang2, Mohammed Y. Aalsalem1, and Quratulain Arshad1

1 School of Computer Science, Jazan University, Saudi Arabia
{wazirzadakhan,aalsalem.m}@jazanu.edu.sa, [email protected]
2 School of Information Technology, Deakin University, Australia
[email protected]

Abstract. In this age of electronic connectivity, where we all face viruses, hackers, eavesdropping and electronic fraud, there is indeed no time when security is not critical. Passwords provide a security mechanism for authentication and protection services against unwanted access to resources. A graphical password is one promising alternative to textual passwords. According to human psychology, humans are able to remember pictures easily. In this paper, we propose a new hybrid graphical password based system, which is a combination of recognition and recall based techniques that offers many advantages over the existing systems and may be more convenient for the user. Our scheme is resistant to shoulder surfing attacks and many other attacks on graphical passwords. This scheme is proposed for small mobile devices (such as smart phones, iPods, iPhones, and PDAs), which are more handy and convenient to use than traditional desktop computer systems.

Keywords: Graphical passwords, Authentication, Network Security.
1 Introduction

A password is a secret that is shared by the verifier and the customer: "Passwords are simply secrets that are provided by the user upon request by a recipient." They are often stored on a server in an encrypted form so that a penetration of the file system does not reveal password lists [2]. Passwords are the most common means of authentication because they do not require any special hardware. Typically passwords are strings of letters and digits, i.e. they are alphanumeric. Such passwords have the disadvantage of being hard to remember [3]. Weak passwords are vulnerable to dictionary attacks and brute force attacks, whereas strong passwords are harder to remember. To overcome the problems associated with password based authentication systems, researchers have proposed the concept of graphical passwords, which use pictures instead of textual passwords and are partially motivated by the fact that humans can remember pictures more easily than a string of characters [4]. Graphical passwords have been known since the mid 1990s; the idea was originally described by Greg Blonder in 1996 [5]. The first and most important advantage is that they are easier to remember than textual passwords. Human beings have the ability to remember faces of people, places they visit and things they have
Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 153–164, 2011. © Springer-Verlag Berlin Heidelberg 2011
W.Z. Khan et al.
seen for a longer duration. Thus, graphical passwords provide a means for making more user-friendly passwords while increasing the level of security. Besides these advantages, the most common problem with graphical passwords is the shoulder surfing problem: an onlooker can steal a user's graphical password by watching in the user's vicinity. Many researchers have attempted to solve this problem by providing different techniques [6]. Due to this problem, most graphical password schemes recommend small mobile devices (PDAs) as the ideal application environment. Another common problem with graphical passwords is that it takes longer to input graphical passwords than textual passwords [6]. The login process is slow and it may frustrate impatient users. Graphical passwords serve the same purpose as textual passwords, differing in that they consist of handwritten designs (drawings), possibly in addition to text. The use of smart phones and PDAs has increased due to their small size, compact deployment and low cost. In this paper, considering the problems of text based password systems, we propose a new graphical password scheme which has desirable usability for small mobile devices. Our proposed system is a new hybrid graphical password based system, which is a combination of recognition and recall based techniques and consists of two phases. During the first phase, called the Registration phase, the user first selects a username and a textual password. Then objects are shown to the user, from which the user selects some as his graphical password. After selecting, the user has to draw the selected objects on a touch sensitive screen using a stylus. During the second phase, called the Authentication phase, the user has to give his username and textual password and then give his graphical password by drawing it in the same way as done during the registration phase. If the objects are drawn correctly, the user is authenticated and only then can he/she access his/her account.
For practical implementation of our system we have chosen the i-mate JAMin smart phone, produced by HTC; other devices such as the Palm Pilot, Apple Newton, and Casio Cassiopeia E-20 also allow users to provide graphical input. It has a display size of 240x320 pixels and an important feature of handwriting recognition. The implementation details are out of the scope of this paper. The rest of the paper is organized as follows. In Section 2, all existing graphical password based schemes are classified into four main categories. Section 3 gives a short review of existing research and schemes which are strongly related to our work. Section 4 discusses the problems of all existing graphical password based schemes. In Section 5 our proposed system is described in detail. In Section 6 we compare our proposed system with existing schemes by drawing out the flaws in the existing schemes. Section 7 provides discussion. Finally, Section 8 concludes the paper.
2 Classification of Graphical Password Based Systems

Graphical password schemes can be broadly classified into four main categories: recognition-based, cued-recall based, pure-recall based and hybrid systems. Recognition Based Systems are also known as Cognometric Systems or Searchmetric Systems. Recognition based techniques involve identifying whether one has seen an image before. The user must only be able to recognize previously seen images, not generate them unaided from memory.
Pure Recall Based Systems are also known as Drawmetric Systems. In pure recall-based methods the user has to reproduce something that he or she created or selected earlier during the registration stage. Cued Recall Based Systems are also called Iconmetric Systems. In cued recall-based methods, a user is provided with a hint so that he or she can recall his/her password. Hybrid Systems are typically a combination of two or more schemes, such as recognition and recall based, or textual with graphical password schemes.
3 Related Work

Haichang Gao et al. [10] have proposed and evaluated a new shoulder-surfing resistant scheme called Come from DAS and Story (CDS), which has desirable usability for PDAs. This scheme adopts a drawing input method similar to DAS and inherits the association mnemonics in Story for sequence retrieval. It requires users to draw a curve across their password images (pass-images) in order, rather than clicking directly on them. The drawing method seems to be more compatible with people's writing habits, which may shorten the login time. The drawing input trick, along with complementary measures such as erasing the drawing trace, displaying degraded images, and starting and ending with randomly designated images, provides good resistance to shoulder surfing. A user study was conducted to explore the usability of CDS in terms of accuracy, efficiency and memorability, and to benchmark the usability against that of the Story scheme. The main contribution is that it overcomes a drawback of recall-based systems by erasing the drawing trace, and introduces the drawing method to a variant of Story to resist shoulder surfing.

P.C. van Oorschot and Tao Wan [1] have proposed a hybrid authentication approach called Two Step. In this scheme users continue to use text passwords as a first step, but then must also enter a graphical password. In step one, a user is asked for her user name and text password. After supplying this, and independent of whether or not it is correct, in step two the user is presented with an image portfolio. The user must correctly select all images (one or more) pre-registered for this account in each round of graphical password verification. Otherwise, account access is denied despite a valid text password. Using text passwords in step one preserves the existing user sign-in experience. If the user's text password or graphical password is correct, the image portfolios presented are those defined during password creation.
Otherwise, the image portfolios (including their layout dimensions) presented in the first and subsequent rounds are random-looking, but are a deterministic function of the user name and text password string entered and of the images selected in the previous round.
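The deterministic decoy-portfolio idea can be sketched by seeding a pseudo-random generator with a hash of the inputs. The image pool, portfolio size, and derivation below are illustrative assumptions, not the actual Two Step algorithm.

```python
import hashlib
import random

# Hypothetical sketch of a deterministic decoy portfolio derived from the
# (user name, text password) pair, as Two Step's description suggests.
# IMAGE_POOL, k, and the derivation are illustrative assumptions.
IMAGE_POOL = [f"img{i:03d}" for i in range(500)]

def decoy_portfolio(username, password, k=9):
    seed = hashlib.sha256(f"{username}:{password}".encode()).digest()
    rng = random.Random(seed)          # deterministic, seeded PRNG
    return rng.sample(IMAGE_POOL, k)   # same inputs -> same portfolio

# Repeating the wrong login always shows the same decoys, so an attacker
# cannot distinguish a decoy portfolio from a genuine one by retrying.
assert decoy_portfolio("alice", "pw") == decoy_portfolio("alice", "pw")
assert decoy_portfolio("alice", "pw") != decoy_portfolio("alice", "pw2")
```

The point of the determinism is that an attacker who retries the same wrong password learns nothing new from the displayed portfolio.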
4 Problem Domain

There are many problems with each of the graphical based authentication methods. These are discussed below:
4.1 Problems of Recognition Based Methods

Dhamija and Perrig proposed a graphical password scheme, Déjà Vu, based on a Hash Visualization technique [11]. The drawback of this scheme is that the server needs to store a large number of pictures which may have to be transferred over the network, delaying the authentication process. Another weakness is that the server needs to store the seeds of the portfolio images of each user in plaintext. Also, the process of selecting a set of pictures from the picture database can be tedious and time consuming for the user [7]. This scheme was not really secure because the passwords must be stored in the database, where they are easy to see.

Sobrado and Birget developed a graphical password technique that deals with the shoulder surfing problem [3]. In their first scheme the system displays a number of pass-objects (pre-selected by the user) among many other objects, as shown in Fig. 1. To be authenticated, a user needs to recognize the pass-objects and click inside the convex hull formed by all the pass-objects. They developed several schemes to solve the shoulder surfing problem, but the main drawback of these schemes is that the login process can be slow.

Another recognition based technique was proposed by Man et al.: a shoulder-surfing resistant algorithm similar to that developed by Sobrado and Birget. The difference is that Man et al. introduced several variants for each pass-object, with each variant assigned a unique code. Thus, during authentication, the user recognizes the pre-selected objects and enters an alphanumeric code and a string for each pass-object. Although it is very hard to break this kind of password, this method still requires the user to memorize the alphanumeric codes for the pass-object variants.

"Passface" is another recognition based system. It is argued by its developers that it is easier for human beings to remember human faces than any other kind of passwords.
However, Davis et al. [12] have found that most users tend to choose faces of people from the same race. This makes the Passface password somewhat predictable. Furthermore, some faces might not be welcomed by certain users, making the login process unpleasant. Another limitation of this system is that it cannot be used by people who are face-blind [6].

4.2 Problems of Recall Based Methods

The problem with grid based methods is that during authentication the user must draw his/her password in the same grids and in the same sequence. It is really hard to remember the exact coordinates of the grid. The problem with Passlogix is that the full password space is small; in addition, a user-chosen password might be easily guessable [6]. The DAS scheme has several limitations: it is vulnerable to shoulder surfing attacks if a user accesses the system in public environments; there is still a risk that attackers gain access to the device if they obtain a copy of the stored secret; brute force attacks can be launched by trying all possible combinations of grid coordinates; drawing a diagonal line and identifying a starting point of any oval shape using the DAS scheme itself can be a challenge for users; and, finally, difficulties might arise when the user chooses a drawing containing strokes that pass too close to a grid line, so that the scheme may not be able to distinguish which cell the user is choosing.
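The convex-hull click test at the heart of the Sobrado and Birget scheme (Section 4.1) reduces to a point-in-convex-polygon check. A minimal sketch, with illustrative coordinates and the hull vertices given in counter-clockwise order:

```python
# Point-in-convex-polygon test: the click is accepted only if it falls
# inside the convex hull of the displayed pass-objects. Coordinates are
# illustrative; hull vertices must be in counter-clockwise order.
def inside_convex_hull(point, hull):
    px, py = point
    for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]):
        # Cross product: negative means the point lies to the right of
        # this edge, i.e. outside the hull.
        if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) < 0:
            return False
    return True

hull = [(0, 0), (4, 0), (4, 4), (0, 4)]   # pass-object positions
assert inside_convex_hull((2, 2), hull)
assert not inside_convex_hull((5, 2), hull)
```

An onlooker who sees the click learns only a point inside a large region, not which objects are the pass-objects.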
“PassPoints” is an extended version of Blonder's idea, eliminating the predefined boundaries and allowing arbitrary images to be used. Using this scheme, it takes time to locate the correct click region and determine precisely where to click. Another problem with these schemes is that it is difficult to input a password through a keyboard, the most common input device; if the mouse doesn't function well or a light pen is not available, the system cannot work properly [6]. Overall, with both “PassPoints” and “Passlogix”, looking for small spots in a rich picture might be tiresome and unpleasant for users with weak vision. The main drawback of VisKey's scheme is input tolerance. Pointing to the exact spots on the picture has proven to be quite hard, so VisKey accepts all input within a certain tolerance area around each spot. It also allows users to set the size of this area in advance. However, some caution related to the input precision needs to be taken, since it directly influences the security and the usability of the password. With practically set parameters, a four-spot VisKey theoretically provides approximately 1 billion possibilities for defining a password. Unfortunately this is not large enough to prevent off-line attacks by a high-speed computer; therefore no fewer than seven defined spots are required to overcome the likelihood of brute force attacks.
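The quoted VisKey figures can be reproduced with back-of-the-envelope arithmetic under one assumption: that each spot can land in one of N distinguishable tolerance regions. The paper does not state N; a value around 178 is a guess chosen to match the "1 billion" figure.

```python
# Back-of-the-envelope check of the VisKey password-space figures, assuming
# each spot resolves to one of N distinguishable tolerance regions (N is a
# hypothetical value chosen to match the quoted numbers).
N = 178
four_spots = N ** 4    # just over 1e9, weak against offline attack
seven_spots = N ** 7   # around 5.7e15, the seven-spot recommendation
print(f"{four_spots:.2e}")
print(f"{seven_spots:.2e}")
```

This makes the scaling argument concrete: each additional spot multiplies the password space by N.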
Fig. 1. A Shoulder Surfing Resistant Graphical Password Scheme [3]
5 Proposed System

Taking into account all the problems and limitations of graphical based schemes, we have proposed a hybrid system for authentication. This hybrid system is a mixture of both recognition and recall based schemes. Our proposed system is an approach towards more reliable, secure, user-friendly, and robust authentication. We have also reduced the shoulder surfing problem to some extent.

5.1 Proposed Algorithm

Steps 1-3 are registration steps and steps 4-9 are the authentication steps. The algorithm of our proposed system is as follows:
─ Step 1: The user types a user name and a textual password, which are stored in the database. During authentication the user has to give that specific user name and textual password in order to log in.
─ Step 2: Objects are displayed to the user, and he/she selects a minimum of three objects from the set; there is no limit on the maximum number of objects. This is done using one of the recognition based schemes. The selected objects are then drawn by the user and stored in the database with the specific username. Objects may be symbols, characters, auto shapes, simple daily seen objects, etc. Examples are shown in Fig. 2.
─ Step 3: During authentication, the user draws the pre-selected objects as his password on a touch sensitive screen (or according to the environment) with a mouse or a stylus. This is done using the pure recall based methods.
─ Step 4: The system performs pre-processing.
─ Step 5: The system gets the input from the user and merges the strokes in the user-drawn sketch.
─ Step 6: After stroke merging, the system constructs the hierarchy.
─ Step 7: The system performs sketch simplification.
─ Step 8: Three types of features are extracted from the sketch drawn by the user.
─ Step 9: The last step is hierarchical matching.

During registration, the user selects the user name and a textual password in a conventional manner and then chooses the objects as password. The minimum length for the textual password is L = 6. The textual password can be a mixture of digits, lowercase and uppercase letters. After this the system shows objects on the screen of a PDA to select as a graphical password. After choosing the objects, the user draws them on the screen with a stylus or a mouse. Objects drawn by the user are stored in the database
Fig. 2. Some examples of objects shown to the user
with his/her username. In object selection, each object can be selected any number of times. A flow chart of the registration phase is shown in Fig. 3. During authentication, the user has to first give his username and textual password and then draw the pre-selected objects. These objects are then matched against the templates of all the objects stored in the database. A flow chart of the authentication phase is shown in Fig. 4. The phases during authentication (pre-processing, stroke merging, hierarchy construction, sketch simplification, feature extraction, and hierarchical matching) are the steps proposed by Wing Ho Leung and Tsuhan Chen in their paper [13]. They propose a novel method for the retrieval of hand drawn sketches from a database, finally ranking the best matches. In the proposed system, the user is authenticated only if the drawn sketch fully matches the selected object's template stored in the database. Pre-processing of hand drawn sketches is done prior to recognition and normally involves noise reduction and normalization. The noise in the input is generally due to the limited accuracy of human drawn images [14]. A number of techniques can be used to reduce noise, including smoothing, filtering, and wild point correction. In the proposed system, Gaussian smoothing is used, which eliminates noise introduced by the tablet or shaky drawing.
The Gaussian smoothing filter is G(x) = (1/(√(2π)σ)) e^(−x²/(2σ²)), or specifically in two dimensions G(u, v) = (1/(2πσ²)) e^(−r²/(2σ²)), where r is the blur radius (r² = u² + v²) and σ is the standard deviation of the Gaussian distribution. If the user draws a very large or a very small sketch, the system performs size normalization, which adjusts the symbols or sketches to a standard size. The stroke merging phase is used to merge strokes which are broken at end points. If the end points are not close, a stroke is considered an open stroke and it may
160
W.Z. Khan et al.
be merged with another open stroke if the end point of one stroke is close to an end point of the other. The strokes are then represented in a hierarchy to simplify the image and make it meaningful for further phases [13]. In the next step, sketch simplification, a shaded region is represented by a single hyper-stroke. After sketch simplification, three types of features are extracted from the user-drawn sketch: hyper-stroke features, stroke features, and bi-stroke features.
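The pre-processing steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the point format, kernel radius, and merging tolerance are all assumptions.

```python
import math

def gaussian_kernel(radius, sigma):
    """Discrete 1-D Gaussian kernel, normalized so the weights sum to 1."""
    ks = [math.exp(-(u * u) / (2 * sigma * sigma)) for u in range(-radius, radius + 1)]
    s = sum(ks)
    return [k / s for k in ks]

def smooth_stroke(points, sigma=1.0, radius=2):
    """Smooth a stroke (a list of (x, y) samples) to suppress tablet noise or shaky drawing."""
    kernel = gaussian_kernel(radius, sigma)
    smoothed = []
    for i in range(len(points)):
        x = y = 0.0
        for j, w in zip(range(i - radius, i + radius + 1), kernel):
            j = min(max(j, 0), len(points) - 1)  # clamp at the stroke endpoints
            x += w * points[j][0]
            y += w * points[j][1]
        smoothed.append((x, y))
    return smoothed

def merge_open_strokes(a, b, tol=5.0):
    """Merge open stroke b into a if an endpoint of a is within tol of an endpoint of b."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    if dist(a[-1], b[0]) <= tol:
        return a + b
    if dist(a[-1], b[-1]) <= tol:
        return a + b[::-1]
    return None  # endpoints not close: keep the strokes separate
```

A smoothed stroke keeps the same number of samples, and two strokes whose endpoints lie within the tolerance are concatenated into one.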
Fig. 3. Flow chart for Registration Phase
In the last step, hierarchical matching, the similarity is evaluated in a top-to-bottom hierarchical manner. The user is allowed to draw in an unrestricted manner. The overall process is demanding because free-hand sketching is a difficult job. The order in which the user selected the objects matters in our proposed system, i.e., during the authentication phase, the user must draw his pre-selected objects in the same order as he selected them during the registration phase. In this way, the total number of combinations for each password will be 2^n − 1, where n is the number of objects selected by the user as the password during the registration phase.
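The password-space figure can be evaluated directly. The sketch below simply computes the paper's 2^n − 1 formula (the number of non-empty subsets of the n registered objects); the function name is ours.

```python
def password_combinations(n):
    """Total combinations per the scheme's stated formula: 2^n - 1,
    where n is the number of objects selected at registration."""
    return 2 ** n - 1

# A minimum of three objects must be selected, so the space grows quickly:
for n in (3, 5, 10):
    print(n, password_combinations(n))
```

For the minimum of three objects this already gives 7 combinations, and the space doubles with every additional object.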
6 Comparison of the Proposed System with Existing Schemes

Our system offers many advantages over other existing systems, as discussed below. Compared with the “Passface” system, our system can also be used by those who are face-blind. We use objects instead of human faces as the password because, during the authentication phase, the user has to draw his/her password, and it is a much more difficult task to draw human faces than simple objects. We also believe that, compared with human faces, everyday objects are easier to remember. Our system eliminates the problems of grid-based techniques, in which the user has to remember exact coordinates, which is not easy; our system just compares the shapes of the objects drawn by the user during authentication. Our scheme is less vulnerable to brute-force attack because the password space is large. It is also less vulnerable to online and offline dictionary attacks. Since
Fig. 4. Flow chart for Authentication Phase
a stylus is used, it is easy for the user to draw objects, and it is also impractical to carry out a dictionary attack. Our scheme improves on the scheme of Man et al., in which the user has to remember both the objects and the associated string code; in our method, the user has to remember only the objects he selected as the password and the way he drew them during registration. Our proposed system differs from CDS in that the user has to first select a textual password and then a graphical password, making it more secure. Compared with van Oorschot's TwoStep authentication system, our system is more secure, since users not only select a graphical password but also draw it, making it difficult to hack; even if the textual password is compromised, the graphical password cannot be stolen or compromised, since the user also draws the graphical password. Our proposed system works in the same way as the TwoStep authentication system, i.e., the user has to choose a textual password before choosing a graphical password, but the difference is that in our system, during authentication, after giving the username and textual password, the user has to draw his graphical password, which is matched against the stored template drawn by the user during the registration phase. This approach protects the password from being hacked and prevents attackers from launching various attacks. Thus, our system is more secure and reliable than the TwoStep authentication system. As with all graphical-based systems, our system will also be slow; the normalization and matching take time. An important issue with our system is that it is somewhat user-dependent during authentication: it depends upon the user's drawing ability. The system may thus fail to verify the objects drawn by the user, and as a result, the actual user may not be authenticated.
The possible attacks on graphical passwords are brute-force attack, dictionary attacks, guessing, spyware, shoulder surfing, and social engineering. Graphical passwords are less vulnerable to all these attacks than text-based passwords, and it is believed that graphical passwords are more difficult to break using these traditional attack methods. Our system is resistant to almost all the possible attacks on graphical passwords.
7 Conclusion and Future Work

The core element of computational trust is identity. Many authentication methods and techniques are currently available, each with its own advantages and shortcomings. There is growing interest in using pictures as passwords rather than text, but very little research has been done on graphical passwords so far. In view of the above, we have proposed an authentication system based on graphical password schemes. Although our system aims to reduce the problems of existing graphical password schemes, it also has some limitations and issues, like all other graphical password techniques. To conclude, we need our authentication systems to be more secure, reliable, and robust, as there is always room for improvement. We are currently working on the system implementation and evaluation. In the future, other important aspects of our system's performance, such as user adoptability, usability, and security, will be investigated.
References
1. van Oorschot, P.C., Wan, T.: TwoStep: An Authentication Method Combining Text and Graphical Passwords. In: 4th International Conference, MCETECH 2009, Ottawa, Canada (May 4-6, 2009)
2. Authentication, http://www.objs.com/survey/authent.htm (last visited May 15, 2011)
3. Sobrado, L., Birget, J.C.: Graphical Passwords. The Rutgers Scholar, An Electronic Bulletin for Undergraduate Research, vol. 4 (2002), http://rutgersscholar.rutgers.edu/volume04/sobrbirg/sobrbirg.htm
4. Elftmann, P.: Secure Alternatives to Password-Based Authentication Mechanisms. Diploma Thesis, Aachen, Germany (October 2006)
5. Blonder, G.E.: Graphical password. U.S. Patent 5559961, Lucent Technologies, Inc., Murray Hill, NJ (August 1995)
6. Suo, X., Zhu, Y., Owen, G.S.: Graphical Passwords: A Survey. In: Proceedings of the Annual Computer Security Applications Conference (2005)
7. Approaches to Authentication, http://www.e.govt.nz/plone/archive/services/see/see-pki-paper-3/chapter6.html (last visited May 15, 2011)
8. Roman, V.Y.: User authentication via behavior based passwords. In: Systems, Applications and Technology Conference, Farmingdale, NY (2007)
9. Biometric Authentication, http://www.cs.bham.ac.uk/~mdr/teaching/modules/security/lectures/biometric.html (last visited May 02, 2011)
10. Gao, H., Ren, Z., Chang, X., Liu, X., Aickelin, U.: A New Graphical Password Scheme Resistant to Shoulder-Surfing. In: 2010 International Conference on CyberWorlds, Singapore (October 20-22, 2010)
11. Perrig, A., Song, D.: Hash Visualization: A New Technique to Improve Real-World Security. In: International Workshop on Cryptographic Techniques and E-Commerce, pp. 131–138 (1999)
12. Davis, D., Monrose, F., Reiter, M.K.: On User Choice in Graphical Password Schemes. In: 13th USENIX Security Symposium (2004)
13. Leung, W.H., Chen, T.: Hierarchical Matching for Retrieval of Hand-Drawn Sketches. In: Proceedings of the International Conference on Multimedia and Expo (ICME 2003), vol. 2 (2003)
14. Khan, H.Z.U.: Comparative Study of Authentication Techniques. International Journal of Video & Image Processing and Network Security (IJVIPNS) 10(04)
15. Token Based Authentication, http://www.w3.org/2001/sw/Europe/events/foaf-galway/papers/fp/token_based_authentication/ (last visited May 02, 2011)
16. Knowledge Based Authentication, http://csrc.nist.gov/archive/kba/index.html (last visited May 02, 2011)
17. Knowledge Based Authentication, http://searchsecurity.techtarget.com/definition/knowledge-based-authentication (last visited May 02, 2011)
18. A Survey on Recognition Based Graphical User Authentication Algorithms, http://www.scribd.com/doc/23730953/A-Survey-on-Recognition-Based-Graphical-User-Authentication-Algorithms (last visited May 02, 2011)
19. Jain, A., Bolle, R., Pankanti, S. (eds.): Biometrics: Personal Identification in Networked Society. Kluwer Academic, Boston (1999)
20. Hurson, A.R., Ploskonka, J., Jiao, Y., Haridas, H.: Security Issues and Solutions in Distributed Heterogeneous Mobile Database Systems. In: Advances in Computers, vol. 61, pp. 107–198 (2004)
21. Biddle, R., Chiasson, S., van Oorschot, P.C.: Graphical Passwords: Learning from the First Twelve Years. Carleton University - School of Computer Science, Technical Report TR-11-01 (January 4, 2011)
22. Weinshall, D.: Cognitive authentication schemes safe against spyware (short paper). In: IEEE Symposium on Security and Privacy (May 2006)
23. Hayashi, E., Christin, N., Dhamija, R., Perrig, A.: Use Your Illusion: Secure authentication usable anywhere. In: 4th ACM Symposium on Usable Privacy and Security (SOUPS), Pittsburgh (July 2008)
24. Davis, D., Monrose, F., Reiter, M.: On user choice in graphical password schemes. In: 13th USENIX Security Symposium (2004)
25. Passfaces Corporation: The science behind Passfaces. White paper, http://www.passfaces.com/enterprise/resources/white_papers.htm (last visited May 05, 2011)
26. De Angeli, A., Coventry, L., Johnson, G., Renaud, K.: Is a picture really worth a thousand words? Exploring the feasibility of graphical authentication systems. International Journal of Human-Computer Studies 63(1-2), 128–152 (2005)
27. Moncur, W., Leplatre, G.: Pictures at the ATM: Exploring the usability of multiple graphical passwords. In: ACM Conference on Human Factors in Computing Systems (CHI) (April 2007)
28. Pering, T., Sundar, M., Light, J., Want, R.: Photographic authentication through untrusted terminals. In: Pervasive Computing, pp. 30–36 (January-March 2003)
29. Wiedenbeck, S., Waters, J., Sobrado, L., Birget, J.: Design and evaluation of a shoulder-surfing resistant graphical password scheme. In: International Working Conference on Advanced Visual Interfaces (AVI) (May 2006)
30. Bicakci, K., Atalay, N.B., Yuceel, M., Gurbaslar, H., Erdeniz, B.: Towards usable solutions to graphical password hotspot problem. In: 33rd Annual IEEE International Computer Software and Applications Conference (2009)
31. Jermyn, I., Mayer, A., Monrose, F., Reiter, M., Rubin, A.: The design and analysis of graphical passwords. In: 8th USENIX Security Symposium (August 1999)
32. Valentine, T.: An Evaluation of the Passfaces™ Personal Authentication System. Technical Report, Goldsmiths College, University of London (1998) (the first report known in the literature)
Privacy Threat Analysis of Social Network Data

Mohd Izuan Hafez Ninggal and Jemal Abawajy

School of Information Technology, Deakin University, 3217 Victoria, Australia
{mninggal,jemal}@deakin.edu.au
Abstract. Social network data has increasingly been made publicly available and analyzed in a wide spectrum of application domains. The practice of publishing social network data has brought privacy concerns to the forefront, and serious concerns about privacy protection in social networks have been raised in recent years. Realizing the promise of social network data requires addressing these concerns. This paper considers privacy disclosure in social network data publishing. We present a systematic analysis of the various risks to privacy in the publishing of social network data and identify various attacks that can be used to reveal private information from such data. This information is useful for developing practical countermeasures against the privacy attacks. Keywords: Privacy disclosure, Social networks, Threat analysis, Data publications.
1 Introduction

Online social networking has become one of the most popular activities on the web [1-2]. The dramatic increase in the number, size, and variety of online social networks has generated interesting data management and data mining problems. An important concern in the release of these data for study is privacy, since social networks usually contain personal information. Privacy is an important issue when one wants to make use of data that involves individuals' sensitive information. While the immense potential and importance of social networks as a communication tool are widely acknowledged, many real-world social networks contain sensitive information, and serious privacy concerns have been raised. Social networks often contain private attribute information about individuals as well as their sensitive relationships. Simply removing or replacing identifying attributes such as name and SSN with meaningless unique identifiers is far from sufficient to protect privacy [3]. Serious concerns about privacy protection in social networks have been raised in recent years [4-5], particularly when social network data is published [3]. The increasing availability of rich social media, popular online social networking sites, and sophisticated data mining techniques has made privacy in social networks a serious concern. As a result, the thriving phenomenon of social networks has also attracted the attention of lawmakers. In a significant number of countries (primarily, but not exclusively, in Europe and North America) Data Protection Authorities have

Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 165–174, 2011. © Springer-Verlag Berlin Heidelberg 2011
started to scrutinize both the business models of social networks and the practices performed on such platforms. Such attention is usually reserved for general privacy issues (such as identity theft, online fraud, and data security) or for sector-specific data protection problems (such as financial transactions, e-commerce, electronic communication, and children's protection), but not yet for marketing performed on such platforms (the exceptions, with provisions specifically governing commercial communication on social networks, are a number of EU member countries). The US has in place statute law provisions or self-regulation guidelines specifically dealing with privacy issues on social networks. In most countries, the general provisions and requirements for processing personal data will apply to social networks. In some jurisdictions (including several non-European ones in North and South America as well as in Asia and Oceania) dedicated rules are in preparation and likely to come into force in the near future. In this paper, we consider privacy disclosure in social network data publishing. We present a systematic analysis of the various risks to privacy in published social network data. Privacy becomes an important issue when one wants to make use of data that involves individuals' sensitive information. Social network data often contains sensitive information about the users [4-6]. Simply removing all identifiable personal information (such as names and social security numbers) before releasing the data is insufficient. The conventional methods proposed for microdata privacy [7] cannot be used directly to ensure the privacy of social network data, due to complex dependencies between the data and various relationships. We will survey and analyze the various privacy leaks that can arise when social network data is published.
We identify various attacks, such as different structural queries, that can be used to reveal private information from social network data. This information is useful for developing countermeasures to reduce the risk of such attacks. The rest of the paper is organized as follows. Section 2 presents a high-level architecture of social networks and discusses the motivation for social network data publication. Section 3 analyzes various threats to social network data privacy. The classification of privacy breaches is presented in Section 4. Finally, Section 5 gives the conclusion.
2 Publishing Social Network Data

Social networks have become an important data source. Certainly, they have made data collection on individuals much easier. This new phenomenon has generated a wealth of data that is collected and maintained by the social media service providers. Sometimes it is necessary or beneficial to release such data to the public; the data generated by social media services is often referred to as social network data. In many situations, the data needs to be published and shared with others. The usefulness of social network data in capturing real-world social activities has attracted many parties demanding the data for analysis purposes. Social network analysis has been a key technique in modern sociology, geography, economics, and information science [3]. Researchers in sociology, epidemiology, and healthcare-related fields collect data about geographic, friendship, family, and sexual networks; social network data is useful to them for studying disease propagation and risk. In addition, there is also
increased interest by researchers and government institutions in mining social network data for information and security purposes [8]. All these pursuits require the data to be shared or published.
Fig. 1. High level system components of social network
Social networks describe entities (often people) and the relationships between them. Social network analysis is often used to understand the nature of these relationships, such as patterns of influence in communities, or to detect collusion and fraud. A high-level system component view of a social network is shown in Fig. 1. In the architecture, there are users, social media services, and third-party data recipients. Examples of social media services include collaborative projects, blogs, content communities, social networking sites, virtual game worlds, and virtual communities [9]. Facebook, currently the most popular social networking site, has more than 500 million active users, who spend over 700 billion minutes per month using the application [10]. A social media service user can be any real-world entity that uses the service, such as an individual or an organization. When users use an online social media service, they are usually asked to create a profile and give information about themselves. This information includes personally identifiable information, such as social security number, name, and phone number, which uniquely identifies a person. They may also give semi-identifiable information, such as home address, the former school they attended, or a former company they worked for, as well as private or sensitive information that users may wish to make available to selected entities while keeping it hidden from public view. Sensitive information can include religion, political view, type of disease (as in a healthcare network), or generated income (as in a financial network). On top of that, there is also data generated from the social activity on the services, some of which may carry sensitive information. All this information is kept and maintained by the service provider. In many situations, the data needs to be published and shared with others.
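As an illustration of these attribute categories, the hedged sketch below models a profile with identifying, semi-identifiable, and sensitive fields (all field names and values are hypothetical) and shows why stripping only the direct identifiers, as criticized above, leaves linkable information behind.

```python
from dataclasses import dataclass, replace

@dataclass
class Profile:
    # directly identifying attributes
    name: str
    ssn: str
    phone: str
    # semi-identifiable (quasi) attributes
    home_address: str
    former_school: str
    # sensitive attributes the user may want hidden
    religion: str
    political_view: str

def naive_anonymize(p: Profile) -> Profile:
    """Blank only the direct identifiers; quasi and sensitive attributes
    remain, which is why this alone does not guarantee privacy."""
    return replace(p, name="", ssn="", phone="")

p = Profile("Alice", "123-45-6789", "555-0100",
            "1 Main St", "Springfield High", "none", "centrist")
q = naive_anonymize(p)
# The quasi-identifiers survive and can still be linked to external data:
print(q.home_address, q.former_school)
```

The point of the sketch is that the "anonymized" record still carries the quasi-identifiers an adversary needs for linkage.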
The overall intent is for the data to be used for the public good, such as in the evaluation of economic models, the identification of social trends, and the pursuit of the state-of-the-art in various fields. Usually, such data contains personal information such as medical records, salaries, and so on, so a straightforward release of the data is not appropriate. For example, business companies analyze the social connections in social network data to uncover customer relationships that can benefit their services and product sales. The analysis of social network data is believed to potentially provide an alternative view of real-world phenomena, due to the strong connection
between the actors behind the network data and real-world entities [11-12]. In fulfilling the demand for the data, online social media operators have been sharing the data they maintain with external third parties such as business advertisers, application developers, and academic researchers. Therefore, releasing the data to third parties has to be done in a way that guarantees the privacy of the users. In other words, the data must undergo a privacy-preserving phase before being released to third parties. An online social media service provider who maintains the data may have a specific interest in particular analysis outcomes of its data but, lacking the in-house expertise to conduct the analysis, often outsources the task to external parties. In other situations, the owner of the data shares the data with third parties. For example, advertising partners tend to be interested in the sort of information held by the service provider; the data usually contains valuable information that can enable better social targeting of advertisements. The request to use the data can also come from third-party applications embedded in the social media application itself. For instance, Facebook has thousands of third-party applications, and the number is growing exponentially [13]. Even though the process of data sharing in this case is implicit, the data is indeed passed from the data owner (the service provider) to a different party (the application). The data given to these applications is usually not sanitized to protect users' privacy [14]. Since social network data usually contains users' private information, it is important to protect this information in any sharing activity.
When data is shared, the risk of privacy violation becomes a key problem both for those participating in activities on social networks (be it for private or for business purposes) and for the companies running such platforms. Thus, publishing the data may violate individual privacy. Individual privacy is defined as “the right of the individual to decide what information about himself should be communicated to others and under what circumstances” [15]. A privacy breach occurs when sensitive information about the user is disclosed to an adversary. There are well-known examples of unintended disclosure of private information, such as the AOL search data [16] and attacks on Netflix data [9]. The problem is challenging due to the diversity and complexity of the data and its relationships, about which an adversary can use many types of background knowledge to conduct an attack. In the following section, we analyze various threats to social network data. Although in many ways a user gives ‘consent’ when signing up to an online social network site, most are unaware of the implications of voluntarily providing personal information in profiles, and of how this information may be processed. The privacy implications associated with online social networking depend on the level of identifiability of the information provided, its possible recipients, and its possible uses [17]. According to a survey released for EU Data Protection Day 2010, almost 50% of human resources professionals active in Europe perform on-line checks on candidates, and 25% of all job applications are rejected based on the results of searches performed on candidates' on-line reputation and profiles, where rejection is usually grounded on ‘inappropriate comments’ or ‘unsuitable photos or videos’ found on the Internet.
Interestingly, the same survey shows that, on the other side, consumers continue to significantly underestimate the risks deriving from their on-line profiles: in the UK only 9% (in France 10% and in
Germany 13%) of job seekers held that personal information available about them on the Internet would influence the outcome of their applications. In the health care area, Personal Health Record (PHR) systems such as Google Health (health.google.com) allow users to store and manage personal information, including health information, emergency contacts, insurance plans, medications, immunizations, past procedures, test results, medical conditions, allergies, family histories, and lab results. Sharing of this information across user accounts is also supported. Placing detailed histories of health information online could expose users to significant risks [18]. Social finance network services like Wikinvest (www.wikinvest.com) are also changing the way finance is done. Unacceptable disclosure of these types of data can result in serious consequences for individuals, ranging from scams and frauds to physical threats.
3 Threat Analysis

In this section, we examine how a social network is translated into a data graph, what kind of sensitive information may be at risk, and how an adversary may launch an attack on individual privacy.

3.1 Data Representation
The data generated by social media services is usually viewed or represented as a graph that contains vertices and the connections between them. The network considered here is binary, symmetric, and without self-loops. Formally, a social network can be represented as G = (V, E), where V = {v1, v2, …, vn} is the set of vertices and E ⊆ V × V is the set of edges. We define d(v) as the degree of vertex v, which is the number of vertices connected to node v.
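A minimal sketch of this representation, assuming an adjacency-map encoding and a made-up edge set (the exact edges of the figure are not recoverable from the text):

```python
# Undirected, binary graph without self-loops, stored as an adjacency map.
# The edge set is illustrative only.
graph = {
    "Gary":    {"Fed", "Jemal", "Harinda", "Davood"},
    "Fed":     {"Gary", "Jemal"},
    "Jemal":   {"Gary", "Fed", "Davood"},
    "Harinda": {"Gary", "Andy"},
    "Davood":  {"Gary", "Jemal", "Bob"},
    "Andy":    {"Harinda"},
    "Bob":     {"Davood"},
}

def degree(g, v):
    """d(v): the number of vertices connected to v."""
    return len(g[v])

print(degree(graph, "Gary"))  # → 4
```

Because the graph is symmetric, each edge appears in the adjacency sets of both of its endpoints.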
Andy Eddie
Izuan
Claudia Bob Davood
Fed Gary Jemal
a) Original network
b) Naïve anonymized network
Fig. 2. Graphical social network representation
Fig. 2 illustrates an example of a social network as a graph. The vertices usually represent real-world actors or entities, such as individuals or organizations. Each vertex has a profile that usually contains personal attributes, such as name, gender, birth date, political view, religion, etc. These individuals are connected by edges that represent some sort of social tie or link between them. For example, in social networking sites, these edges represent the connected friends each member has.
Therefore, an edge can also have attributes that describe the properties of the connection. For example, in the content-communities type of social media, this attribute can be the content of the document being collaborated on, or the text comments and timestamps that a user made in his/her subscribed blogs. It may also be the type of relationship between two users in social networking sites (e.g., political affiliation).

3.2 Background Knowledge
The first step in understanding attacks on privacy is to know what external information about a graph may be acquired by an adversary. Background knowledge is the information that is known to the attacker and used by the attacker to perpetrate a privacy attack. In social network data, the information that could be used as background knowledge to intrude on user privacy comprises personal attributes and structural (or topological) attributes. A personal attribute is information that describes a person, such as name, address, date of birth, political view, etc. Some attributes act as identifiers in themselves, being unique to individuals. Other attributes act as semi-identifiers or quasi-identifiers. Several quasi-identifiers, when combined, can potentially identify a person. Therefore, they are usually exploited as mapping parameters to find a targeted individual in social network data.
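The following hedged sketch, with entirely fabricated records, shows how quasi-identifiers that are individually ambiguous can jointly isolate one individual:

```python
# Hypothetical published records with quasi-identifier columns only.
records = [
    {"id": 1, "city": "Geelong",   "birth_year": 1984, "gender": "M"},
    {"id": 2, "city": "Melbourne", "birth_year": 1984, "gender": "M"},
    {"id": 3, "city": "Geelong",   "birth_year": 1990, "gender": "F"},
    {"id": 4, "city": "Geelong",   "birth_year": 1984, "gender": "F"},
]

def match(records, **quasi):
    """Return the records consistent with the adversary's quasi-identifier knowledge."""
    return [r for r in records if all(r[k] == v for k, v in quasi.items())]

# One attribute alone is ambiguous...
print(len(match(records, city="Geelong")))  # 3 candidates
# ...but combining quasi-identifiers can isolate a single individual.
print(match(records, city="Geelong", birth_year=1984, gender="M"))
```

Each additional quasi-identifier shrinks the candidate set, which is exactly how an adversary maps a known person onto a released record.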
Fig. 3. Structural information of Gary
The structural attribute is information that describes how an entity is connected to other entities in social network data. This information includes: i. Degree – the number of direct social links or relationships that an entity has. Fig. 3a shows the social links that Gary has. This information does not carry direct sensitive information, but it can be used as an effective mapping parameter when searching for a target individual in a network. Furthermore, acquiring this information from the network is relatively easy; for example, in Facebook, the number of friends that appears in a user profile is the number of direct social ties the user has. This information does not uniquely represent an individual. However, in situations where the data range is small (e.g., involving only a specific social group in the network) the returned results could be very few or possibly unique. Using the degree, an adversary can map Gary to vertex 7 in the naïve anonymized network of Fig. 2b. ii. Neighborhood – this refers to a set of neighboring entities that have direct social links to a target entity and also have mutual links among themselves, as in Fig. 3b. If an adversary knows that Gary has four close friends, three of whom also know each other, the adversary can still map Gary to vertex 7 even though several other vertices share the same degree.
iii. Sub-graph – this refers to a set of relationships in which the target entity is involved, forming a subset of the whole graph (Fig. 3c). Assuming an adversary knows this richer information is a rather strong assumption. However, an adversary may create a set of dummy profiles and create social links between these profiles with certain patterns. The adversary then uses those dummy profiles to establish a social link to the target individual; the link can be established simply by adding the target to a friend list or address book. Alternatively, the adversary can construct a coalition with other friends, forming a small, uniquely identifiable sub-graph. Knowing the specific pattern of relationships that he/she purposely created, the attacker later uses that pattern to locate the target individual in the released data. These are known as the active attack and the passive attack, respectively [5]. iv. Network graph metrics – a network graph has many metrics, some of which can implicitly reveal an individual. For instance, in a closed community network such as a political movement group, the centrality metric could potentially reveal the leader of the group. In another situation, an individual may not know about the other relationships of his/her neighbors; the closeness metric could surprise the individual if it reveals that some of their neighbors have unexpected affiliations. If a social network is published simply as a graph with no other information (such as node attributes), re-identification attacks are still possible. It has been shown that if an attacker (or a set of attackers) participates in the social network, then in many cases it is possible for the attacker to identify the nodes corresponding to accounts under his control [5].

3.3 Data Mapping Mechanism
Adversaries normally access the data by performing queries, usually with several parameters such as auxiliary information. Structural queries are a series of knowledge queries that provide answers about restricted knowledge of a target node in the network [19]. These queries exploit the structural information that may be available to an adversary, including complete and partial descriptions of vertex neighborhoods, and connections to hubs in the network. i. Vertex refinement queries – this is a locally expanding structural query. It describes the structure of the local neighborhood from a vertex's perspective (a targeted individual's perspective) in an iterative way. The weakest knowledge query, H0(v), simply returns the attribute information of the vertex v; for an unlabelled graph, it returns the empty label. H1(v) returns the degree (the number of social links) that vertex v has; H2(v) returns the degree of each neighbor of vertex v. These iterative queries can be defined as Hi(v) = {Hi−1(u1), Hi−1(u2), …, Hi−1(um)}, where u1, …, um are the vertices adjacent to v; that is, Hi(v) returns the multiset of Hi−1 values of the vertices adjacent to v.
172
M.I.H. Ninggal and J. Abawajy
Example 3.1. Fig. 3 shows the computation of H0, H1, and H2 for each node. H0 is the vertex label of each vertex; in this case the graph is unlabelled, so the H0 values are uniform for all vertices. Assume the targeted individual is Gary in Fig. 1a; then H1(Gary) = {4}, which is Gary's degree, and H2(Gary) = {2, 2, 3, 3}, which represents Gary's neighbors' degrees.
ii. Sub-graph queries – these queries identify the existence of a sub-graph around the targeted vertex by counting the number of edges in the described sub-graph. With these queries, the adversary is assumed to be able to gather some fixed number of social links focused around the target v. By exploring the neighborhood of v, the adversary learns of the existence of a sub-graph around v, representing partial information about the structure around v.
iii. Hub fingerprint queries – these queries give information about how a vertex is linked to a set of selected hubs in the network. A hub is identified as a vertex with high degree and high betweenness centrality; hubs are often outliers in a network. For example, in social networking sites like Facebook, a hub may correspond to a very famous person with a very high number of social links. These queries represent a range of structural information that may be available to adversaries, including complete and partial descriptions of a vertex's local neighborhood, and a vertex's connections to hubs in the network.
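The vertex refinement queries above can be sketched in a few lines of Python. The adjacency list below is a hypothetical stand-in for the graph of Fig. 1a (not reproduced here), constructed only so that Gary's degree signature matches Example 3.1:

```python
from collections import Counter

# Hypothetical adjacency list standing in for Fig. 1a: Gary has degree 4
# and his neighbors have degrees 2, 2, 3, 3.
graph = {
    "Gary": ["Ada", "Bob", "Carl", "Dana"],
    "Ada":  ["Gary", "Carl"],
    "Bob":  ["Gary", "Dana"],
    "Carl": ["Gary", "Ada", "Dana"],
    "Dana": ["Gary", "Bob", "Carl"],
}

def H(i, v):
    """Iterative vertex refinement query Hi(v) in the sense of [19]."""
    if i == 0:
        return None                      # unlabelled graph: H0 returns null
    if i == 1:
        return Counter([len(graph[v])])  # degree of v, as a multiset
    # Hi(v) = multiset of Hi-1 over the vertices adjacent to v
    result = Counter()
    for z in graph[v]:
        result += H(i - 1, z)
    return result

print(sorted(H(1, "Gary").elements()))  # [4]
print(sorted(H(2, "Gary").elements()))  # [2, 2, 3, 3]
```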
4 Classification of Privacy Breach

Privacy breaches in social networks can be categorized into identity disclosure, sensitive link disclosure and sensitive attribute disclosure [20-21].

Identity disclosure happens when an adversary is able to map a record to a specific individual. Identity disclosure may be considered the key privacy violation in social networks, because it usually leads to the disclosure of content information as well as information about the relationships the individual has. It can also reveal an individual's existence in a closed community network where he/she has a strong expectation that membership remains private. For example, Facebook allows its users to create invitation-only network groups, whose themes can range from secret societies to political movements to religious purposes. Revealing someone's existence in such a group would therefore also violate their privacy.

In the sensitive link disclosure attack, the relationships between two individuals are revealed. A link among vertices in social network data can symbolize a relationship between individuals or organizations; this information is generated from social activities when using social media services. Some relationships are safe for the public to know, but individuals may not be prepared to reveal certain relationships they have. An adversary may want to know the degree of relationship between two entities. The disclosure occurs when the adversary is able to find out the existence of a relationship between two users that the involved individuals prefer to keep private. For example, in social network data, based on the friendship relationships of a person and the public preferences of the friends, such as political affiliation, it may be possible to infer the personal preferences of the person in
Privacy Threat Analysis of Social Network Data
173
question as well. Two entities in a social network may have many connections; some are safe for the public to know, while others should remain private. If the relationship between two individuals can be determined via a certain path, then their privacy is compromised.

In the sensitive attribute disclosure attack, the sensitive data associated with each vertex or edge is compromised. Attribute disclosure occurs when an adversary is able to determine the value of a sensitive user attribute that the user intended to keep private. Sensitive attributes may be associated with an entity as well as with a link relationship. At the application level, the visibility of attribute information is often variable: a member's profile can be set to be viewable publicly or only by limited people in the network. In social network sites, the content that is commonly publicly viewable is usually about hobbies and interests. However, certain applications require the user to give specific information according to the application's theme. In a health-based application, there may be information such as drinking and drug habits or type of disease that the user provides in the profile for monitoring purposes by other users of the system, such as a doctor. In an online sexual network, there is sexual information such as preferences and orientation. Meanwhile, there is also sensitive information generated from the interaction between users. For example, in messaging networks and email, the sensitive content usually comprises the text message, the timestamp, the frequency of interaction and other information corresponding to both parties. Users usually have a strong expectation that this information is kept private [5].
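The friends-based inference described above can be illustrated with a short sketch; the friendship graph and the publicly disclosed affiliations below are invented for illustration:

```python
from collections import Counter

# Hypothetical published data: friendships plus the affiliations that
# some users chose to make public.
friends = {"alice": ["bob", "carol", "dave"]}
public_affiliation = {"bob": "party_x", "carol": "party_x", "dave": "party_y"}

def infer_affiliation(user):
    """Guess a hidden attribute from the majority value among friends."""
    votes = Counter(public_affiliation[f] for f in friends[user]
                    if f in public_affiliation)
    return votes.most_common(1)[0][0] if votes else None

print(infer_affiliation("alice"))  # party_x
```

Even though alice never disclosed an affiliation, the majority vote over her friends' public attributes exposes it, which is exactly the sensitive attribute disclosure discussed above.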
5 Conclusion

Although in many ways a user gives 'consent' when signing up to an online social network site, most users are unaware of the implications of voluntarily providing personal information on profiles, and of how this information may be processed. The privacy implications of online social networking depend on the level of identifiability of the information provided, its possible recipients, and its possible uses. In this paper, we considered privacy disclosure in social network data publishing. We presented a systematic analysis of the various risks to privacy in the publishing of social network data, and identified various attacks that can be used to reveal private information from such data. This information is useful for developing practical countermeasures against privacy attacks.
References

1. Alexa: The top 500 sites on the web (2011), http://www.alexa.com/topsites
2. Bonneau, J., Preibusch, S.: The privacy jungle: On the market for data protection in social networks. Economics of Information Security and Privacy, 121–167 (2010)
3. Zhou, B., Pei, J., Luk, W.: A brief survey on anonymization techniques for privacy preserving publishing of social network data. ACM SIGKDD Explorations Newsletter 10(2), 12–22 (2008)
4. Kleinberg, J.M.: Challenges in mining social network data: processes, privacy, and paradoxes. ACM, New York (2007)
5. Backstrom, L., Dwork, C., Kleinberg, J.: Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. ACM, New York (2007)
6. Srivastava, J., et al.: Data mining based social network analysis from online behavior (2008)
7. Fung, B., et al.: Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys (CSUR) 42(4), 1–53 (2010)
8. Rosenblum, D.: What anyone can know: The privacy risks of social networking sites. IEEE Security & Privacy, 40–49 (2007)
9. Facebook: Facebook Statistics (2011), http://www.facebook.com/press/info.php?statistics
10. Kaplan, A.M., Haenlein, M.: Users of the world, unite! The challenges and opportunities of Social Media. Business Horizons 53(1), 59–68 (2010)
11. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393(6684), 440–442 (1998)
12. Watts, D.J.: Networks, dynamics, and the small-world phenomenon. American Journal of Sociology, 493–527 (1999)
13. Narayanan, A., Shmatikov, V.: De-anonymizing social networks. IEEE, Los Alamitos (2009)
14. Felt, A., Evans, D.: Privacy protection for social networking APIs. In: 2008 Web 2.0 Security and Privacy, W2SP 2008 (2008)
15. Westin, A.F.: Privacy and Freedom, London, vol. 97 (1967)
16. Hansell, S.: AOL removes search data on vast group of web users. New York Times 8, C4 (2006)
17. Gross, R., Acquisti, A.: Information revelation and privacy in online social networks. ACM, New York (2005)
18. Williams, J.: Social networking applications in health care: threats to the privacy and security of health information. ACM, New York (2010)
19. Hay, M., et al.: Resisting structural re-identification in anonymized social networks. Proceedings of the VLDB Endowment 1(1), 102–114 (2008)
20. Liu, K., et al.: Privacy-preserving data analysis on graphs and social networks. Next Generation of Data Mining, 419–437 (2008)
21. Zheleva, E., Getoor, L.: Preserving the privacy of sensitive relationships in graph data. Springer, Heidelberg (2007)
Distributed Mechanism for Protecting Resources in a Newly Emerged Digital Ecosystem Technology Ilung Pranata, Geoff Skinner, and Rukshan Athauda University of Newcastle, School of Design, Communication and IT, University Drive, Callaghan NSW, Australia {Ilung.Pranata,Geoff.Skinner,Rukshan.Athauda}@newcastle.edu.au
Abstract. A Digital Ecosystem (DE) is characterized as an open and dynamic environment in which interaction and collaboration between its entities are highly promoted. A major requirement for promoting such intensive interaction and collaboration in a DE environment is the ability to secure and uphold the confidentiality, integrity and non-repudiation of shared resources and information. However, current developments of such security mechanisms for protecting shared resources are still in their infancy. Most of the proposed protection frameworks do not provide a scalable and effective mechanism for engaging multiple interacting entities to protect their resources. This is an even greater issue when multiple resources are exchanged and shared in an open and dynamic environment. Therefore, we propose a distributed mechanism for individual enterprises to manage their own authorization processes and resource access permissions, with the aim of providing rigorous protection of entities' resources.

Keywords: Authentication, authorization, digital ecosystem.
1 Introduction

Since its first introduction in 2002, the emerging concept of the Digital Ecosystem (DE) has attracted considerable attention from researchers, businesses, ICT professionals and communities around the world. The concept is aimed at achieving a set of goals determined at the Lisbon summit of March 2000, which primarily focused on the dynamic formation of a knowledge-based economy [1]. Further, the knowledge-based economy will lead to the creation of more jobs and greater social inclusion, sustaining world economic growth [2]. DE is a multi-dimensional concept that encompasses several current technology models such as collaborative environments [3], distributed systems [4], and grid technology [5]. The combination of concepts from these models gives a DE environment the ability to deliver an open, flexible and loosely coupled resource-sharing environment. On the other hand, this combination also raises several complicated security issues which need to be addressed before the full implementation of the DE concept. Unfortunately, evaluation of the DE security dimensions in the current literature reveals a number of deficiencies in its security architecture, particularly in protecting enterprise resources and information. There is a need for a comprehensive resource protection solution that is able to provide a

Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 175–185, 2011.
© Springer-Verlag Berlin Heidelberg 2011
strong and rigorous mechanism to safeguard critical resources and, further, to reduce the possibility of information leakage to unauthorized parties. A key challenge for enterprises involved in a DE environment is to determine the right users who may access the services, resources, knowledge and information these enterprises host. This challenge arises for several reasons: first, the multiplicity of resources published and shared by each enterprise in a DE environment; and second, the fact that various clients are able to access each individual resource. For these reasons, enterprises urgently need a mechanism that effectively manages their clients' access control and authorization permissions in order to protect their resources. In this paper, we attempt to deliver a comprehensive framework allowing enterprises to protect their resources and information from any unauthorized use.
2 Related Work

In a DE environment where multiple interacting entities exist, the effort required to enforce strong authentication and authorization mechanisms is extensive. We identify three core issues that make enforcing such mechanisms challenging. First, as the DE community expands to incorporate more entities, resource providers face a challenge in identifying the legitimate entities that may access their resources. Second, the fact that each entity has a different set of access permissions over multiple resources further complicates the issue. Third, each resource provider will probably host multiple resources and services in a DE environment. This situation, in turn, makes it difficult to authorize the right entities for the right resources with the right permissions. Failure to assign the right permissions to entities would compromise the usage of resources and bring negative impact to the resource provider.

Current Internet mechanisms are still far from adequate to provide reliable authentication and authorization processes for a DE environment. This view is reflected in our literature analysis of a number of Internet mechanisms. The most prominent mechanism for managing client credentials is the implementation of an Identity Provider (IdP) or Credential Provider [6, 7]. An IdP mainly focuses on storing client credentials and providing them to resource providers for their client authentication processes. On every authentication process, the resource provider requests the client credential from its trusted IdP when it receives an access request from a client. In later developments, several technology standards such as SAML [8] and Liberty Alliance [9] were adopted in this mechanism to provide federation of multiple entities for Single Sign-On (SSO) services.
Similarly, the Credential Server (CRES) [10] and the Grid Security Infrastructure (GSI) MyProxy [11] utilize the IdP concept, and further leverage it for a large number of servers dispersed over a wide geographical area. Both mechanisms store clients' credentials on a local server; however, authentication of a remote client can be facilitated by requesting his credential from the trusted remote server. In both the MyProxy and CRES mechanisms, the resource provider requests the client credential from the local server on every authentication process. The local server then creates a certificate token which contains the client information. Subsequently, the
certificate token is sent to the resource provider as acknowledgement of the authentic client. When the resource provider receives the token, it allows the client to access the resources based on the trust established with the publisher of the token.

SSL/TLS technology [12] has been extensively used in e-commerce transactions for secure authentication and communication. This technology is designed with a high reliance on a Certificate Authority (CA) to ensure the legitimacy of an entity; therefore SSL/TLS also represents centralized credential management. Although these approaches could be deployed in a DE environment, the conspicuous issue of single-server failure must be carefully considered. In an event where the credential provider server is down, there could be chaos in a DE community due to the unavailability of credential services for client authentication.

Our literature review further identifies that several Internet authorization mechanisms take a similar approach to the authentication mechanisms. The most prominent authorization mechanisms, such as CAS [13], Akenti [14], and PMI [15], utilize a central server to assign multiple access permissions to clients individually, although their implementations differ from each other. These mechanisms also inherit several issues pertinent to the central management of authorization permissions. First, central management faces real issues with bottlenecks and failures on its servers. A security breach would occur if the central servers fail to perform their authorization processes over the clients. Although it is possible to replicate the central server, the replication process brings abundant administrative issues, considering the huge amount of data that needs to be replicated. Second, challenges occur when the central server attempts to assign access permissions to the DE member entities.
With a large number of resource providers each hosting one or more resources, the central server needs to register each resource and its access permissions individually. This situation becomes even more challenging because a single resource can be associated with multiple different access permissions, and each client may have different access permissions assigned to him. Central management is therefore impractical when there is a huge number of entities in a DE environment. Third, serious administration issues arise as a DE environment grows in size and diversity. A central server would bear a huge burden in managing all client and resource provider accounts and permissions, even with the use of supercomputers or grid collections of computers.

Several DE publications clearly state that a DE is characterized as an open environment in which centralized structure is minimized. A DE must be engineered to provide a highly resilient infrastructure while avoiding single points of control and failure [16, 17]. Therefore, a completely distributed control mechanism, immune to super-control failure, is required. It is evident that the aforementioned Internet mechanisms are inappropriate for implementation in a DE environment due to their centralized management. In this paper, we propose a solution that manages authentication and authorization in a fully distributed approach, in which each entity manages the authentication and authorization mechanism with the utilization of a capability token. We term our solution the Distributed Resource Protection Mechanism (DRPM) [18, 19]. In this paper, we enhance our solution by eliminating its reliance on a central credential server and further secure the mechanism by utilizing the Public Key Infrastructure [15].
3 Overview of DRPM

3.1 Identifying an Entity through the Client Profile

The present mechanism for service discovery in a DE environment requires a client to search for resources via a semantic discovery portal through a browser or rich application [20]. This discovery portal searches and lists all resources provided by DE resource providers. Once the client finds the resource, it contacts the resource provider and requests that resource. At this stage, the resource provider does not know any information about this client or its intended purpose for the resource. This may put the resource at risk, as it may contain highly sensitive information which must be protected from misuse and malicious acts. Therefore, it is crucial for a resource provider to know about its client before any access to the resource is granted. Taking this into consideration, we adopted the creation of a client profile, which aims to capture all required, but voluntarily provided, information about a client. The information contained in a client profile provides the necessary data about who the client is and about their intentions and purpose in using the requested resources. The aim of implementing a client profile is to assure the resource provider that resources are not going to the wrong entities, and further to uphold the confidentiality and integrity of the resources. The use of client profiles also facilitates auditing of who is accessing a resource; for example, a resource provider may need to trace back which client was delegated access to a resource because of an incident involving a dispute or counterfeiting of the resource. In order to fully implement the client profile, a client registration portal is employed in DRPM, through which the client profile is generated.
Further, resource providers are able to customize the registration portal to contain only the information which is important to them. New clients wishing to access a specific resource are initially redirected to this portal. If they wish to access the resource, they must fill in all the information required by the resource provider to produce a client profile. Once produced, the client profile is stored in the resource provider's repository. This procedure provides an additional, enhanced method for determining who is accessing a particular resource at a particular time inside a DE environment.

3.2 Storing Permissions in a Capability Token

It is always a challenge to enforce client access permissions on the available resources within a DE environment, owing to the large number of entities that actively interact in it. Further, these entities may make the same request for a particular resource either at the same or at different times. To solve the issue of managing multiple resource access permissions over a diverse range of DE clients, we utilize and further evolve the concept of the capability, introduced by the CAS server used in collaborative environments. In CAS, a capability is used to store all access rights of a user as determined by a community policy. The implementation of the capability in our framework is, however, slightly different from that in a CAS server. In our framework, the capability contains all the necessary
right permissions for each client to perform a set of operations on a particular resource. The capability is produced by the resource provider that hosts the resource; it is used to grant the client access to the resource and facilitates the authorization process for the clients. Once a client profile is created, a list of client authorization permissions is assigned into the capability token. The client's access permissions and policies are expressed in XML [21] due to its simplicity, wide usability and self-descriptive characteristics. Our basic design of a capability token contains the client profile identifier, resource provider identifier, resource identifier and list of access permissions. A time-stamp can be included in the capability token to determine the validity period of a client when accessing the resources. In the event that the trustworthiness of a new client is equivocal, a short-lived capability token can be issued; once the trustworthiness of the client gradually increases, the resource provider can replace the short-lived token with one of longer time-stamp validity. Additionally, the Uniform Resource Locator (URL) of the resource is embedded in the token to provide automatic and seamless connection to resource servers. Once a capability token is created, it is disseminated to the requesting client. Every time a client makes a request to the resource provider, the client sends back its initially configured capability token. The resource provider then authenticates the client's capability token and grants access based on the permissions listed in it.
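As an illustration of the token layout just described, here is a minimal sketch built with Python's standard `xml.etree.ElementTree`. The element names, identifiers and permissions are our own assumptions for illustration, not a schema defined by DRPM:

```python
import time
import xml.etree.ElementTree as ET

def build_capability_token(client_id, provider_id, resource_id,
                           resource_url, permissions, ttl_seconds):
    """Assemble a capability token with the fields sketched in Section 3.2."""
    token = ET.Element("capability")
    ET.SubElement(token, "clientProfileId").text = client_id
    ET.SubElement(token, "resourceProviderId").text = provider_id
    resource = ET.SubElement(token, "resource", id=resource_id)
    ET.SubElement(resource, "url").text = resource_url   # embedded resource URL
    perms = ET.SubElement(token, "permissions")
    for p in permissions:
        ET.SubElement(perms, "permission").text = p
    # Time-stamp bounding the token's validity period (short-lived tokens
    # simply get a small ttl_seconds).
    ET.SubElement(token, "notAfter").text = str(int(time.time()) + ttl_seconds)
    return ET.tostring(token, encoding="unicode")

xml_token = build_capability_token(
    "client-042", "rp-007", "res-123",
    "https://rp.example/res/123", ["read", "annotate"], ttl_seconds=3600)
print(xml_token)
```

The resource provider would sign and encrypt this document before dissemination, as described in Section 4.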
4 Developing a Secure DRPM Workflow

In this section, we present a secure DRPM which provides a strong authentication and authorization mechanism while upholding the confidentiality, integrity and non-repudiation of resources. The following notation will be used to define the secure DRPM:

• Cl: Client that requests the resources.
• RP: Resource Provider that hosts the resources.
• PKi: Public key of i.
• SKi: Secret key of i.
• Clcp: Capability token of the client.
• Si(x)i: Object x signed with the private key of i.
• E[x]j: Object x encrypted with the public key of j.
• ATCl: Authentication token of the client.
• SyKi: Symmetric key passphrase of i.
• i → j: {x1, …, xn}: A message sent from i to j with contents x1 to xn.
given that:

• PKi ↔ SKi: the public key of i corresponds only to the secret key of i; therefore, Si(x)i can only be verified with PKi, and E[x]i can only be decrypted with SKi.
4.1 Securing the Registration Workflow

The DRPM registration portal is used to generate a client profile during the initial resource provisioning. The portal captures the client's information and, possibly, his reasons for accessing the resources. The registration process comprises three main stages: client registration, public key exchange, and secure transfer of the capability token. The resource provider's endorsed certificate is utilized to identify the authentic resource provider based on its community-endorsed public key certificate, which is discussed in the next sub-section. Public Key Infrastructure (PKI) is used to provide secure communication between the client and the resource provider. Figure 1 shows the principal workflow for securing the three stages of the registration process.
Fig. 1. DRPM secure registration workflow
The registration steps are detailed below:
1. A new client contacts the resource provider to request a resource (Cl → RP). The resource provider sends its WoT-endorsed public key to the client (Cl ← RP: {PKRP}). Once the client determines and accepts the trustworthiness of the public key, he stores the resource provider's trusted public key and fills in his information on the registration portal.
2. After the client information is filled in, the registration portal builds a unique client profile which identifies the client, and sends this profile to the repository server.
3. The resource provider then requests the client certificate and stores the client public key in its repository (Cl: {PKCl} → RP). If required, WoT verification can be performed on the client certificate to ensure the trustworthiness of the client.
4. The resource provider generates a client capability token based on the client's allowed permissions.
5. The resource provider uses its own private key to sign the capability token (SKRP + Clcp = Si(Clcp)RP). The SHA algorithm is used to hash the capability token. This process upholds the integrity of the capability token over the untrusted network.
6. The resource provider then uses the client's public key, received in step 3, to encrypt the signed message (PKCl + Si(Clcp)RP = E[Si(Clcp)RP]Cl) and sends it to the client endpoint (Cl ← RP: {E[Si(Clcp)RP]Cl}).
7. The client uses his own private key to decrypt the encrypted capability token (E[Si(Clcp)RP]Cl - SKCl = Si(Clcp)RP). This ensures the confidentiality of the capability token: the token is considered breached if the client cannot decrypt the message.
8. The client then uses the resource provider's public key to recover the capability token from the signed message (Si(Clcp)RP - PKRP = Clcp). This ensures that the client receives the capability token from the genuine resource provider, unchanged.

Note that at the final step of the registration process, the client holds his capability token and the public key retrieved from the resource provider; both are stored in the client repository for future communication and resource access. At the other endpoint, the resource provider stores the client's public key in its own repository. The combination of encryption and hashing mechanisms upholds the confidentiality, integrity and non-repudiation of the capability token during its transfer.

4.2 Fine-Grained Resource Access Workflow

Once a client has been successfully registered with the resource provider, the client presents his capability token to the resource provider on every access request. The capability token, which contains the client assertions and authorization permissions, is the basis on which the resource provider grants resource access, i.e., authenticates and authorizes the client. The three foremost protection requirements for resource access are identification of the resource provider, secure transfer of the capability token, and authentication of the requesting client.
A detailed workflow that ensures security protection on each resource access is provided in Figure 2. The steps are as follows:
1. The client looks in his repository for the capability token of the intended resource provider and retrieves it. The capability token contains the client access permissions and the resource URL. At this stage, the client also chooses a symmetric pass key, which will be shared with the resource provider, and generates the authentication token consisting of the symmetric pass key and the capability token (Clcp + SyKCl = ATCl).
2. The client uses his private key to sign the authentication token (SKCl + ATCl = Si(ATCl)Cl). The signing process is essential to uphold non-repudiation of the capability token.
3. The client then encrypts the signed token using the resource provider's public key (PKRP + Si(ATCl)Cl = E[Si(ATCl)Cl]RP) and sends the encrypted message to the resource provider (Cl: {E[Si(ATCl)Cl]RP} → RP).
4. When the resource provider receives the encrypted message, it uses its own private key to decrypt the message and retrieve the signed token (E[Si(ATCl)Cl]RP - SKRP = Si(ATCl)Cl).
Fig. 2. DRPM resource access protection
5. The resource provider then verifies the signature using the client's public key (Si(ATCl)Cl – PKCl = ATCl). It then verifies the integrity of the capability token by regenerating its hash with the SHA algorithm.
6. The resource provider retrieves the access permissions listed in the capability token.

Note that in step 1 of the workflow the client determines a symmetric pass key. This pass key is used to generate a symmetric key for further communication once the capability token authentication and authorization processes succeed. In an event where the capability token is stolen through a man-in-the-middle attack, the unauthorized entity will still not be able to access the resource, because the symmetric key passphrase is shared only between the legitimate client and the resource provider. If there is a security breach in which the resource provider generates a new public-private key pair, the client will not be able to decrypt using his current copy of the resource provider's public key, and must request the new public key.

PKI is extensively utilized in the DRPM resource workflow. The public keys retained by both client and resource provider during the registration process are re-used to provide the confidentiality and integrity of the capability token. PKI is primarily adopted during the initial handshake and capability token transfer. Because PKI requires heavier computation, we suggest using a symmetric key for transferring data after the authentication and authorization process. The symmetric key is incorporated into the authentication token message before encryption; resource exchange is then encrypted with this symmetric key over the untrusted network.
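The sign-then-encrypt exchange of steps 1–5 can be modeled symbolically. The sketch below is not real cryptography: a signature is modeled as a SHA-256 digest tagged with the signer's identity, and encryption as a wrapper checked against the recipient's identity. It only illustrates the ordering and verification logic of the workflow, with a made-up token payload:

```python
import hashlib

def _digest(obj):
    return hashlib.sha256(repr(obj).encode()).hexdigest()

def sign(x, sk):                 # Si(x)i: sign x with i's secret key
    return {"payload": x, "signer": sk["owner"], "digest": _digest(x)}

def verify(signed, pk):          # recover x, checking signer identity and hash
    assert signed["signer"] == pk["owner"], "wrong signer"
    assert signed["digest"] == _digest(signed["payload"]), "integrity failure"
    return signed["payload"]

def encrypt(x, pk):              # E[x]j: encrypt x for recipient j
    return {"ciphertext": x, "recipient": pk["owner"]}

def decrypt(enc, sk):            # only the recipient's SK opens it
    assert enc["recipient"] == sk["owner"], "not the intended recipient"
    return enc["ciphertext"]

# Key pairs identified by owner only (symbolic stand-ins for RSA keys).
SK_Cl, PK_Cl = {"owner": "Cl"}, {"owner": "Cl"}
SK_RP, PK_RP = {"owner": "RP"}, {"owner": "RP"}

# Client side (steps 1-3): ATCl = Clcp + SyKCl, signed then encrypted.
at_cl = {"capability": {"resource": "res-123", "perms": ["read"]},
         "syk": "shared-passphrase"}
message = encrypt(sign(at_cl, SK_Cl), PK_RP)     # E[Si(ATCl)Cl]RP

# Resource provider side (steps 4-5): decrypt, then verify the signature.
recovered = verify(decrypt(message, SK_RP), PK_Cl)
assert recovered == at_cl
```

The same pattern, with the roles swapped, models steps 5–8 of the registration workflow, where the resource provider signs the capability token and encrypts it for the client.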
Distributed Mechanism for Protecting Resources

6 Implementation Strategies and Scalability Testing

Due to the limitation on the length of this paper, this section provides only a brief review of the implementation strategies and the scalability performance of the proposed DRPM mechanism. The DRPM prototype was divided into two major applications: the resource provider application and the client application. The resource provider application was built from three main system components: the listener module, the registration page, and the resource page. The main tasks of these components are to listen for incoming connections from clients, to automatically create the client profile and capability token, to securely exchange information, and to host multiple resources. In contrast, the client application is primarily used to securely register for and access the hosted resources. We tested the scalability of the prototype's listener server component in handling multiple HttpWebRequest requests. The test was conducted with the Apache JMeter [21] tool, which specializes in web scalability and performance testing.
Fig. 3. DRPM listener component scalability testing
In our test bed, 1000 users were generated to access the listener component concurrently. Each user accessed the listener component either to register or to access resources. Our test shows that the average elapsed time was 162 ms, with an aggregate highest elapsed time of 327 ms for the resource access process and an aggregate lowest elapsed time of 5 ms for the registration process. The highest elapsed times were primarily due to the encryption/decryption and hash verification of the capability token.
7 Conclusion

In this paper we have highlighted the need to protect enterprise resources from unauthorized use in a Digital Ecosystem (DE) environment. Further, we have analysed the appropriateness of several existing security mechanisms for DE. After a thorough analysis, we found a number of deficiencies in the current mechanisms' ability to provide strong community protection. Therefore, we propose the Distributed Resource Protection Mechanism (DRPM) for DE to provide comprehensive resource protection. DRPM can be classified as a new approach to facilitating the authorization process for enterprises that request specific resources or information. DRPM emphasizes a decentralized authorization mechanism that is performed by each resource provider. This is achieved by utilizing the client profile and capability token for authentication and authorization permissions. Several future works, such as a security analysis of DRPM and further scalability testing of the prototype, are needed to ensure strong protection for DE member entities. Further, investigation of an effective trust mechanism to improve the overall DRPM security is critically needed. Our proposal incorporates the Web of Trust (WoT) to actively engage the community in protecting the resources. As trust is critical in DRPM for building the confidence of entities to interact and share their resources, a close analysis of the applicability of WoT to developing effective trust management is desired.
References

1. Nachira, F., Dini, P., Nicolai, A.: A Network of Digital Business Ecosystems for Europe: Roots, Processes and Perspectives. European Commission, Bruxelles, Introductory Paper (2007)
2. Dini, P., Darking, M., Rathbone, N., Vidal, M., Hernandez, P., Ferronato, P., Briscoe, G., Hendryx, S.: The Digital Ecosystems Research Vision: 2010 and Beyond. European Commission, Bruxelles, Position Paper (2005)
3. Ballesteros, I.L.: New Collaborative Working Environments 2020. European Commission, Report on industry-led FP7 consultations and 3rd Report of the Experts Group on Collaboration@Work (2006)
4. van Steen, M., Homburg, P., Tanenbaum, A.S.: Globe: A Wide Area Distributed System. IEEE Concurrency 7, 70–78 (1999)
5. Czajkowski, K., Kesselman, C., Fitzgerald, S., Foster, I.: Grid Information Services for Distributed Resource Sharing. In: 10th IEEE International Symposium on High Performance Distributed Computing, HPDC-10 (2001)
6. Koshutanski, H., et al.: Distributed Identity Management Model for Digital Ecosystems. In: International Conference on Emerging Security Information, Systems and Technologies (SECURWARE 2007), Valencia (2007)
7. Seigneur, J.M.: Demonstration of Security Through Collaboration in Digital Business Ecosystem. In: Proceedings of the IEEE SECOVAL Workshop, Athens, Greece (2005)
8. Hughes, J., Maler, E.: Security Assertion Markup Language (SAML) v. 2.0 Technical Overview. OASIS, Working Paper (2005)
9. Liberty Alliance: Liberty Alliance Project (2011), http://www.projectliberty.org/
10. Seigneur, J.M.: Demonstration of Security Through Collaboration in Digital Business Ecosystem. In: Proceedings of the IEEE SECOVAL Workshop, Athens, Greece (2005)
11. Novotny, J.: An Online Credential Repository for the Grid: MyProxy. In: Proceedings of the IEEE Tenth International Symposium on High Performance Distributed Computing (HPDC-10), San Francisco, USA (2001)
12. Chou, W.: Inside SSL: The Secure Sockets Layer Protocol. IEEE Computer Society: IT Professional (2002)
13. Pearlman, L., et al.: A Community Authorization Service for Group Collaboration. In: Proceedings of the Third International Workshop on Policies for Distributed Systems and Networks, Monterey, USA (2002)
14. Thompson, M., et al.: Certificate-based Access Control for Widely Distributed Resources. In: Proceedings of the 8th USENIX Security Symposium, Washington, DC (1999)
15. Weise, J.: Public Key Infrastructure Overview. Sun BluePrints Online (2001)
16. Boley, H., Chang, E.: Digital Ecosystem: Principles and Semantics. In: Inaugural IEEE International Conference on Digital Ecosystems and Technologies (IEEE DEST 2007), Cairns, Australia (2007)
17. Briscoe, G., Wilde, P.: Digital Ecosystems: Evolving Service-Oriented Architectures. In: Proceedings of the 1st International Conference on Bio Inspired Models of Network, Information and Computing Systems, New York, USA (2006)
18. Pranata, I., Skinner, G.: Managing Enterprise Authentication and Authorization Permissions in Digital Ecosystem. In: 3rd IEEE International Conference on Digital Ecosystems and Technologies (DEST), Istanbul, Turkey (2009)
19. Pranata, I., Skinner, G.: Digital Ecosystem Access Control Management. WSEAS Transactions on Information Science and Applications 6, 926–935 (2009)
20. Kennedy, J.: Distributed Infrastructural Service. In: Nachira, F., Dini, P., Nicolai, A., Le Louarn, M., Leon, L.R. (eds.) Digital Ecosystem Technology. European Commission: Information Society and Media (2007)
21. Apache JMeter (July 2011), http://jakarta.apache.org/jmeter/
Reservation-Based Charging Service for Electric Vehicles Junghoon Lee, Gyung-Leen Park, and Hye-Jin Kim Dept. of Computer Science and Statistics Jeju National University, 690-756, Jeju-Do, Republic of Korea {jhlee,glpark,hjkim82}@jejunu.ac.kr
Abstract. This paper designs a telematics service capable of providing electric vehicles with a reservation-based charging mechanism, aiming at improving the acceptance ratio. Over the telematics network, each vehicle retrieves the current reservation status of charging stations of interest and then sends a reservation request specifying its requirements on charging amount and time constraints. Receiving the request, the charging station checks whether it can meet the requirement of the new request without violating the constraints of already admitted requests. In this admission test, the charging scheduler, which may run in the charging station or a remote data center, implements a genetic algorithm to respond promptly to the fast-moving vehicle. The performance measurement results, obtained from a prototype implementation, show that the proposed scheme can significantly improve the acceptance ratio over the whole range of the number of tasks and permissible peak load, compared with a conventional scheduling strategy.

Keywords: Smart transportation, electric vehicle telematics, charging schedule, reservation service, acceptance ratio.
1 Introduction
Telematics means the integration of telecommunications and informatics, especially focusing on applications in vehicles. The in-vehicle telematics device is an onboard computing platform with computing power and a wireless connection to an information server. Empowered by the ongoing development of wireless communication technologies, vehicles can remain connected to the global network, through which many sophisticated services can be provided. The telematics device also provides a user interface to drivers and passengers, inviting a variety of telematics applications, which are necessarily location-based services built upon GPS technology. Examples of telematics services include real-time traffic information, path finding, and vicinity information retrieval. Nowadays, vehicle telematics is required to extend its service area to electric vehicles, or EVs for short.

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the IT/SW Creative research program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2011-(C1820-1101-0002)).

Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 186–195, 2011. © Springer-Verlag Berlin Heidelberg 2011

In addition to the classical services, the EV telematics system must consider EV-specific requirements such as online advance booking of charging spots, remote vehicle diagnostics, and time display for the next charging [1]. Even though many researchers and developers are working to improve driving range while decreasing the charging time, weight, and cost of batteries, EVs still need to be charged often, and it takes tens of minutes to charge an EV [2]. Moreover, drivers want to have their cars charged by a certain time instant, for example, before they depart for their offices, their homes, and the like. As a result, the charging station must coordinate, or schedule, multiple requests having different time constraints, charging amounts, and power consumption dynamics. In addition, when the station is charging multiple vehicles at the same time, the power consumption may quickly grow beyond the permissible bound contracted with its utility company [3]. Considering battery capacities ranging from 15 to 50 kWh, area-wide peak load can also be serious when many EVs start charging during a short time window. Without an appropriate distribution of EVs over charging stations, not only can the waiting time increase intolerably, but the power consumption may also exceed the permissible range, possibly resulting in extra cost. The availability of charging station information can distribute and even assign EVs over multiple stations. Information retrieval from a vehicle on the move is one of the typical applications of vehicle telematics. After retrieving the charging station information through the telematics system via the wireless vehicle network, a driver can select when and where to charge. The telematics device, manipulated by the driver, attempts to make a reservation at a station and possibly changes the route by a new path plan.
The telematics system collects the current status of each charging station, generally via a wired network including the Internet. On the charging station's side, each station must be able to post its current prices and estimated waiting time, considering the current queue length and confirmed reservations. The stations inherently try to serve as many vehicles as possible. In addition, they have some restrictions on the maximum permissible power, the number of chargers, and so on. As a result, each station must schedule the requests from vehicles according to its scheduling policy and announce its status to EVs via the telematics system. In this regard, this paper designs an EV telematics service capable of providing efficient charging station selection. For EVs, charging information is provided to let them make a charging plan, while the telematics application integrates the decision into the vehicle's routing schedule. In addition, for stations, a charging scheduler is needed to estimate the waiting time and to keep the peak load below the power level contracted with the energy supplier. Here, scheduling is in most cases a very complex, time-consuming problem, greatly sensitive to the number of tasks. It is difficult to solve with conventional optimization schemes, whose severe execution times make them impractical in a real system. Accordingly, we will develop a scheduling scheme based on a genetic algorithm [4], an efficient search technique based on principles of natural selection and genetics.
This paper is organized as follows: After introducing the problem in Section 1, Section 2 describes the background of this paper. Section 3 designs an EV telematics service for efficient EV charging. The performance measurement results are discussed in Section 4. Finally, Section 5 summarizes and concludes this paper with a brief introduction of future work.
2 Background
The smart grid is a next-generation power network which combines information technology with the legacy power network to optimize energy efficiency [5]. EVs are one of the most important components of the smart grid, as their batteries are efficiently charged via the smart grid to replace a petroleum-based transportation infrastructure that creates considerable air pollution. EVs need a nationwide power charging infrastructure [6], creating new business models such as the management of diverse vehicle types, charging stations, and subsidiary services [7]. A charging facility can be installed not just in a commercial charging station; we can further consider service areas such as universities, offices, public institutes, shopping malls, airport parking lots, and the like. Many vehicles will concentrate in those places, and they must be served according to a well-defined reservation and scheduling strategy [8]. Moreover, as EVs are necessarily equipped with one or more vehicle network interfaces, they can easily interact with a charging scheduler and other telematics services, which may reside even in a remote computing cluster [9]. Meanwhile, the telematics system can provide diverse information services to drivers, taking advantage of two-way communication between the drivers and services. Particularly, IEEE 802.11 WLAN and DSRC (Dedicated Short Range Communication) provide vehicle-to-vehicle communication, while cellular networks such as GSM (Global System for Mobile communications) support globally ubiquitous connections. The service usually exploits the current location and underlying geographic information such as road networks and POIs (Points Of Interest). Many services will be available to EVs, for example, on-demand information fuelling, remote vehicle diagnostics, interior pre-conditioning, green report generation for monthly EV miles, and the like [10]. Among these, efficient charging is the most essential.
Considering the already available telematics service framework by which EVs and stations can interact, the scheduling policy in the charging station is the most critical factor for guaranteeing reasonable waiting times to drivers as well as keeping the energy consumption below the contracted amount. As for energy consumption scheduling, our previous work designed a power management scheme capable of reducing the peak power consumption [11]. It finds the optimal schedule for a task set consisting of nonpreemptive and preemptive tasks, each of which has its own consumption profile as in [12]. To deal with the intolerable scheduling latency for large numbers of tasks and slots, the feasible combinatory allocations of preemptive tasks are precalculated in advance of search space expansion. Then, for every partial allocation consisting only of nonpreemptive tasks, the scheduler maps the combination of each preemptive
task to the allocation table one by one, checking the peak power requirement. This scheme significantly reduces the scheduling time by pruning unnecessary branches in the search space and seems to work efficiently for charging tasks as well. However, it does not consider other constraints such as the number of chargers, precedence relations, and the permissible contracted amount. Moreover, the speedup is still not enough for practical use, so further enhancement must be achieved. Besides, current DSM (Demand Side Management) programs consider appliance scheduling in homes and buildings, and some of them can be applied to the charging scheduler [13,14].
3 Scheduling Scheme

3.1 EV Service Architecture
Even though multihop vehicle-to-vehicle networks can connect vehicles and charging stations without cost, the connection is not stable. Hence, we assume that vehicles and stations are connected via a global cellular network. A telematics service works between drivers and charging stations to support efficient information exchange, as shown in Figure 1. The information necessary for charging services includes the estimated distance covered on the current charge, the availability and booking status of charging stations, the locations of charging stations, and the state of charging [10]. Drivers generally have several options and decide on a station according to their preferences on waiting time, cost, and the like. To this end, the current reservation status of each charging station is posted on the telematics server so that drivers can retrieve this information. After selecting a station to contact, a driver sends a reservation request specifying its requirements. When receiving a request, the station must be able to check whether it can accept the request. The station conducts this test and returns the result fast enough for the moving vehicle to decide and confirm its reservation.
Fig. 1. EV telematics architecture
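The request/admission exchange described above can be sketched as follows. The request fields mirror the requirements discussed in the text (vehicle type, estimated arrival time, deadline, charging amount), but the admission test itself is a simplified stand-in of our own: it assumes a unit per-slot power demand instead of a real load profile and a fixed 20-slot horizon, whereas the paper's scheduler evaluates full consumption profiles with a genetic algorithm.

```python
from dataclasses import dataclass

@dataclass
class ReservationRequest:
    vehicle_type: str      # used to look up the power load profile
    arrival_slot: int      # estimated arrival time A_i
    deadline_slot: int     # desired completion time D_i
    charge_slots: int      # operation length U_i

def admission_test(request, admitted, contracted_peak):
    """Hypothetical station-side check: accept the request only if some start
    time serves it by its deadline without pushing the per-slot load of the
    already admitted requests past the contracted peak. A unit per-slot
    demand stands in for the real load profile."""
    horizon = 20
    load = [0] * horizon
    for start, length in admitted:
        for k in range(length):
            load[start + k] += 1
    for start in range(request.arrival_slot,
                       request.deadline_slot - request.charge_slots + 1):
        if all(load[start + k] + 1 <= contracted_peak
               for k in range(request.charge_slots)):
            return start          # confirmed start slot
    return None                   # rejected; the driver may try another station

req = ReservationRequest("sedan-ev", arrival_slot=2, deadline_slot=10, charge_slots=3)
print(admission_test(req, admitted=[(2, 4)], contracted_peak=1))  # → 6
```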
Each requirement consists of the vehicle type, estimated arrival time, desired service completion time (deadline), charging amount, and so on. Figure 1 shows the road network of our target area, namely, Jeju city. The charging stations are registered on this map. Based on this road map and classical real-time traffic information, the vehicle can quite accurately estimate when it will arrive at a specific charging station. The station must, of course, be reachable with the remaining battery charge. Receiving the request, the scheduler prepares the power load profile of the vehicle type from the well-known vehicle specification database. Then, it checks whether the station can meet the requirement of the new request without violating the constraints of already admitted requests, based on its own scheduling strategy. The result is delivered back to the vehicle, and the driver may accept the schedule, attempt a renegotiation, or choose another station. Entering the reserved station, the vehicle is assigned a charger and waits in the queue according to the schedule. Actually, an EV is connected to a power line while it is waiting; however, power supply can begin under the control of the scheduler by activating the connection relay. Hence, our scheduling model disregards the vehicle switch time, which is analogous to context switch overhead.

3.2 Charging Scheduler
The reservation service can work in the station or on a high-performance server in a data center. The scheduler must decide whether a station can accept the request before the requesting vehicle passes by its vicinity, so accuracy can be somewhat sacrificed for fast computation. Each charging operation can be modeled as a task. For a task, the power consumption behavior can vary according to the charging stage, remaining amount, vehicle type, and so on. The load power profile is practical for characterizing the power consumption dynamics along the battery charging stages [12]. Web portals like Google PowerMeter also centralize energy consumption data about their users, and such profiles can be exploited in generating a better charging schedule [14]. In the profile, the power demand is aligned to fixed-size time slots, during which the power consumption is constant, considering the availability of automatic voltage regulators. The length of a time slot can be tuned according to the system requirements on schedule granularity and computing time. In a power schedule, the slot length can be a few minutes, for example, 5 minutes; this length coincides with the time unit generally used in real-time price signals. A charging task Ti can be modeled by the tuple <Ai, Di, Ui>. Tasks are practically nonpreemptive in charging stations: even though a task can be preempted in the single-user case, as in an individual home, in the charging station the charging process continues to the end once it has started. The charging order can be changed only before the charging operation begins. Ai is the activation time of Ti, namely, the estimated arrival time of the vehicle; Di is the deadline; and Ui denotes the operation length, which corresponds to the length of the consumption profile. Each task can start anywhere from its activation time to its latest start time, which can be calculated by subtracting Ui from Di. When a start time is
selected, the profile entries are copied into the allocation table one by one, as the task can be neither suspended nor resumed during its operation. The choice is bounded by M, the number of time slots in the scheduling window; hence the time complexity of search space traversal for a single charging task is O(M), making the total complexity O(M^N) for optimal schedules, where N is the number of tasks. It takes tens of minutes, or sometimes a couple of hours, on an average-performance PC to generate an optimal schedule, which investigates all feasible schedules.

In contrast, genetic algorithms are efficient search techniques based on principles of natural selection and genetics. They have been successfully applied to find acceptable solutions to problems in business, engineering, and science within a reasonable time bound. Each evolutionary step generates a population of candidate solutions and evaluates the population according to a fitness function to select the best solutions, which mate to form the next generation. Over a number of generations, good traits dominate the population, resulting in an improvement in the quality of the solutions. It must be mentioned that the genetic algorithm process can run for a long time without finding any better solution than it did in the early part of the process.

In our scheduling problem, a chromosome corresponds to a single feasible schedule and is represented by a fixed-length integer-valued vector. Each element denotes the start time of a charging task; as tasks cannot be suspended once they have begun, the start time alone is enough to describe a task's behavior in a schedule. For example, if the consumption profile for a task is (3, 4, 5, 2) and the corresponding vector element is 2, the allocation for this task will be (0, 0, 3, 4, 5, 2, ...). As a result, the allocation vector can be converted into the allocation table, which has N rows and M columns.
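The chromosome decoding just described can be sketched directly. The profiles below are illustrative assumptions, with the first matching the (3, 4, 5, 2) example from the text.

```python
M = 20  # number of time slots in the scheduling window

# Assumed consumption profiles for N = 3 tasks (one row per task).
profiles = [[3, 4, 5, 2], [2, 2, 2], [5, 1]]

def decode(chromosome):
    """Convert a chromosome (one start time per task) into the N x M
    allocation table by copying each profile in at its start slot."""
    table = []
    for start, profile in zip(chromosome, profiles):
        row = [0] * M
        row[start:start + len(profile)] = profile
        table.append(row)
    return table

def peak_load(table):
    """Per-slot power requirement is the column sum; the peak is its maximum."""
    return max(sum(col) for col in zip(*table))

chromosome = [2, 0, 4]      # start time of each task
table = decode(chromosome)
print(table[0][:8])         # [0, 0, 3, 4, 5, 2, 0, 0], as in the text's example
print(peak_load(table))     # 10
```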
For an allocation, the scheduler can calculate the per-slot power requirement and the peak load. If the peak load exceeds the permissible bound, the fitness value for this allocation is set to the lowest, and the iteration will discard this allocation. The fitness function evaluates the quality of an allocation before the process proceeds to the next step. Each iteration consists of selection and recombination. Selection is a method that picks parents according to the fitness function; Roulette Wheel selection gives precedence for mating to chromosomes having better fitness values. Recombination, or crossover, takes two parents and produces a child in the hope that the child will be a better solution. This operation randomly selects a pair of crossover points and swaps the substring between them from each parent. This step may generate schedules identical to existing ones in the population; as it is meaningless to keep multiple instances of a single schedule, duplicates are replaced by new random chromosomes. The charging scheduler is subject to the time constraint of each charging task. However, this constraint is always met, as the scheduler selects the start time only within the valid range, namely, from Ai to Di − Ui. In addition, we can directly control the number of iterations according to the allowable scheduling time: a more accurate schedule can be found with more scheduling time, that is, more iterations.
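A minimal sketch of the two genetic operators described above, Roulette Wheel selection and two-point crossover. The population and fitness values are illustrative assumptions; in the scheduler, an allocation whose peak load exceeds the permissible bound would receive the lowest fitness.

```python
import random

def roulette_select(population, fitnesses):
    """Pick a parent with probability proportional to its fitness value."""
    total = sum(fitnesses)
    r = random.uniform(0, total)
    acc = 0.0
    for chrom, fit in zip(population, fitnesses):
        acc += fit
        if acc >= r:
            return chrom
    return population[-1]

def two_point_crossover(p1, p2):
    """Swap the substring between two random cut points, as described above."""
    i, j = sorted(random.sample(range(len(p1)), 2))
    return p1[:i] + p2[i:j] + p1[j:]

population = [[2, 0, 4], [1, 3, 5], [0, 0, 6], [3, 1, 2]]  # start-time vectors
fitnesses = [4.0, 2.0, 1.0, 3.0]   # higher is better, e.g. derived from peak load
parent_a = roulette_select(population, fitnesses)
parent_b = roulette_select(population, fitnesses)
child = two_point_crossover(parent_a, parent_b)
print(len(child) == 3)   # chromosomes keep their fixed length
```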
4 Performance Measurement
This section evaluates a prototype of the proposed allocation scheme implemented in Visual C++ 6.0. Our implementation runs on a platform equipped with an Intel Core2 Duo CPU, 3.0 GB of memory, and the Windows Vista operating system. The experiment sets the schedule length, namely M, to 20 time units. If a single time unit equals 5 min, the total schedule length is 100 min, but the time scale is a tunable parameter. For each task, the activation time is selected randomly between 0 and M, and the operation length is also selected randomly but truncated if the finish time, namely, the sum of the start time and the operation length, exceeds M. The deadline of a charging task is, on average, 1.5 times as large as the charging time. In addition, the power demand for each time slot takes a value of 1 through 5. The power scale, for example, kW, is not explicitly specified in this experiment, as it is a relative term. The experiment consists of measurements of the acceptance ratio as well as the effect of the number of iterations. For every parameter setting, 50 task sets are generated. The scheduler runs on every request arrival: if n tasks are already accepted, the next request invokes scheduling of (n + 1) tasks. Hence, the schedulability of a task set is a critical performance metric. We define the acceptance ratio as the ratio of the number of accepted task sets to the total number of task sets. If the peak load of the schedule generated by a scheduling scheme is less than the contracted power, the set is considered to be accepted. The uncoordinated scheduler is selected for performance comparison, as in [12]. This scheme initiates each task as soon as it is ready and makes it run without preemption. This approach employs no control strategy, but it is important as it provides a baseline for a comparative assessment of the efficiency of other charging strategies.
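The task-set generation and the uncoordinated baseline can be sketched as follows. This is a simplified reconstruction of the experimental setup: deadlines and the exponential distribution of the power requirement are omitted, so the numbers it produces are not the paper's results.

```python
import random

M = 20           # slots in the scheduling window, as in the experiment
CONTRACTED = 20  # permissible peak load used in the first experiment

def random_task():
    """Generate one task roughly as described: a random activation slot in
    [0, M), a random operation length clipped at the window end, and a
    per-slot power demand of 1 through 5."""
    start = random.randrange(M)
    length = min(random.randrange(1, 6), M - start)
    return start, [random.randint(1, 5) for _ in range(length)]

def uncoordinated_peak(tasks):
    """Uncoordinated baseline: every task starts as soon as it is ready;
    the peak load is the maximum per-slot column sum."""
    load = [0] * M
    for start, profile in tasks:
        for k, p in enumerate(profile):
            load[start + k] += p
    return max(load)

random.seed(7)
task_sets = [[random_task() for _ in range(10)] for _ in range(50)]
accepted = sum(uncoordinated_peak(ts) < CONTRACTED for ts in task_sets)
acceptance_ratio = accepted / len(task_sets)
print(0.0 <= acceptance_ratio <= 1.0)  # True
```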
The first experiment measures the acceptance ratio according to the number of charging tasks submitted to a charging station. The experiment changes the number of tasks from 3 to 15, while the power requirement amount is exponentially distributed with the same average. The contracted power is set to 20; hence, if the peak load of a schedule for a given task set is less than 20, the charging station can accept the task set. As shown in Figure 2(a), the uncoordinated scheme misses task sets even when the number of tasks is just 3, and cannot accept any set from the point where the number of tasks reaches 9. On the contrary, the proposed scheme can accept all task sets until the number of tasks is 6, and then misses more sets as the number of tasks increases; when there are 10 tasks, the acceptance ratio falls below 10 %. A direct comparison of the acceptance ratios of the two schemes is almost one-sided, as the gap reaches 76 % when the task set includes 6 tasks. In any case, the proposed scheme shifts the breakdown point by at least 4 tasks; that is, our scheme can service at least 4 more charging tasks. The second experiment measures the effect of the contracted power, namely, the permissible peak load, on the acceptance ratio. In this experiment, the number of tasks is set to 10. According to Figure 2(b), when the permissible bound reaches 32, our scheme can accept all task sets, whereas the uncoordinated scheme shows just a 45 % acceptance ratio. Judging from the result, the proposed
Fig. 2. Acceptance ratio according to the task set and contracted power: (a) effect of the number of tasks; (b) effect of contracted power, each comparing the proposed and uncoordinated schemes
scheme enables the charging station manager to contract for at least 50 % less power for the same acceptance ratio, compared with the uncoordinated case. This can help increase business profits. Up to this point, the number of iterations is 1000 and the population size is 25. The next experiment measures the effect of the genetic algorithm-specific parameters on the acceptance ratio. Figure 3(a) plots the acceptance ratio according to the number of iterations for two cases where the task sets have 8 and 10 tasks, respectively. Further iterations improve the acceptance ratio, but the effect is not significant in either case once the number of iterations exceeds 1000. In addition, Figure 3(b) plots the execution time according to the number of iterations. The execution time is measured using the Microsoft Windows GetTickCount system call, which has 1 ms granularity. Figure 3(b) shows that the execution time is exactly linear in the number of iterations, while the difference between the two sets having 8 and 10 tasks is very small. By this, we can decide a reasonable iteration number according to the desired execution time bound and accuracy.

Fig. 3. Acceptance ratio according to the genetic algorithm parameters: (a) effect of iterations; (b) execution time, each for task sets of 8 and 10 tasks
5 Concluding Remarks
In this paper, we have designed a telematics service capable of providing an efficient reservation mechanism to EVs, which can replace fossil fuels and reduce gas emissions, aiming at their fast penetration into our daily lives. Charging stations first post their reservation status and estimated waiting times on the telematics server; EVs retrieve this information to select a station and send a reservation request specifying their requirements on charging amount and time constraints. Receiving the request, the charging station checks whether it can meet the requirement of the new request without violating the constraints of already admitted requests. In this admission test, the charging scheduler, which may run in the charging station or a remote data center, implements a genetic algorithm to respond promptly to the fast-moving vehicle. The performance of our design has been measured with a prototype implementation in terms of the acceptance ratio according to the number of tasks and the permissible peak load. The analysis shows that the proposed scheme remarkably improves the acceptance ratio for the given parameter settings compared with a conventional uncoordinated scheduling scheme, accepting at least 4 more charging tasks and possibly contracting for 50 % less power for the same acceptance ratio. As future work, we are planning to extend our work to global peak load reduction among a set of charging stations connected to a single utility company. As the provider sets pricing tariffs that differentiate rates by time and level, such global optimization is very important for energy cost saving [13]. In addition, it can help avoid a system-wide power shortage, which would make it necessary to rebuild the cable system or build more power plants. Such goals can be achieved by an efficient scheduling scheme.
References

1. Guille, C., Gross, G.: A Conceptual Framework for the Vehicle-to-grid (V2G) Implementation. Energy Policy 37, 4379–4390 (2009)
2. Markel, T., Simpson, A.: Plug-in Hybrid Electric Vehicle Energy Storage System Design. In: Advanced Automotive Battery Conference (2006)
3. Spees, K., Lave, L.: Demand Response and Electricity Market Efficiency. The Electricity Journal, 69–85 (2007)
4. Katsigiannis, Y., Georgilakis, P., Karapidakis, E.: Multiobjective Genetic Algorithm Solution to the Optimum Economic and Environmental Performance Problem of Small Autonomous Hybrid Power Systems with Renewables. In: IET Renewable Power Generation, pp. 404–419 (2010)
5. Gellings, C.W.: The Smart Grid: Enabling Energy Efficiency and Demand Response. CRC Press, Boca Raton (2009)
6. Morrow, K., Karner, D., Francfort, J.: Plug-in Hybrid Electric Vehicle Charging Infrastructure Review. Battelle Energy Alliance (2008)
7. Kaplan, S.M., Sissine, F.: Smart Grid: Modernizing Electric Power Transmission and Distribution; Energy Independence, Storage and Security. TheCapitol.Net (2009)
Reservation-Based Charging Service for Electric Vehicles
8. Schweppe, H., Zimmermann, A., Grill, D.: Flexible In-vehicle Stream Processing with Distributed Automotive Control Units for Engineering and Diagnosis. In: IEEE 3rd International Symposium on Industrial Embedded Systems, pp. 74–81 (2008)
9. Ipakchi, A., Albuyeh, F.: Grid of the Future. IEEE Power & Energy Magazine, 52–62 (2009)
10. Frost & Sullivan: Strategic Market and Technology Assessment of Telematics Applications for Electric Vehicles. In: 10th Annual Conference of Detroit Telematics (2010)
11. Lee, J., Park, G., Kim, S., Kim, H., Sung, C.: Power Consumption Scheduling for Peak Load Reduction in Smart Grid Homes. In: ACM Symposium on Applied Computing, pp. 584–588 (2011)
12. Derin, O., Ferrante, A.: Scheduling Energy Consumption with Local Renewable Micro-Generation and Dynamic Electricity Prices. In: First Workshop on Green and Smart Embedded System Technology: Infrastructures, Methods, and Tools (2010)
13. Mohsenian-Rad, A., Wong, V., Jatskevich, J., Leon-Garcia, A.: Autonomous Demand-side Management Based on Game-theoretic Energy Consumption Scheduling for the Future Smart Grid. IEEE Transactions on Smart Grid 1, 320–331 (2010)
14. Caron, S., Kesidis, G.: Incentive-based Energy Consumption Scheduling Algorithms for the Smart Grid. In: IEEE SmartGridComm (2010)
Intelligent Ubiquitous Sensor Network for Agricultural and Livestock Farms

Junghoon Lee1, Hye-Jin Kim1, Gyung-Leen Park1, Ho-Young Kwak2, and Cheol Min Kim3

1 Dept. of Computer Science and Statistics
2 Dept. of Computer Engineering
3 Dept. of Computer Education
Jeju National University, 690-756, Jeju-Do, Republic of Korea
{jhlee,hjkim82,glpark,kwak,cmkim}@jejunu.ac.kr
Abstract. This paper designs and implements an intelligent ubiquitous sensor network architecture for agricultural and livestock farms, which embrace a variety of sensors and create a great volume of sensor data records. To efficiently and accurately detect specific events in this great amount of sensor data, which may include not just erroneous terms but also correlative attributes, the middleware module embeds empirical event patterns and a knowledge description. For the filtered data, the data mining module opens an interface to define the relationship between environmental aspects and facility control equipment, set the control action trigger condition, and integrate new event detection logic. Finally, the remote user interface for monitoring and control is implemented on Microsoft Windows, Web, and mobile device applications.

Keywords: Ubiquitous sensor network, middleware, rule-based data processing, event detection, control box interface.
1 Introduction
Nowadays, wireless sensor networks have been successfully applied to environmental and wildlife habitat monitoring [1], while their intelligent and efficient management improves the productivity and revenue of agricultural and livestock farms [2]. Sensor data, inherently quite different from traditional data records, are created in the form of a real-time, continuous, ordered sequence of sensor readings. Here, the temporal order can be decided either implicitly by arrival time or explicitly by timestamp, so a data stream is defined as a continuous sequence of tuples. The structure of data items in a data stream can change over time. Moreover, many data streams can include a spatial tag in addition to the temporal order, possibly hosting a geographic application on the sensor network.
This research was supported by the MKE (The Ministry of Knowledge Economy), through the project of Region technical renovation, Republic of Korea.
Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 196–204, 2011. © Springer-Verlag Berlin Heidelberg 2011
A sensor network can be viewed as a large database system which responds to queries issued from various applications [3,4]. For example, the SyncSQL language expresses composable queries over streams, pointing out that composition of queries, and hence supporting views, is not possible in the append-only stream model [5]. This language employs the tagged stream model, in which a data stream is treated as a sequence of modifications over a given relation. Particularly, the sliding-window approach is generalized by introducing the synchronization principle, which empowers SyncSQL with a formal mechanism to express queries with arbitrary refresh conditions. Besides, this work includes an algebraic framework for SyncSQL queries, a couple of equivalences and transformation rules, and a query-matching algorithm.

The main task of ubiquitous sensor networks, or USN in short, is monitoring sensor values, deciding the control actions, and triggering appropriate actuators [6]. For example, if the current lightness drops below the permissible level, the USN can turn on the light in the greenhouse. Moreover, if the current CO2 level is higher than a specific bound, a ventilator is activated to refresh the air. To this end, a lot of sensors are installed over the wide target area and each of them reports its sensor values to the controller, creating a tremendous amount of data records. The USN must handle this large volume of sensor records and analyze them. Here, more than one sensor record may be correlated, as they capture the same event, and the records have sequential or spatial correlation. Moreover, sensor values can contain garbage and measurement errors. The instability of wireless networks can also jeopardize correct analysis. A wrong reaction stemming from wrong data analysis can burn actuator motors, waste power, and lead to many hazardous problems.
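The monitor-decide-actuate cycle just described can be sketched as a simple threshold-rule loop. This is a minimal illustration only; the sensor names, bounds, and actuator commands are hypothetical and not taken from the paper:

```python
# Minimal sketch of a USN monitor-decide-actuate cycle.
# Sensor names, thresholds, and actuator commands are hypothetical examples.

RULES = [
    # (sensor, predicate on its value, actuator command to trigger)
    ("lightness", lambda v: v < 300.0,  ("light", "on")),
    ("co2",       lambda v: v > 1000.0, ("ventilator", "on")),
]

def decide_actions(readings):
    """Map a dict of sensor readings to the list of actuator commands."""
    actions = []
    for sensor, predicate, command in RULES:
        if sensor in readings and predicate(readings[sensor]):
            actions.append(command)
    return actions
```

For instance, a reading of 120 lx lightness and 1500 ppm CO2 would trigger both the light and the ventilator.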
In this regard, this paper designs and implements a USN architecture for agricultural and livestock farms, aiming at efficiently and accurately handling the great volume of sensor data obtained from a variety of sensor devices and generating correct control actions. Our implementation focuses on the data processing middleware that interacts with sensor nodes containing CO2, temperature, humidity, lightness, and wind sensors. The system design opens an interface to define a rule to filter the raw data, correlate multiple streams, and decide the control action. Next, the remote user interface for monitoring and controlling the USN is implemented on Windows, Web, and mobile device applications. The rest of this paper is organized as follows: After issuing the problem in Section 1, Section 2 describes the background and related work, focusing on the target USN architecture. Section 3 describes the raw data processing and middleware processing of the proposed system, respectively. Section 4 presents the user interface implementation details. Finally, Section 5 concludes this paper with a brief introduction of future work.
2 Background and Related Work
Under the research and technical project named Development of convergence techniques for agriculture, fisheries, and livestock industries based on the ubiquitous sensor networks, our project team has designed and developed an intelligent USN framework [7]. This framework provides an efficient and seamless runtime environment for a variety of monitor-and-control applications on sensor networks. The sensor node, built on the Berkeley mote platform, comprises sensors, a microprocessor, a radio transceiver, and a battery [8]. Over the sensor network, which mainly exploits the ZigBee technology, composite sensors detect events such as body heat changes of livestock via attached biosensors, as well as humidity, CO2, and NH3 levels via the environmental sensors. Each node runs the IP-USN protocol and implements the corresponding routing schemes [9]. The sensor network and the global network, namely the Internet, are connected through the USN gateway. At this stage, the system is to integrate a remote control model to provide remote irrigation and the activation of heaters or fans.
Fig. 1. Agricultural USN framework
Our previous work has designed an intelligent data processing framework in ubiquitous sensor networks, implementing its prototype [7]. Much focus is put on how to handle the sensor data stream as well as the interoperability between the low-level sensor data and application clients. This work first designs a systematic middleware which mediates the interaction between the application layer and low-level sensors, for the sake of analyzing a great volume of sensor data by filtering and integrating it to create value-added context information. Then, an agent-based architecture is proposed for real-time data distribution to forward a specific event to the appropriate application, which is registered in the directory service via the open interface. The prototype implementation demonstrates that this framework can not only host a sophisticated application on the USN but also autonomously evolve into new middleware, taking advantage of promising technologies such as software agents, XML, and the like. Particularly, cloud computing can provide a high-speed data processing framework for sensor streams [10].

It must be mentioned that XML data stream processing is also of interest, as XML becomes a common part of information systems, including RFID (Radio Frequency IDentification), ad-hoc sensor data collection, network traffic management, and so-called service-oriented architecture [11]. Generally, XML streams are created as a second-hand product obtained from information exchange in XML systems, rather than from raw sensor values. XML data streams can be viewed as a sequence of XML documents, and each data item in the stream is a valid standalone XML document, which is independent of other items in the stream. Moreover, queries on a data stream can support data mining and filtering. While the first evaluates queries that span a long time period, processing a great deal of time-sequenced data, the second takes from the stream the data items matching the filtering condition. Anyway, processing XML has an attractive real-world motivation, and our system will also take advantage of XML technologies for interactions between data processing modules.
3 Intelligent USN Architecture

3.1 Raw Data Processing
Each sensor output must be converted to daily-life values. First, the sensor board consistently supplies 2500 mV to the soil humidity sensor device, which generates a voltage value of 250 through 1000 mV. Here, 250 mV corresponds to 0 % humidity, while 1000 mV corresponds to 100 % humidity. Next, pyranometer sensors are used to measure the solar radiation flux density on a planar surface, generally in watts per square meter. According to the sensor device specification, 220 mV is detected in full sunlight, namely, 1100 W m−2. Hence, by Eq. (1), we can obtain the solar radiation value:

Sr = So × Cv = 220 mV × 5.0 W m−2/mV = 1100 W m−2,    (1)

where Sr is the solar radiation, So is the sensor output, and Cv is the conversion factor of 5 W m−2/mV. Anemometer sensors, commonly used in weather station instruments, measure wind speed and direction. The device calculates the wind direction based on the probed voltage values measured from different angles. The relationship between the output voltage and the phase angle is provided as shown in Figure 2, and the corresponding measurement value is estimated as in Eq. (2):

θ[deg] = (Vout − 2431) / (−6.8473)    (2)

In addition, the wind speed is measured by counting the number of rotations of a wind cup during a unit time. Namely,

Ws = (π/t) × Nr,    (3)
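The conversions in Eqs. (1)–(3) can be collected into a few helper functions. This is a sketch using the constants quoted in the text; the function names and the linear interpolation for soil humidity are our own, not the paper's code:

```python
import math

def soil_humidity_percent(mv):
    """Map the soil humidity sensor output (250-1000 mV) to 0-100 %."""
    return (mv - 250.0) / (1000.0 - 250.0) * 100.0

def solar_radiation(mv, cv=5.0):
    """Eq. (1): Sr = So * Cv, with Cv = 5 W m^-2 per mV."""
    return mv * cv

def wind_direction_deg(v_out):
    """Eq. (2): phase angle estimated from the probed output voltage."""
    return (v_out - 2431.0) / -6.8473

def wind_speed(rotations, t):
    """Eq. (3): Ws = (pi / t) * Nr, from wind-cup rotations per unit time."""
    return math.pi / t * rotations
```

As a sanity check, `solar_radiation(220.0)` reproduces the full-sunlight value of 1100 W m−2 stated in the text.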
Fig. 2. Phase angle and wind direction (output voltage in mV versus phase angle in degrees)
where Ws is the wind speed estimation and Nr is the number of rotations during time interval t.

3.2 Middleware Layer Processing
Middleware works between the sensor interface and high-end data analyzers, as shown in Figure 3. To begin with, as the collected data may have erroneous readings and garbage values, the middleware is required to first check the validity range of the collected data and prevent multiple reactions to a single event. To this end, the duplicate detector module checks the event length and value changes for the real-time sensor data, storing the filtered events in the database. The controller logic can define the control target and reaction range to activate the predefined control action for a specific event. Now, from the database, the meaningful context is extracted by sophisticated classification and time series analysis of the event-level sequences. The series of event patterns and interpreted knowledge are embedded in the analysis module to recognize abnormal conditions instantly. Moreover, our system opens an interface to define the relationship between environmental aspects and facility control equipment, set the control action trigger condition, and integrate new event detection logic. The inference engine defines a set of rules to detect events. To define a rule, each sensor and node is assigned a unique identifier, while max(), min(), average(), count(), and run() functions are provided for better event specification. Using this, we can specify several rules, for example, report an event when the average temperature of node 123 is higher than 35, or turn on all fans installed at sensor node 452. Based on this rule base, the middleware checks the validity of the sensor data and requests retransmission if it contains an error term. After calculating the difference from the previous sensor reading, the middleware detects an abnormal condition based on the empirically obtained event patterns and knowledge. This procedure is illustrated in Figure 4.
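A rule of the kind described above might be represented as follows. This is a hedged sketch: the aggregate names (max, min, average, count) come from the text, but the rule representation and evaluation logic are our assumptions, not the paper's actual syntax:

```python
# Hypothetical sketch of the rule interface described above: each rule pairs
# an aggregate over a node's recent readings with a triggered action.

AGGREGATES = {
    "max": max,
    "min": min,
    "average": lambda vals: sum(vals) / len(vals),
    "count": len,
}

def evaluate_rule(rule, readings_by_node):
    """Return the rule's action if its condition holds, else None."""
    values = readings_by_node.get(rule["node"], [])
    if not values:
        return None
    aggregate = AGGREGATES[rule["agg"]](values)
    return rule["action"] if aggregate > rule["threshold"] else None

# Example rule: "report an event when the average temperature of node 123
# is higher than 35".
rule = {"node": 123, "agg": "average", "threshold": 35.0, "action": "report_event"}
```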
Fig. 3. Middleware architecture (between the sensor interface and the data/sensor management layer: MSMQ (Microsoft Message Queue) messaging, data conversion with Excel extension and XML exchange, error processing with exception checks and error policies, duplicate sensor data checks, and event detection logic)

Fig. 4. Event logic (rule-based value processing connecting the data manager, database manager, event detection, data mining, and actuator control process)
4 User Interface and Control Action
Remote monitor-and-control keeps track of the environmental sensor values on temperature, humidity, lightness, and CO2. It can turn the power switch of each actuator device on or off. For example, the temperature monitor tracks the current temperature of a specific position selected via the geographic map. According to the initiation command, the server module begins to collect and store sensor readings. During the lifetime of this operation, event detection is carried out based on the criteria specified in the query. The client also retrieves the current temperature value to monitor the up-to-date temperature change. Figure 5 shows the user interface implemented in this application. First of all, it displays the map and the location of the sensors along with their current status. In addition, the series of sensor values is scrolled in the listbox, while a graph is created to plot the temperature change. Figure 5(a) indicates the normal status, where no sensor value deviates from the given bound and all nodes are marked blue. Whereas, in Figure 5(b), one sensor node detects a value out of the normal range and this node turns red. In addition, a remote control interface is implemented in an embedded control box and a u-Multi smart mote. First, the embedded control box application interacts with the control system via TCP/IP. As a Microsoft Windows application, it sends a control command such as a current status retrieval or a specific control action trigger, as shown in Figure 6. In response to this command, the sensor network sends an acknowledgment back to the control box. In addition, the sensor network can automatically notify the event of the current sensor reading approaching the final permissible borderline. This control box application is also implemented as a web program. Second, the u-Multi smart mote is functionally similar to the control box, except that its communication interface is SMS (Short Message Service) instead of TCP/IP and it is developed on a smaller user display. One of the most critical events is power breakage; the battery can survive tens of minutes to allow a failure recovery procedure, including notification to human managers.

Fig. 5. User interface: (a) normal status, (b) abnormal status detection

Fig. 6. Control box interface
5 Concluding Remarks
In this paper, we have designed and developed an intelligent ubiquitous sensor network targeting agricultural and livestock farms, which have a variety of sensors and create a great volume of sensor data records during the monitoring phase. To efficiently and accurately detect specific events in this great amount of sensor data, which may include erroneous terms and correlative components, the middleware module embeds empirical event patterns and a knowledge description. It also interprets sensor-specific data into actual values. For the filtered data, the data mining module opens an interface to define the relationship between environmental aspects and facility control equipment, set the control action trigger condition, and integrate new event detection logic. Finally, the remote user interface for monitoring and controlling the USN is implemented in Windows, Web, and mobile device applications. As future work, we are planning to design an advanced data inference engine for management information as well as sensor data [13]. The sophisticated data analysis will create a new type of management messages, and those messages will make the USN more intelligent.
References

1. Golab, L., Özsu, M.T.: Issues in Data Stream Management. ACM SIGMOD Record 32, 5–14 (2003)
2. Lee, J., Park, G., Kim, H., Kim, C., Kwak, H., Lee, S., Lee, S.: Intelligent Management Message Routing in Ubiquitous Sensor Networks. In: Int'l Conference on Computational Collective Intelligence - Technologies and Applications (2011)
3. Madden, S., Franklin, M., Hellerstein, J., Hong, W.: TinyDB: An Acquisitional Query Processing System for Sensor Networks. ACM Transactions on Database Systems 30 (2005)
4. Madden, S., Franklin, M.: Fjording the Stream: An Architecture for Queries over Streaming Sensor Data. In: Proc. of the 2002 Intl. Conf. on Data Engineering (2002)
5. Ghanem, T., Elmagarmid, A., Larson, P., Aref, W.: Supporting Views in Data Stream Management Systems. ACM Transactions on Database Systems 35(1) (2010)
6. Culler, D., Estrin, D., Srivastava, M.: Overview of Sensor Networks. IEEE Computer 37, 41–49 (2004)
7. Lee, J., Park, G., Kwak, H., Kim, C.: Efficient and Extensible Data Processing Framework in Ubiquitous Sensor Networks. In: International Conference on Intelligent Control Systems Engineering, pp. 324–327 (2011)
8. http://www.tinyos.net
9. Cuevas, A., Urueña, M., Laube, A., Gomez, L.: LWESP: Light-Weight Exterior Sensornet Protocol. In: IEEE International Conference on Computers and Communications (2009)
10. Kang, M., Kang, D., Crago, S., Park, G., Lee, J.: Design and Development of a Run-time Monitor for Multi-Core Architectures in Cloud Computing. Sensors 11, 3595–3610 (2011)
11. Ulrych, J.: Processing XML Data Streams: A Survey. In: WDS Proc. Contributed Papers, pp. 218–223 (2008)
12. Madden, S., Franklin, M., Hellerstein, J., Hong, W.: The Design of an Acquisitional Query Processor for Sensor Networks. In: ACM SIGMOD (2003)
13. Woo, H., Mok, A.: Real-Time Monitoring of Uncertain Data Streams Using Probabilistic Similarity. In: Proc. of IEEE Real-Time Systems Symposium, pp. 288–300 (2007)
Queue-Based Adaptive Duty Cycle Control for Wireless Sensor Networks

Heejung Byun1 and Jungmin So2

1 Dept. of Information and Telecommunication Engineering, Suwon University, Hwaseong-si, Gyeonggi-do, Korea
[email protected]
2 Dept. of Computer Engineering, Hallym University, Chuncheon, Gangwon-do, Korea
[email protected]

Abstract. This paper proposes a control-based approach to duty cycle adaptation for wireless sensor networks. The proposed method controls the duty cycle through queue management in order to achieve high performance under variable traffic rates. To achieve energy efficiency while minimizing delay, we design a feedback controller which dynamically adapts the sleep interval time to traffic changes by constraining the queue length at a predetermined value. Based on control theory, we analyze the adaptation behavior of the proposed controller and demonstrate system stability. The simulation results show that the proposed method outperforms existing scheduling protocols by achieving more energy savings while minimizing the delay.

Keywords: Wireless sensor networks, energy, delay, queue management, analytic analysis.
1 Introduction
Wireless sensor networking (WSN) has a wide range of applications that help to sense and monitor environmental attributes, such as target tracking, infrastructure security, fire detection, and traffic control. These networks are usually deployed in an ad hoc manner, with all nodes sharing the same communication medium. Typically, a WSN is composed of a large number of distributed sensor nodes which are often battery-powered and required to operate for years with no human intervention after deployment. Therefore, a major problem in deploying WSNs is their dependence on limited battery power. Many research efforts in recent years have focused on developing power saving methods for WSNs. These methods include power-efficient MAC layer protocols [1]-[9] and network layer routing protocols [10]-[11]. These protocols save energy but introduce extra end-to-end delay, i.e., sleep delay. In WSNs, delay is a key factor for delay-sensitive applications, such as health or military applications. Many approaches have been proposed to achieve a good tradeoff between energy consumption and delay [5]-[9]. Adaptive listening [5] suggests the use of overhearing to reduce the sleep delay. DSMAC [6] dynamically changes each node's duty cycle to meet applications' demands, so that a node increases its duty cycle by adding extra active periods when it requires less latency or when the traffic load increases. U-MAC [7] tunes its duty cycle based on a utilization function, which is the ratio of the actual transmissions and receptions performed by the node over the whole active period. RL-MAC [8] optimizes active and sleep periods with the double aim of increasing throughput and saving energy based on an MDP (Markov decision process). DutyCon [9] proposes a feedback controller which controls the duty cycle to guarantee an end-to-end communication delay while achieving energy efficiency. To do this, DutyCon decomposes the end-to-end delay requirement problem into a set of single-hop delay requirement problems. However, DutyCon requires a time stamp on the sender side to calculate the time delay at the receiver, and the use of the slack time of each packet results in a slow response to traffic changes.

In this paper, we propose an adaptive duty cycle control mechanism based on queue management with the aims of energy saving and delay reduction. The queue states implicitly reflect the network status, so that we can infer traffic variations or topology changes. Using the queue length of a sensor node and its variations, we present a control-based approach and design a distributed duty cycle controller which adapts the sleep interval time to the variable traffic rate. Based on control theory, we derive the steady state and show system stability for the proposed controller.

Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 205–214, 2011. © Springer-Verlag Berlin Heidelberg 2011
2 Duty Cycle Control

2.1 Network Modeling
We first introduce the following notations.

– G = (M, L), a WSN where M is the node set and L is the link set of the network.
– lm, the outgoing wireless link of node m (0 ≤ m ≤ M − 1), where M is the cardinality of M.
– τc, the time period of duty cycle control; specifically, the duration of time slot [n, n + 1).
– Dlm, the link transmission rate at link lm.
– qlm, the queue length of link lm.
– cm, the time length of the sleep interval of node m.
– wm, the input process, i.e., the number of packets that arrive during time slot [n, n + 1), including the traffic generated by node m itself.

The queue length of link lm at the (n + 1)-th iteration can be modeled as:

qlm(n + 1) = [qlm(n) + wm(n) − Dlm(τc − km(n)·cm(n) − mod(τc, cm(n) + Ta))]+
           = [qlm(n) + wm(n) − Dlm·km(n)·Ta]+    (1)

where km(n) = ⌊τc/(cm(n) + Ta)⌋ and [·]+ = max(·, 0). The value Ta stands for the active period, which has a fixed size. Note that the value of cm(n) remains constant during iteration [n, n + 1). Active periods are of fixed size, whereas the length of sleep periods depends on the value determined by the duty cycle controller. During a single control period there may be multiple active times, and the queued packets are transmitted during the active times. We assume that the average network condition of a link does not change frequently; thus, we take Dlm to be stable during a control period [9].
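Model (1) is straightforward to simulate. The following sketch implements one iteration of the queue update; the parameter values used below are illustrative only, not taken from the paper:

```python
import math

def queue_step(q, w, D, tau_c, c, T_a):
    """One iteration of model (1): w packets arrive, and D*k*T_a packets
    are served over the k = floor(tau_c / (c + T_a)) active periods of
    length T_a that fit into the control period tau_c."""
    k = math.floor(tau_c / (c + T_a))
    return max(q + w - D * k * T_a, 0.0)
```

For example, with tau_c = 10, c = 4, and T_a = 1 there are k = 2 active periods per control period, so a link rate of D = 1 serves 2 packets per iteration.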
2.2 Duty Cycle Controller Design
Based on the network model, we design the distributed duty cycle controller. We control the duty cycle of each node by dynamically adjusting the sleep interval time under variable traffic conditions. In each control period, the controller adjusts a node's sleep interval time using the local information available at the node. The proposed scheme does not need to gather information about its neighbors' states, such as the suffered delay [7]-[9]. Using the local information, we propose a dynamic duty cycle controller to meet time-varying or spatially non-uniform traffic loads by constraining the queue length at a predetermined threshold:

cm(n + 1) = cm(n) + β(qlm^th − qlm(n + 1)) − γ(qlm(n + 1) − qlm(n))    (2)

where β and γ are the control parameters to be chosen, and qlm^th is the queue threshold. The sleep interval time increases linearly as the queue length becomes smaller than the queue threshold. Meanwhile, the sleep interval time decreases as the forward difference of the queue length becomes larger than zero, because an increased forward difference of the queue length means increased latency. The values of β and γ determine the stability of the controller. The range of β and γ for stable performance is established using a stability analysis of the closed-loop system shown in the next section. The queue threshold can be set according to the application requirement. When the queue threshold is low, a node increases the duty cycle by adding active periods, resulting in low latency. On the contrary, as the queue threshold becomes larger, the delay increases because the proposed controller increases the sleep interval time in order to buffer packets until the queue length reaches the queue threshold. Hence, for delay-sensitive applications, the queue threshold can be set to a rather small value. Since each node can be assigned a different duty cycle, the sender has to synchronize its duty cycle with the receiver such that receiver and sender are active at the same time. Therefore, a node needs to exchange its determined schedule with its neighbors. As in S-MAC [2], we assume that each node maintains a schedule table that stores the schedules of all its neighbors and that the sensor nodes exchange schedules with their neighbors using ACK packets.
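The controller update (2) itself is a one-liner. The sketch below shows it directly; the parameter values in the example are hypothetical:

```python
def next_sleep_interval(c, q_prev, q_new, q_th, beta, gamma):
    """Controller (2): lengthen the sleep interval when the queue is below
    its threshold, shorten it when the queue length is growing."""
    return c + beta * (q_th - q_new) - gamma * (q_new - q_prev)
```

With c = 10, a queue that dropped from 5 to 3 against a threshold of 8, and gains beta = 0.1, gamma = 0.05, both terms push the sleep interval up (to 10.6), reflecting the light load.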
2.3 Stability Analysis
Based on the network model and duty cycle controller, we analyze the system stability for the cases of variable τc and fixed τc, respectively. First we consider the case of variable τc such that km = 1 for all iterations. Then the system can be represented by a discrete-time model where the duration of the control period equals (cm(n) + Ta):

qlm(n + 1) = [qlm(n) + wm(n) − Dlm(τc − cm(n))]+
cm(n + 1) = cm(n) + β(qlm^th − qlm(n + 1)) − γ(qlm(n + 1) − qlm(n))    (3)
Let qlms and cms denote the average steady-state solutions of the queue length qlm(n) and the sleep interval time cm(n) of node m, respectively. Note that in the neighborhood of the steady state, we ignore the saturation nonlinearity. From the asymptotic theory [12]-[14], we obtain the average steady points of the queue length and sleep interval time:

qlms = qlm^th
cms = τc − wma/Dlm    (4)
where wma denotes the average value of wm. Thus, from (4), we can see that the queue length converges to the desired threshold in the steady state and the sleep interval time is adapted in consideration of the traffic rate. Now we analytically show the stability of the proposed controller around the steady point. Let

δqlm = qlm − qlms,  δcm = cm − cms,  δwm = wm − wma.

Then (3) can be rewritten as:

δqlm(n + 1) = δqlm(n) + δwm(n) + Dlm·δcm(n)
δcm(n + 1) = δcm(n) − β·δqlm(n) − (β + γ)Dlm·δcm(n) − (β + γ)δwm(n)    (5)

For the purpose of analytic simplicity, we concentrate on networks where the traffic load is arbitrarily constant at the average steady point. However, the results of this paper can be generalized to a stochastic traffic load. Let x(n) = [δqlm(n) δcm(n)]^T. Then the characteristic polynomial of (5) can be obtained as:

Φ(z) = z² + ((β + γ)Dlm − 2)z + 1 − γDlm    (6)

In order for the controller to be stable, Φ(z) should have all its zeros within the unit circle. Hence the system is asymptotically stable if the control parameters satisfy the following relation:

(β + 2γ)Dlm < 4    (7)
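Condition (7) can be checked numerically against the closed-loop matrix of system (5), whose eigenvalues are the zeros of Φ(z). The sketch below computes the eigenvalue magnitudes for illustrative parameter choices; the sample values are ours, not from the paper:

```python
# Numeric sanity check of condition (7): when (beta + 2*gamma)*D < 4 (with
# beta*D > 0 and gamma*D > 0), both eigenvalues of the closed-loop matrix
# [[1, D], [-beta, 1 - (beta + gamma)*D]] of system (5) lie inside the
# unit circle.

def eig_magnitudes(beta, gamma, D):
    """Magnitudes of the eigenvalues of the 2x2 closed-loop matrix."""
    a, b = 1.0, D
    c, d = -beta, 1.0 - (beta + gamma) * D
    tr, det = a + d, a * d - b * c
    disc = tr * tr - 4.0 * det
    if disc >= 0.0:
        root = disc ** 0.5
        return [abs((tr + root) / 2.0), abs((tr - root) / 2.0)]
    # complex-conjugate pair: |z| = sqrt(det)
    return [det ** 0.5, det ** 0.5]

# (0.2 + 2*0.1) * 1.0 = 0.4 < 4, so this choice should be stable.
mags = eig_magnitudes(beta=0.2, gamma=0.1, D=1.0)
```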
From now on, we consider the case of fixed τc. Then there may be multiple active times and km can be variable:

qlm(n + 1) = [qlm(n) + wm(n) − Dlm·km(n)·Ta]+    (8)
cm(n + 1) = cm(n) + β(qlm^th − qlm(n + 1)) − γ(qlm(n + 1) − qlm(n))    (9)

From (8) and (9), the average steady points of the queue length and sleep interval time are derived:

qlms = qlm^th
cms = Ta(Dlm·τc/wma − 1)    (10)
Thus, we can see that the queue length converges to the desired threshold and the sleep interval time is adapted in inverse proportion to the traffic rate. With the notation of δqlm, δcm, δwm, we rewrite (8) and (9) as:

δqlm(n + 1) = f1(δqlm(n), δcm(n))    (11)
δcm(n + 1) = f2(δqlm(n), δcm(n))    (12)

where

f1(δqlm(n), δcm(n)) = δqlm(n) + δwm(n) − ζ(δcm(n))
f2(δqlm(n), δcm(n)) = δcm(n) − β·δqlm(n) − (β + γ)(δwm(n) − ζ(δcm(n)))
T τ
a c and ζ(δcm (n)) = Ta +δclm . m (n)+cms The qualitative behavior of a nonlinear system near a steady point can be determined via linearization with respect to that point. We approximate the nonlinear system as described in (11) and (12) by the following linear system:
    δql_m(n+1) = a11 δql_m(n) + a12 δc_m(n)    (13)
    δc_m(n+1) = a21 δql_m(n) + a22 δc_m(n)    (14)
Rewriting this equation in vector form, we obtain:

    x(n+1) = A x(n)    (15)

where

    A = [ a11  a12 ]  =  [ ∂f1/∂δql_m   ∂f1/∂δc_m ]
        [ a21  a22 ]     [ ∂f2/∂δql_m   ∂f2/∂δc_m ]   evaluated at δql_m = 0, δc_m = 0

      = [  1     D_lm T_a τ_c / (T_a + c_ms)^2           ]
        [ −β     1 − (β+γ) D_lm T_a τ_c / (T_a + c_ms)^2 ]
Then the system is asymptotically stable if the control parameters satisfy the following relation:

    (β + 2γ) D_lm T_a τ_c / (T_a + c_ms)^2 < 4

4.      if (t->exit) return;
5.      t->entry_point(t->ws);
6.      notify_job_completion(t);
7.   }
8. }

Fig. 3. Pseudo-code of the shadow process
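Fig. 3 is only partially visible in this excerpt (its first lines fall outside it). As a hedged, executable reading of the loop, assuming a wait_for_activation primitive that blocks the shadow process until the dispatcher resumes it:

```python
# Executable sketch of the shadow process loop of Fig. 3. The surrounding
# while-loop and the wait primitive are assumptions for illustration; only
# steps 4-6 (exit test, job body, completion notification) appear in the text.
class ShadowTask:
    def __init__(self, entry_point, ws):
        self.entry_point = entry_point  # job body of the real-time task
        self.ws = ws                    # the task working set
        self.exit = False               # set when the task is deleted

def shadow_process(t, wait_for_activation, notify_job_completion):
    while True:
        wait_for_activation()       # suspended until the dispatcher resumes it
        if t.exit:                  # task deletion requested
            return
        t.entry_point(t.ws)         # execute one job on the working set
        notify_job_completion(t)    # execution_completed back to coordinator
```

Here the two callbacks stand in for the RTOS suspend/resume primitive and the message-passing primitive of the infrastructure.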
3.2 The Scheduling Coordinator and RTOS Dispatchers

In order to put this model concretely into operation, the infrastructure needs to carry out two fundamental activities: realizing the GSP and enforcing its actions. In order to
336
E. Faldella and P. Tucci
decouple these two aspects, we introduce two dedicated software components: the scheduling coordinator and the RTOS dispatchers.

The scheduling coordinator, which from the software point of view is a single process for the whole system that can be instantiated on any of the m processors, implements the GSP, taking all the scheduling decisions for the system. It perceives only a high-level, platform-independent view of the multiprocessor system. Its operation is decoupled, by means of message-passing interaction, from the underlying shadow process model, which is handled by the RTOS dispatchers. More in detail, the following messages are envisaged:

• create_task: sent by the scheduling coordinator to the m dispatchers when a real-time task is created through the create_real_time_task primitive, in order to let the dispatchers instantiate the corresponding shadow process.
• activate_task: sent by the scheduling coordinator to a dispatcher to activate (release or resume) a job on a specific processor.
• execution_completed: sent by a dispatcher to the scheduling coordinator to notify the completion of a job when the corresponding shadow process invokes the notify_job_completion primitive.

The state of the real-time tasks is rendered to the scheduling coordinator through two data structures, the task state vector and the processor state vector, which are updated as a result of the message exchange with the RTOS dispatchers (Fig. 4). The first keeps track of the evolution of each task, reporting, along with its temporal attributes, its current state (Table 1) and processor affinity. The second reflects the current state (either IDLE or BUSY) of each processor, as well as the real-time task currently running on it when in the BUSY state.
Fig. 4. Overview of the restricted-migration scheduling infrastructure
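The three-message protocol above can be sketched as plain data types. The message names come from the text; every field name is an illustrative assumption, not taken from the implementation.

```python
# Illustrative sketch of the coordinator/dispatcher message protocol.
# Field names ("kind", "task_id", "processor") are assumptions.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class MsgType(Enum):
    CREATE_TASK = auto()          # coordinator -> all m dispatchers
    ACTIVATE_TASK = auto()        # coordinator -> one dispatcher
    EXECUTION_COMPLETED = auto()  # dispatcher -> coordinator

@dataclass
class Message:
    kind: MsgType
    task_id: int
    processor: Optional[int] = None  # target core, meaningful for ACTIVATE_TASK

# e.g. releasing (or resuming) a job of task 3 on processor 1:
msg = Message(MsgType.ACTIVATE_TASK, task_id=3, processor=1)
```

In the actual system these records would travel over the FIFO-based mailbox channels described in Section 4; here they merely fix the shape of the exchange.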
The role of the RTOS dispatchers is twofold: enforcing the actions issued by the scheduling coordinator (task creation and activation) and notifying back job completions. From the software standpoint, the former is realized through a service routine whose execution is triggered by the mailbox hardware. Correspondingly, the dispatcher performs the actions envisaged by the shadow process model, resuming the new task and possibly suspending the previously running task (if any) on the local RTOS. Analogously, the
Embedded Real-Time Applications on Asymmetric MPSoCs
337
notify_job_completion primitive, which is invoked by the shadow process after each job execution (Fig. 3), is modeled as a procedure which sends the execution_completed message back to the scheduling coordinator, allowing the GSP to proceed with further scheduling decisions.

Table 1. Run-time states of real-time tasks

State         Description
IDLE          The task is waiting for the next job release.
SCHEDULABLE   A new job can be released but has not yet been activated (e.g., due to higher-priority tasks).
RUNNING       The task has been activated and is currently running on the processor specified by its affinity.
PREEMPTED     Task execution has been pre-empted, after beginning on the affinity processor.
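The per-task life cycle of Table 1 can be written down as a small state machine. The states are those of the table; the transition set and event names are inferred from the surrounding text, not given explicitly by the paper.

```python
# Sketch of the task state machine kept in the task state vector.
# Event names are assumptions; states come from Table 1.
IDLE, SCHEDULABLE, RUNNING, PREEMPTED = "IDLE", "SCHEDULABLE", "RUNNING", "PREEMPTED"

TRANSITIONS = {
    (IDLE, "job_release"): SCHEDULABLE,        # a new job becomes releasable
    (SCHEDULABLE, "activate_task"): RUNNING,   # GSP activates it on a core
    (RUNNING, "preempt"): PREEMPTED,           # suspended by the dispatcher
    (PREEMPTED, "activate_task"): RUNNING,     # resumed on its affinity core
    (RUNNING, "execution_completed"): IDLE,    # job done, wait for next release
}

def next_state(state, event):
    """Advance the per-task state machine on an infrastructure event."""
    return TRANSITIONS[(state, event)]
```

A full job cycle then reads IDLE → SCHEDULABLE → RUNNING → IDLE, with an optional RUNNING → PREEMPTED → RUNNING detour on preemption.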
4 Evaluation Methodology

Originally exploited as prototyping platforms for later ASIC implementation, FPGAs have become feasible vehicles for final designs, enabling agile integration of manifold hardware resources, such as general-purpose processors (soft-cores), memory and peripheral devices, suitably interconnected via a customizable bus. Currently available design tools leave high degrees of freedom to the designer, particularly as regards the inter-processor communication infrastructure and the memory layout. Customization options typically involve not only the choice of the memory technology, which can range from fast on-chip memory to external solutions, but also the interconnection topology, allowing the designer to tightly couple a memory to a single core, avoiding any contention, or to share it, sacrificing access times in favor of resource saving.

The Altera NIOS-II soft-core has been chosen as the reference architecture for the experimental evaluations, due to the flexibility of its integrated development environment, which permits easy customization of different hardware templates transparently supported by the bundled μC/OS-II RTOS. The NIOS-II/f fast version employed in our experiments can be further endowed with a write-back, direct-mapped data cache (D-cache), which reduces bus contention by exploiting the spatial and temporal locality of memory accesses. Lacking any hardware coherency support, explicit cache flushes and proper synchronization must be handled in software in order to guarantee the coherency of memory shared by different cores. The message-passing infrastructure has been realized using the FIFO core provided by the Altera SoPC, realizing a 1-to-m bidirectional channel between soft-cores.
Using an Altera Cyclone IV FPGA clocked at 50 MHz and combining different memory and cache layouts as shown in Table 2, we investigated four reference hardware templates based on NIOS-II/f cores: shared memory (TS), shared memory with D-cache (TSC), dedicated memory (TD), and dedicated memory with D-cache (TDC). As regards the memory technology, we used internal M9K SRAM blocks for the on-chip memory and an external SDRAM module for the shared memory. In order to preserve the memory consistency of the shadow process model in the TSC and TDC templates, explicit cache flushes are performed at job boundaries.
Table 2. Configuration of the reference hardware templates
                                     TS        TSC       TD        TDC
Instruction cache                    2 kB      2 kB      2 kB      2 kB
Data cache                           No        2 kB      No        2 kB
RTOS memory (instructions + data)    External  External  On-chip   On-chip
Tasks memory (instructions)          External  External  On-chip   On-chip
Tasks memory (data)                  External  External  External  External
The goals of the experimental evaluation are twofold.

Infrastructure overhead. Two key factors contribute to this overhead: (i) the job activation overhead, i.e. the interval that elapses between the issue of an activate_task message by the GSP and the execution of the corresponding shadow process; (ii) the job completion overhead, i.e. the interval that elapses between the completion of a job, the update of the working set, and the reception of the corresponding message by the GSP. The additional time taken by the GSP to carry out its scheduling decisions has not been accounted for, since it strongly depends on the particular GSP employed and is extensively discussed in the respective studies referred to herein.

Performance slowdown. Apart from the infrastructure overhead itself, we analyze how the run-time execution of application tasks is further biased by the hardware platform. The different hardware templates, in fact, are likely to respond differently to the workload of the real-time tasks, in particular to changes in the number of simultaneously executing cores and in their working-set size. Furthermore, the more or less frequent context switches and task migrations issued by the GSP can additionally contribute to the run-time duration. In order to account for these additional contributions and determine the factors which influence them, we set up an experimental test-bench which combines (Fig. 5) the four hardware templates (T) with 4 different numbers of cores (m), 6 working-set sizes (S), 4 preemption rates (P) and 4 migration rates (M, expressed in migrations per period), for a total of 1536 scenarios.
Fig. 5. Test-bench parameters
For each scenario, we perform the scheduling of a fixed number of 16 identical tasks, in which each job executes a CoreMark [20] instance in order to emulate a realistic workload on the working set. Task periods were chosen long enough to compensate for the duration variance due to the different platforms, avoiding overrun conditions. We employ a regular scheduling pattern relying on a quantum-driven round-robin scheme, in order to deliver a constant number of preemptions and migrations according to the configuration of each scenario. At each period the 16 tasks are arranged in m clusters and each cluster is scheduled on each core in round-robin fashion using a time-quantum of P ('NO' means that task jobs are executed sequentially). On the next period the pattern repeats, shifting the clusters by M positions.
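The cluster rotation described above can be sketched in a few lines; the function names are ours, not from the paper, but the mapping follows the stated pattern (m equal clusters, shifted by M positions per period).

```python
# Sketch of the test-bench scheduling pattern: 16 tasks in m clusters, with
# the cluster-to-core mapping shifted by M positions every period, which
# yields the constant, configurable migration rate used in the scenarios.
def make_clusters(n_tasks, m):
    """Partition task ids 0..n_tasks-1 into m equally sized clusters."""
    size = n_tasks // m
    return [list(range(i * size, (i + 1) * size)) for i in range(m)]

def core_of_cluster(cluster, period, m, M):
    """Core that executes a given cluster during a given period."""
    return (cluster + period * M) % m

clusters = make_clusters(16, 4)       # 4 clusters of 4 tasks each
assert clusters[0] == [0, 1, 2, 3]
# with M = 1 each cluster visits a different core every period:
assert [core_of_cluster(0, p, 4, 1) for p in range(4)] == [0, 1, 2, 3]
```

With M = 0 the mapping is static (no migrations), which is how the migration-free baseline scenarios are obtained.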
5 Experimental Results

Figs. 6a and 6b show the two contributions to the infrastructure overhead. Each column reports the overhead measured for each hardware template as a function of m, averaged over the variation of the S, P and M parameters, as, not surprisingly, they proved to have a negligible influence on the infrastructure overhead. The job activation measurements show that both the TD and TDC templates exhibit an almost constant overhead as m increases, since the operations performed on the shared memory are minimal. On the other hand, the TS and TSC templates exhibit worse scalability, in particular in the case of simultaneous activations on the cores, as both data and instruction ports contribute to the contention of the shared memory module when RTOS scheduling primitives are invoked. Furthermore, it might be noted that for both the dedicated and shared cases, the templates involving a data cache exhibit slightly higher overheads. The limited size of the data cache, in fact, is likely to cause a lag due to the write-back of stale cache lines prior to executing the dispatcher code, causing, for such a short routine, an effect opposite to the expected one. As regards the completion overheads, both the TS and TD templates exhibit a very limited, yet expected, contribution. The corresponding templates involving a data cache, instead, introduce a more substantial overhead (on the order of tens of microseconds), required to invalidate and write back the data cache in order to preserve working-set consistency. In this case, while the TDC template exhibits an almost linear behavior, the TSC template suffers from concurrent data and instruction cache contention, causing increased (≈ 2x) overheads in the 8-core configuration. Cumulative infrastructure overheads are shown in Fig. 6c as the sum of the two contributions.
The dedicated templates exhibit overall good scalability, inducing a small and almost constant overhead even in the 8-core configuration, while the shared templates prove to be negatively influenced by the shared-memory bottleneck.
Fig. 6. Infrastructure overhead due to job activation (a), completion (b) and cumulative (c)
In addition to the overhead directly introduced by the scheduling infrastructure, Figs. 7(a-d) show how the run-time performance of application tasks is affected by preemptions. Each of the 4 charts reports the average time required to complete a whole job, issuing preemptions at different rates (according to the P parameter), as a function of m, under each hardware template. TD proves to be the least affected template, incurring, in the {m=8 cores; P=1 ms} configuration, a slowdown of 1.8% (7 us) compared to the sequential execution case. In the corresponding template involving a data cache (TDC), preemptions caused a higher relative increment of 6.9% (5 us)
in the analogous configuration. The shared templates suffer most from preemptions: in particular, TS exhibits a slowdown of 24.5% (98 us) in the {m=8 cores; P=1 ms} configuration, while the introduction of a data cache induces in the TSC template a slowdown of 30.8% (25 us). As a broader consideration, it might be noted that the effect of the data cache on the preemption overhead is of lesser extent than the speedup it provides to task run-times. In order to provide a comparative evaluation of the overall run-time overhead factors, Figs. 8(a-d) show, for each hardware template, the relative slowdowns, highlighting, at variations of W, the difference between the slowdown due to the hardware architecture and the slowdown due to the scheduling infrastructure. For each column, the lower colored part reports the ratio between the average run-time on the m-way multiprocessor configuration performing sequential job execution and the corresponding measurement on the uniprocessor configuration. The upper (red) part shows the surplus slowdown introduced by the infrastructure, using the preemptive round-robin execution with the tightest (P = 1 ms) quantum. It may be clearly noted that the slowdown introduced by the infrastructure is definitely marginal in the TD and TS templates when compared to the slowdown introduced by the multiprocessor hardware architecture. Such slowdown becomes comparable only in the TDC and TSC templates, highlighting how preemptions suffer from a worse exploitation of the caches. As a final remark, it might be noted that none of the considered graphs reports the effect of task migrations. In fact, in all of the combinations considered, changes of the M parameter did not produce any remarkable effect on the measurements, and they have therefore been omitted.
Fig. 7. Absolute run-time performance of TD (a), TDC (b), TS (c) and TSC (d) templates, varying the m and P parameters with W = 16 kB
Fig. 8. Relative slow-down of TD (a), TDC (b), TS (c) and TSC (d) templates varying W and m parameters
6 Concluding Remarks

We presented the essential implementation details of a portable scheduling infrastructure which enables global scheduling of real-time tasks on asymmetric multiprocessor platforms, according to the restricted-migration model. We put the focus on the mechanisms which, regardless of the particular scheduling policy employed, allow job preemptions and task migrations to be performed arbitrarily on mainstream embedded AMP platforms, employing only the elementary scheduling primitives offered by almost every RTOS. In order to decouple these low-level scheduling mechanisms from user-definable high-level scheduling policies, we presented a run-time approach, called the shadow process model, which introduces two software components with the aim of managing the two aspects separately, handling the decoupling by means of message-passing interaction. We experimentally evaluated the viability of our approach employing four reference FPGA-based multiprocessor templates combining different memory models and cache layouts, and analyzed both the overhead directly introduced by our infrastructure and its further consequences on run-time performance, paying particular attention to the effect of scheduling decisions, i.e. preemptions and migrations, on task run-times. In this regard, we showed that the overhead introduced by the proposed infrastructure has a limited extent, especially on the hardware platforms which provide private memory for the RTOS. Furthermore, we showed that job preemptions induce a slowdown which is smaller than the slowdown caused by the multiprocessor parallelism. Task migrations, instead, showed no remarkable effect in the proposed approach.
As future research directions, the experimental evaluations presented herein should be extended to contemplate more complex MPSoC architectures involving other communication and interaction paradigms, such as networks-on-chip, studying the viability of the approach on hardware platforms which do not assume any shared memory. Furthermore, we plan to exploit the hardware configurability of FPGAs to replace the scheduling coordinator with a hardware implementation, freeing the soft-cores from the computational cost of the scheduling policy.
References

1. Lee, E.A.: What's ahead for embedded software? Computer 33, 18–26 (2000)
2. Martin, G.: Overview of the MPSoC design challenge. In: 43rd ACM/IEEE Design Automation Conf., pp. 274–279 (2006)
3. Tumeo, A., et al.: A dual-priority real-time multiprocessor system on FPGA for automotive applications. In: DATE 2008, pp. 1039–1044 (2008)
4. Ben Othman, S., Ben Salem, A.K., Ben Saoud, S.: MPSoC design of RT control applications based on FPGA SoftCore processors. In: ICECS 2008, pp. 404–409 (2008)
5. Joost, R., Salomon, R.: Advantages of FPGA-based multiprocessor systems in industrial applications. In: IECON 2005, p. 6 (2005)
6. Baruah, S., Carpenter, J.: Multiprocessor fixed-priority scheduling with restricted interprocessor migrations. In: ECRTS 2003, pp. 195–202 (2003)
7. Funk, S., Baruah, S.K.: Restricting EDF migration on uniform multiprocessors. In: International Conference on Real-Time Systems (2004)
8. Carpenter, J., Holman, P., Anderson, J., Srinivasan, A., Baruah, S., et al.: Handbook of Scheduling: Algorithms, Models, and Performance Analysis, pp. 30-1–30-19. Chapman and Hall/CRC, Boca Raton (2004)
9. Devi, C.U., Anderson, J.: Tardiness bounds under global EDF scheduling on a multiprocessor. Real-Time Systems 38, 133–189 (2008)
10. Lauzac, S., Melhem, R., Mosse, D.: Comparison of global and partitioning schemes for scheduling rate monotonic tasks on a multiprocessor. In: 10th Euromicro Workshop on Real-Time Systems, pp. 188–195 (1998)
11. Xie, X., et al.: Asymmetric Multi-Processor Architecture for Reconfigurable System-on-Chip and Operating System Abstractions. In: ICFPT 2007, pp. 41–48 (2007)
12. Monot, A., et al.: Multicore scheduling in automotive ECUs. In: ERTSS 2010 (2010)
13. Hung, A., Bishop, W., Kennings, A.: Symmetric multiprocessing on programmable chips made easy. In: DATE 2005, pp. 240–245 (2005)
14. Huerta, P., et al.: Exploring FPGA capabilities for building symmetric multiprocessor systems. In: SPL 2007, pp. 113–118 (2007)
15. Huerta, P., et al.: Symmetric Multiprocessor Systems on FPGA, pp. 279–283. IEEE Computer Society, Los Alamitos (2009)
16. Baruah, S.: The Non-preemptive Scheduling of Periodic Tasks upon Multiprocessors. Real-Time Systems 32, 9–20 (2006)
17. Kargahi, M., Movaghar, A.: Non-preemptive earliest-deadline-first scheduling policy: a performance study. In: MASCOTS 2005, pp. 201–208 (2005)
18. Calandrino, J., et al.: LITMUS^RT: A Testbed for Empirically Comparing Real-Time Multiprocessor Schedulers. In: RTSS 2006, pp. 111–126 (2006)
19. Faggioli, D., et al.: An EDF scheduling class for the Linux kernel. In: Real-Time Linux Workshop (2009)
20. The Embedded Microprocessor Benchmark Consortium: EEMBC Benchmark Suite
Emotional Contribution Process Implementations on Parallel Processors

Carlos Domínguez, Houcine Hassan, José Albaladejo, Maria Marco, and Alfons Crespo

Departamento de Informática de Sistemas y Computadores, Universidad Politécnica de Valencia, Valencia, Spain
[email protected]

Abstract. An emotional agent software architecture for real-time mobile robotic applications has been developed. In order to allow the agent to undertake more dynamically constrained application problem solving, the processor computation time should be reduced, and the time gained used for executing more complex processes. In this paper, the response time of the operating processes in each attention cycle of the agent is decreased by parallelizing the highly parallel processes of the architecture, namely the emotional contribution processes. The implementation of these processes has been evaluated on Field Programmable Gate Array (FPGA) and multicore processors.

Keywords: FPGA, Multicore, Load balancing, Robotics, Agents, Real-time.
1 Introduction

Robotic agents can solve problems in dynamic environments with uncertainty. The agents are supposed to have considerable autonomy to define their objectives and to apply the appropriate strategies to reach them. Many agent architectures have been proposed, from the purely reactive to the purely deliberative ones, through hybrid solutions as well. One approach widely studied by different authors is the emotional approach [7], inspired by natural emotional agents. Various models of emotion have been proposed. Many researchers mainly consider the problem of expressing the agent's emotional states, which is very useful in the communication of people with machines and between artificial agents [8]. Other researchers, however, consider the emotional process from a more general point of view, as a mechanism for the motivation of the agent's behavior [9]. In this sense, RTEA (Real-Time Emotional Agent) [10], an emotional agent architecture for real-time applications, has been developed. The RTEA architecture defines a set of operational processes (emotion, motivation and attention) which are executed together with the application processes that solve specific problems. An important parameter in RTEA, which limits the type of problems that the agent can solve, is the maximum frequency of its cycle of attention. In every cycle of attention, the processor of RTEA must complete all the operational processes (situation appraisal, emotion, motivation and attention) and must additionally have sufficient bandwidth to significantly advance the problem-solving processes, which

Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 343–352, 2011. © Springer-Verlag Berlin Heidelberg 2011
344
C. Domínguez et al.
are composed of both reactive and deliberative processes. Thus, the capacity of the processor is an important parameter in deciding how to manage, with a predetermined risk level, the solving of a problem with given dynamics. An RTEA implementation has been developed to control a mobile service robot using a general-purpose processor, in which both operative and application processes run on a single-core processor with a multitasking operating system. In this type of application, the agent has to deal with the resolution of a large set of simultaneous problems, such as transport of objects, information point, security, cleaning, etc. The agenda of the agent can grow substantially. Since the agent must select its targets by running the operational processes of appreciation, emotion, motivation and attention, and these processes must be evaluated in every cycle of attention, the relative load of these processes may be significant. In this paper we propose to reduce the relative load of the operational processes in the RTEA implementation, thereby increasing the bandwidth available for application processes, or alternatively shortening the period of the cycle of attention so as to deal with dynamic problems requiring shorter response times. Specifically, in our application we have improved the agent performance by increasing the navigation and operation speed of the mobile robot. To reduce the processing time of the operational processes we have considered two alternatives for the processor. On the one hand, the utilization of a general-purpose multicore processor, which can dedicate specific cores to run the operative processes of emotion, motivation and attention, and other cores to run the application processes, balancing the process load between the different cores.
On the other hand, the design of a special-purpose processor for the operational processes on FPGA devices, and the implementation of the full system on a hybrid processor, with general-purpose cores for the application processes and special-purpose cores for the operational processes. For the specific processor design, the operational processes have been characterized and the parallel processes have been identified. We have described the emotional processor in VHDL. The project has evaluated both processor alternatives using a set of problems of varying complexity, considering the benefits achievable with a set of commercially available FPGA and multicore processors. The rest of the paper is organized as follows: Section 2 reviews the state of the art of FPGA implementations. Section 3 describes the general characteristics of the RTEA architecture and highlights the processes to execute on the proposed processor. Section 4 describes the design of both processor alternatives. Section 5 sets up the evaluation and presents the results. Finally, Section 6 summarizes the conclusions.
2 Related Work

In [1] a review of Field Programmable Gate Array (FPGA) technologies and their contribution to industrial control applications is presented. To illustrate the benefits of using FPGAs in complex control applications, a sensorless motor controller based on the Extended Kalman Filter is studied. In [2], a coarse-grain parallel deoxyribonucleic acid algorithm for optimal configurations of an omnidirectional mobile robot with a five-link robotic arm performing fire extinguishment is presented. Hardware/software co-design and System-on-a-Programmable-Chip technology
on an FPGA are employed to implement the proposed algorithm in order to significantly shorten its processing time. A hardware-software coprocessing speech recognizer for real-time embedded applications is presented in [3]. The system consists of a standard microprocessor and a hardware accelerator for Gaussian mixture model (GMM) emission probability calculation implemented on an FPGA. The GMM accelerator is optimized for timing performance by exploiting data parallelism. The development and implementation of a generalized backpropagation multilayer perceptron architecture, described in a VLSI hardware description language, is proposed in [4]. By exploiting the reconfigurability of FPGAs, the authors are able to perform fast prototyping of hardware-based ANNs to find optimal application-specific configurations, in terms of the cost/speed/accuracy trade-offs affecting hardware-based neural networks. A design environment for the synthesis of embedded fuzzy logic controllers on FPGAs, which provides a novel implementation technique, has been developed in [5]. This technique accelerates the exploration of the design space of fuzzy control modules, and provides a co-design flow that eases their integration into complex control systems. In [6] an embedded adaptive robust controller for trajectory tracking and stabilization of an omnidirectional mobile platform is proposed. This adaptive controller is implemented on a high-performance field-programmable gate array chip using hardware/software co-design techniques and system-on-a-programmable-chip design. Simulations show the effectiveness and merit of the proposed control method in comparison with a conventional proportional-integral feedback controller.
3 Model of the Agent

In RTEA, the agent behavior is based on the concept of problem solving. A thought is an execution context of mental processes of observation, deduction, decision and action, related to the resolution of the problem generated by a desire. Every thought has a level of motivation, which is the basic parameter used in the negotiation of attention, and it therefore plays an essential role in the direction that the actual behavior takes. The mechanism of thought motivation in RTEA is emotional. An emotion is a process in which the evaluation of the current situation produces an emotional state, and this triggers as a response a motivation for a behavior related to the situation. Figure 1 shows the general flow of information in an RTEA agent. A main branch of this flow is covered by the application processes; it connects the interface devices with the environment (sensors and motors) through two major ways, the reactive one and the deliberative one. This main branch develops and materializes the effective behavior of the agent, the one that has effects on its environment, giving responses to stimuli. A second main branch of the information flow is covered by the operative processes: emotion-motivation-attention and emotion-desire. These operational processes are reactive processes, i.e. with bounded and relatively short response times. They embody the emotional behavior of the agent that causes changes in its attitude. The response of the so-called intrinsic emotions consists of establishing the level of motivation of the associated thoughts.
Fig. 1. Flow control in RTEA agent
Fig. 2. Emotional control
Figure 2 shows the branch of the emotional flow that is established from situations to motivations. Emotional process concepts are represented as normalized real numbers: Situation Appraisal [-1, +1], Emotional Contribution [-1, +1], Emotional State [0, +1], Motivation of Thought [0, +1]. Note that different appraisals can contribute to the same emotional state. For example, a "fear to crash" emotion could consider appraisals like "distance" and "relative speed", so that even at a small distance, if the object moves away, the appreciation of the collision problem may decrease. Therefore, emotional contributions should be weighted, defining a unitary partition, so that the emotional state is always defined within its normalized range [0, +1]. The situation appraisal, emotional contribution and response processes are based on appreciation, contribution and response functions, respectively. Sigmoid-type functions have been chosen because of their descriptive properties: they fit the models of appreciation, contribution and response we want to represent, with slight variations at the ends of the range, tending to asymptotic values, and abrupt variations
Emotional Contribution Process Implementations on Parallel Processors
around an inflection point in the center of the range. Specifically, sigmoid functions and hyperbolic tangents have been used. A basic hyperbolic tangent is given in (1).
y(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}    (1)
The hyperbolic tangent is an S-shaped curve, with rapid growth around its center point and two saturations at the ends, following two asymptotes. To speed up or slow down the activation and to vary its intensity, a translation, a scaling, and offsets are applied. The parametric hyperbolic tangent is shown in (2).
y(x) = \frac{e^{(x - x_0)k_x} - e^{-(x - x_0)k_x}}{e^{(x - x_0)k_x} + e^{-(x - x_0)k_x}}\, k_y + y_0    (2)
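Since the expression in (2) is algebraically a scaled and shifted hyperbolic tangent, it can be sketched in a few lines of C++. This is an illustrative sketch; the parameter names follow the symbols in (2), but the function name is our own:

```cpp
#include <cassert>
#include <cmath>

// Parametric hyperbolic tangent of Eq. (2): (x0, y0) translate the
// inflection point, kx speeds up or slows down the transition, and
// ky scales the intensity of the response.
double parametric_tanh(double x, double x0, double kx,
                       double ky, double y0) {
    // (e^a - e^-a) / (e^a + e^-a) is tanh(a), here with a = (x - x0) * kx.
    return std::tanh((x - x0) * kx) * ky + y0;
}
```

At x = x0 the function returns y0 (the center of the range), and for large |x − x0| it saturates toward the asymptotes y0 ± ky.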
In a first phase, we have identified three main parts of the emotional architecture, based on the above functions, whose critical processes could be executed in parallel: (1) the emotional motivation process, which makes a subjective appraisal of the current situation of the problem, activates an emotional state based on that appraisal, and in response motivates certain behaviors; (2) the attention process, which allocates processing resources to the problem-solving processes; (3) the set of reactive processes of the application, which require strict deadlines on the response of the agent. This paper considers the emotional motivation process (the central part of Figure 2), and more specifically the contribution process, in which a set of appraisals of the current situation contribute to establishing an emotional state.
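The contribution step of this emotional motivation process can be sketched as a weighted sum. This is our own illustrative reading: the weights form a unitary partition as described above, and the final linear mapping from the weighted sum in [-1, +1] onto the state range [0, +1] is an assumption, since the text does not fix it:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Combine appraisal contributions in [-1, +1] into an emotional state
// in [0, +1]. The weights must form a unitary partition (sum to 1),
// which keeps the weighted sum inside [-1, +1].
double emotional_state(const std::vector<double>& contributions,
                       const std::vector<double>& weights) {
    double sum = 0.0;
    for (std::size_t i = 0; i < contributions.size(); ++i)
        sum += weights[i] * contributions[i];
    // Assumed linear mapping from [-1, +1] onto the state range [0, +1].
    return 0.5 * (sum + 1.0);
}
```

For the "fear of crashing" example, the two inputs would be the "distance" and "relative speed" appraisal contributions, each with its own weight.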
4 FPGA and Multicore Agent Processes Design

4.1 Problem Definition

The computational load is related to the complexity of the application problem and to the period of the attention cycle, which in turn depends on the dynamics of the application problem (another important measure of its complexity). Since the complexity varies widely across problems, we consider a specific application example: controlling a mobile service robot in a public building. The assistant robot provides basic services (e.g., information, transportation of goods, surveillance of facilities, and cleaning) in public buildings. In this application, users make service requests, which are recorded in the agenda of the agent. The resolution of each of these problems is motivated by an emotional process. These emotional motivation processes are triggered in each attention cycle of the agent. A typical attention cycle lasts between 0.1 s and 1 s, depending on the dynamics of the problem. For the transportation problem, the number of running emotional processes may be around 10,000. For a more complex problem that integrates various services, this number could reach 8 million processes. Because of the large number of processes generated to solve a problem, a small part of the operative processes, namely the emotional motivation system, has been selected, and we have identified the possibility of executing in parallel
C. Domínguez et al.
these processes, implementing them on FPGAs and multicore processors in order to invest the saved resources in the execution of more processes that allow more complex problems to be undertaken. From this analysis it was noticed that these processes are highly parallel, and that the implementation of a small subset of the emotional architecture processes could be carried out on commercial multicore processors or on FPGAs of medium performance.

This article proposes a comparative study of the implementation of a subset of these emotional agent processes on specific FPGA-based and multicore systems, depending on the complexity of the problem (emotional contributions executed per attention cycle), in terms of the execution time of the emotional processes. To this end, an emotional process design in C++ for multicores has been considered, as well as a Matlab implementation. A VHDL design of the emotional process for FPGAs has also been proposed. For evaluation purposes, the computational load of the emotional processor has been defined as the number of emotional contributions per unit of time (MOPS, millions of operations per second). The relationship between 1 MOPS and 1 MFLOPS is 240.

4.2 FPGA Implementation

The block diagram of the implementation of one of the basic functions that compose an emotion is shown in Figure 3. To compare the performance of previous solutions with a semi-custom hardware implementation, the function is implemented using the resources of the function library provided by the development tool for FPGAs, Altera Quartus II 10.1.
Fig. 3. Block diagram of a basic function composing an emotional contribution
The design and implementation of the agent's emotional processes on FPGAs has been carried out in a modular way, using available library components in the VHDL specification language. Furthermore, a functional simulation has been performed to test the validity of the design. For synthesis, FPGA application models of varying complexity have been used, in order to analyze the level of parallelization achievable with the available resources. Then, a post-synthesis simulation has been performed to verify that the VHDL design could be implemented as logic blocks. In addition, placement and routing has been carried out to obtain good connections and thus the top operating frequency of the device. Finally, the design has been validated on an Altera STRATIX III FPGA (model EP3SE50F780C2) of medium performance.
4.3 Multicore Implementation

Regarding the implementation of the emotional processes in C++ running on a multicore, several aspects have been considered. On the one hand, the agent software, which has been executed sequentially, consists of five main modules: belief, relationship, behavior, emotion, and attention. The emotional module, which is highly parallel as mentioned above, has been extracted to run on a multicore processor (where each core is a 3.3 GHz i5 processor), and performance measures have been taken to compare them with the results obtained when executing the processes on FPGA-specific processors. The execution of the emotional process on multicore systems has been performed at the process level, using the operating system scheduler to balance the load.
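As a rough illustration of the parallelism exploited here, the batch of emotional contribution evaluations can be split across cores. The paper does this at the process level through the OS scheduler; the thread-based sketch below is our own simplification, and the tanh-based contribution function is an assumption:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Evaluate a batch of emotional contributions (modeled here as tanh
// evaluations) by splitting the batch into one slice per core.
std::vector<double> parallel_contributions(const std::vector<double>& x,
                                           unsigned cores) {
    std::vector<double> out(x.size());
    std::vector<std::thread> workers;
    const std::size_t chunk = (x.size() + cores - 1) / cores;
    for (unsigned c = 0; c < cores; ++c) {
        workers.emplace_back([&out, &x, c, chunk] {
            const std::size_t begin = c * chunk;
            const std::size_t end = std::min(x.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                out[i] = std::tanh(x[i]);  // one contribution evaluation
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

Delegating load balancing to the OS scheduler, as the paper does, avoids this manual slicing but gives up control over the work distribution.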
5 Experimental Results

Regarding multicore processors, 100 sets of 1000 tasks each have been defined, and operating system services have been used to assign task priorities as well as task affinity to cores within the processor. Experiments have been executed on a multicore machine with 1, 2, and 4 cores, where each core is a 3.33 GHz Intel Core i5. For each processor configuration, two implementations of the emotional contribution processes, in two different languages, have been executed to compare the differences between them. First, we implemented the processes in Matlab, since it is widely used software in the automation and control community. Then, the C++ implementation was developed to analyze the overhead that the compiler could generate. The results of these implementations can be seen in Figure 4.
Fig. 4. Multicore software implementation
The processes implemented in C++ provide better computing capabilities than the same processes implemented in Matlab. In the case of C++, the assignment of sets of emotional processes to the processor, when using 1, 2, and 4 cores, provides a computing capacity of around 25, 47, and 89 million operations per second, respectively. For Matlab, the number of operations per second is lower, around
9, 17, and 30 MOPS. It can be observed that the results become even more favorable for C++ as the number of cores increases: the improvements of the C++ implementation with respect to Matlab for 1, 2, and 4 cores are about 2, 2.5, and 2.8 times, respectively. Therefore, in the following studies the comparisons between FPGA and multicore will be performed with the C++ implementation of the emotional processes. The next experiment compares the results obtained with multicores against an optimized implementation of the contribution processes on an FPGA: the STRATIX III (EP3SE50F780C2) device from Altera. Speed optimization has been selected, since the bottleneck is imposed by the number of inputs and outputs of the FPGA and by the DSP (Digital Signal Processor) blocks that the device incorporates. For the proposed design, the number of DSP circuits able to operate in parallel on the STRATIX III is 7, with 4-stage pipelining (segmentation). The results can be seen in Figure 5.
Fig. 5. Multicore and FPGA performances
For the application of the service robot carrying objects, the processing capacity of both implementations (FPGAs and multicore processors) has been evaluated. A set of 150 simulations of the transportation problem, of varying complexity (e.g., varying the number of pieces and routes), has been evaluated. Depending on the complexity of the problem, between 10,000 and 8 million emotional contributions can arise in each attention cycle of the agent for each simulation, taking into account that the attention cycle of the agent on high alert has been defined as 0.2 s. In summary, Figure 5 shows the average results obtained for the set of tested simulations. The FPGA Stratix III has provided a processing capacity of about 14 MOPC (millions of emotional contribution operations per attention cycle of the agent). On the other hand, with the multicore processors, the processing capacity is on average 5 MOPC with 1 core, 9.4 MOPC with 2 cores, and 17.8 MOPC with 4 cores. For the specific service robotic application evaluated, with limited complexity (transport: from 10,000 contributions to 8 MOPC), it has been shown that the problem can be resolved with an FPGA of medium performance such as the Stratix III (14 MOPC), by using the proposed parallel design and segmentation. However, for the multicore
processors, the application requires at least 2 cores (9.4 MOPC). In this case, other cognitive processes of the agent (deliberative, planning, learning) are executed on the cores that are not being used for the calculation of emotional contributions. Note that, for more complex problems, the number of cores needed would also grow.
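The MOPC figures quoted above are consistent with a simple conversion from throughput to per-cycle capacity. This arithmetic is our reading of the numbers (the paper does not state it as a formula): with the 0.2 s high-alert attention cycle, the 25, 47, and 89 MOPS measured for 1, 2, and 4 cores yield exactly the reported 5, 9.4, and 17.8 MOPC.

```cpp
#include <cassert>
#include <cmath>

// Millions of contribution operations per attention cycle (MOPC),
// obtained from a throughput in MOPS and the attention cycle length.
double mopc(double mops, double cycle_seconds) {
    return mops * cycle_seconds;
}
```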
6 Conclusions

The analyzed FPGA, even being a development system, allows a greater number of operations per attention cycle of the agent than the dual-core processor, thanks to the proposed parallelization and segmentation. Therefore, for the prototype of the service robot, choosing the FPGA may be more convenient, as it frees the multicore processor to execute more emotional processes; otherwise, the number of cores that the agent could devote to solving more complex problems would be insufficient. It should be pointed out that the analyzed problem is of low complexity; aspects such as the attention and application processes, which have not been considered, would further load the multicore processor and worsen its chances of solving the problem. For more complex applications of the service robot (e.g., integrating multiple services simultaneously), the required computing power would be even higher, so higher-performance FPGAs should be analyzed in future work. In that case, FPGA prices would start to be prohibitive for the development of the prototype. However, the current market trend is to have, in the near future, processors with a large number of cores (e.g., 32) at a very competitive price. Under these conditions, a larger number of cores (e.g., 6) could be dedicated to parallelizing a larger number of processes for more complex service applications. In this case, aspects such as the distribution of the load between cores should be analyzed.
A Cluster Computer Performance Predictor for Memory Scheduling

Mónica Serrano, Julio Sahuquillo, Houcine Hassan, Salvador Petit, and José Duato

Department of Computer Engineering (DISCA), Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia, Spain
[email protected], {jsahuqui,husein,spetit,jduato}@disca.upv.es
Abstract. Remote Memory Access (RMA) hardware allows a given motherboard in a cluster to directly access the memory installed in a remote motherboard of the same cluster. In recent works, this capability has been used to extend the addressable memory space of selected motherboards, which enables a better balance of main memory resources among cluster applications. This approach is much more cost-effective than implementing a full-fledged shared memory system. In this context, the memory scheduler is in charge of finding a suitable distribution of local and remote memory that maximizes performance and guarantees a minimum QoS among the applications. Note that since changing the memory distribution is a slow process involving several motherboards, the memory scheduler needs to make sure that the target distribution provides better performance than the current one. In this paper, a performance predictor is designed in order to find the best memory distribution for a given set of applications executing in a cluster motherboard. The predictor uses simple hardware counters to estimate the expected impact on performance of the different memory distributions. The hardware counters provide the predictor with information about the time spent in processor, memory access, and network. The performance model used by the predictor has been validated in a detailed microarchitectural simulator using real benchmarks. Results show that the prediction accuracy never deviates more than 5% from the real results, being less than 0.5% in most of the cases.

Keywords: cluster computers, memory scheduling, remote memory assignment, performance estimation.
Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 353–362, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Since their introduction, cluster computers have been improving their performance and lowering their implementation costs with respect to supercomputers. Nowadays, it is easy to find many of these types of computer organizations in the top positions of high-performance computer rankings such as TOP500 [1]. This transition has been possible as advanced microarchitectural techniques and interconnection solutions only available in supercomputers enter the consumer market (i.e., they are commoditized), which in
turn allow new ways to improve the performance of current cluster designs while maintaining or even lowering their costs. However, since cluster architectures are loosely coupled by design, there is no standard commodity framework supporting access to memory installed on remote nodes. Therefore, to cope with applications demanding large amounts of main memory (e.g., enterprise-level databases and services, large computing-intensive parallel applications, etc.), cluster systems must rely on slower OS-based solutions such as swapping to remote RAM disks or implementing software-based shared memory. This, in turn, reduces the competitive advantages of this type of system.

So far, Remote Memory Access (RMA) hardware [2], which allows a given node to directly access remote memory, has only been available in supercomputer systems like BlueGene/L [3], BlueGene/P [4], or Cray XT [5]. Nevertheless, commodity implementations for cluster computers are already entering the market. For example, the HyperTransport consortium [6], which is composed of more than 60 members from the leading industry (AMD, HP, Dell, IBM, etc.) and universities, is extending the HyperTransport technology, enabling the development of cluster systems supporting remote memory accesses.

This work focuses on a cluster prototype that implements the aforementioned HyperTransport extensions and whose nodes are linked using a fast interconnection network. In this context, we assume that the OS running in the nodes offers inter-node memory allocation capabilities that enable the assignment of remote memory portions to local applications. As these regions have different latencies, the performance of a given application strongly depends on how its assigned memory is distributed among regions. Since each application contributes with its performance to the global performance, a memory scheduler that maximizes the global performance is required.
This memory scheduler must be aware not only of the characteristics (i.e., latency, bandwidth) of the different memory regions but also of the executing applications' memory requirements. For example, allocating 25% of the available remote memory to a memory-intensive application could lead to worse performance results than allocating the whole remote memory to an application with good cache locality.

To decide how to distribute the different memory regions among the running applications, the scheduler needs information about the expected performance of a given memory distribution. To obtain this information, two solutions can be devised: i) performing an off-line profiling of the benchmarks varying the memory distribution, and ii) dynamically predicting the performance of the benchmarks by measuring their utilization of the system resources during execution. The first solution was developed in a previous work [7], where we analyzed how the memory distribution impacts the performance of applications with different memory requirements, and presented an ideal memory allocation algorithm (referred to as SPP) that distributed the memory space among applications to maximize global performance. The generalization of SPP to any number n of applications was published in [8], where we also present an efficient heuristic algorithm that approximates the performance results provided by SPP while reducing its complexity by a factor of (n − 1)!. Both algorithms consider a quality of service (QoS) parameter for each application in order to guarantee minimum performance requirements.
In contrast to these works, this paper proposes a performance predictor that provides the information required by the memory scheduler. The main aim of the proposed predictor is to be used by the memory scheduler to maximize system performance while guaranteeing specific QoS requirements. To perform the predictions, three sample executions are required for every benchmark, each one assuming that the complete working set of the benchmark is stored in a different memory region (i.e., L, Lb, or R). Using these samples, the performance of any other memory distribution is estimated. The proposed predictor is driven by a novel performance model fed by simple hardware counters (like those available in most current processors) that measure the distribution of execution time devoted to processor, memory, and network resources. Although the model can be implemented for any type of processor, this work considers in-order execution for simplicity reasons. The model has been validated by comparing its estimations with the performance values obtained by executing real benchmarks in the Multi2Sim simulation framework [9]. The results show that the dynamic predictor is very accurate, since its deviation with respect to the real results is always lower than 5%, and much lower in most of the cases.

The remainder of this paper is organized as follows. Section 2 describes the system prototype. Section 3 details our proposed performance model. Section 4 validates the model by comparing its predictions with detailed cycle-by-cycle simulation results. Section 5 discusses previous research related to this work, and finally, Section 6 presents some concluding remarks.
2 Cluster Prototype

A cluster machine with the required hardware/software capabilities is being prototyped in conjunction with researchers from the University of Heidelberg [2], who have designed the RMA connection cards. The machine consists of 64 motherboards, each one including 4 quad-core 2.0 GHz Opteron processors in a 4-node NUMA system (1 processor per node) and 16 GB of RAM per motherboard. The connection to remote motherboards is implemented by a regular HyperTransport [10] interface to the local motherboard and a High Node Count HyperTransport [11] interface to the remote boards. This interface is attached to the motherboard by means of HTX-compatible cards [12]. When a processor issues a load or store instruction, the memory operation is forwarded to the memory controller of the node handling that memory address. The RMA connection cards include their own controller, which handles the accesses to remote memory. Unlike typical memory controllers, the RMA controller has no memory banks directly connected to it. Instead, it relies on the banks installed in remote motherboards. This controller can be reconfigured so that memory accesses to a given memory address are forwarded to the selected motherboard.

Since the prototype is still under construction, in order to carry out the experiments and validate the proposed performance model, the cluster machine has been modeled using Multi2Sim. Multi2Sim is a simulation framework for superscalar, multithreaded, and multicore processors. It is an application-only execution-driven microarchitectural simulator, which allows the execution of multiple applications to be simulated without booting a complete OS.
Fig. 1. Block diagram of the 2-node NUMA system model and RMA

Table 1. Memory subsystem characteristics

Characteristic                       Description
# of processors                      2 per motherboard
L1 cache: size, #ways, line size     64KB, 2, 64B
L1 cache latency                     3
L2 cache: size, #ways, line size     1MB, 16, 64B
L2 cache latency                     6
Memory address space                 512MB, 256MB per motherboard
L Latency                            100
Lb Latency                           142
R Latency                            410
In addition, the whole system has been scaled down to obtain reasonable simulation times. The scaled system consists of two motherboards, each one composed of a 2-node NUMA system, as shown in Figure 1. Each node includes a processor with private caches, its memory controller, and the associated RAM memory. Table 1 shows the memory subsystem characteristics, where memory latencies and cache organizations resemble those of the real prototype. The RMA connection cards have been assumed to have no internal storage capacity. Likewise, the Multi2Sim coherence protocol has been extended to model the RMA functionality.
3 Performance Model

A system whose running applications can be executed using different memory distributions (L, Lb, R) needs a mechanism to determine which memory distribution should be assigned to each application. This section presents a methodology for predicting the
impact on performance of the different memory distributions, and then using the predictions to guide the assignment of memory regions to applications in order to meet memory constraints and reduce performance loss. This work assumes that the predictor evaluates seven possible memory distributions (three samples and four estimated cases), since this number of data points is enough to characterize the performance of each application over the complete set of possible memory distributions [8].

To predict the performance (execution time) of a running application A with a memory assignment {L = X, Lb = Y, R = Z}, an analytical method has been designed. Existing processors implement performance counters, readable by software, for debugging purposes. In this paper, these counters are utilized by an application-to-memory assignment prediction mechanism. The counters are used to track the number of cycles spent on each considered event during a full scheduling quantum.

3.1 Analytical Model

The execution time of a given application can be estimated from two main components, as stated by equation (1):

T_{ex} = C_{Dispatch} + C_{mem\,stalls}    (1)
Each C_x is the number of processor cycles spent on a type of activity. As the dispatch width has been assumed to be 1, the execution time can be expressed as the sum of the number of dispatched instructions plus the number of cycles stalled due to memory accesses. In the devised system, stalls due to a full load-store queue (LSQ) are critical for performance, mainly in those benchmarks having a high rate of memory accesses. On the other hand, the dispatch stage remains stalled during the execution of a load instruction. This includes the accesses to both the private caches (i.e., L1 and L2) and the main memory, with their respective access times, as well as the delays related to the network or to structural hazards. To project the IPC, the performance model breaks down the memory components of the execution time into memory region-dependent and memory region-independent components:

C_{mem\,stalls} = C_{L} + C_{Lb} + C_{R} + C_{private\,caches} + C_{LSQ\,iwidth}    (2)

C_{L}, C_{Lb}, and C_{R} refer to the cycles spent on each memory region, that is, Local, Local to Board, and Remote, respectively. Each C includes the cycles due to several activities related to its memory region. In particular, stalls due to the following reasons have been taken into account:

Main memory access time. This time includes both the cycles spent reading the data from main memory and the message traffic through the memory network.

Delayed hit. This type of stall occurs when the memory access cannot be performed because the accessed block is already locked by another memory instruction, that is, a new block is being brought in.
Write concurrency. This type of stall happens because concurrent accesses to the same block in a given cache are not allowed if one of them is a write.

Full LSQ. The dispatch stage is stalled because there is no free entry in the LSQ.

The remaining components of the equation can be considered as a constant k for every memory region. The region-independent components are the following:

Private caches access time. Number of cycles spent accessing the first- and second-level caches of the system. These accesses are region-independent since no memory module is accessed.

LSQ issue width limitation. Only one load or store can be issued in a given cycle. So, if a load instruction is ready to be issued and there is an access conflict between a load and a store, they are issued in program order, and the youngest instruction retries the next cycle.

The final equation used by the performance predictor is (3):

T_{ex} = C_{Dispatch} + C_{L} + C_{Lb} + C_{R} + k    (3)
3.2 Estimating Performance

The model assumes that the target machine provides the performance counters required to obtain the values of the components of equation (3). Notice that network traffic is taken into account, so congestion is also quantified. The predictor requires running each benchmark three times to gather the values needed to project performance. Each sample corresponds to all the memory accesses being served by one single region, that is, i) all accesses to the local memory region (T_{ex,L=100%}), ii) all accesses to the other node in the local motherboard (T_{ex,Lb=100%}), and iii) all accesses to the remote memory region (T_{ex,R=100%}):

Sample 1 (L = 100%, Lb = 0%, R = 0%): T_{ex,L=100%} = C_{L:L=100%} + k
Sample 2 (L = 0%, Lb = 100%, R = 0%): T_{ex,Lb=100%} = C_{Lb:Lb=100%} + k
Sample 3 (L = 0%, Lb = 0%, R = 100%): T_{ex,R=100%} = C_{R:R=100%} + k

To predict the execution time for a given memory distribution, the predictor calculates a weighted execution time, T_{ex\,weighted}, from the three samples. It takes each non-null memory region component C of each of the samples and multiplies it by the fraction f of accesses of the destination memory region:

T_{ex\,weighted} = C_{L,L=100\%} \cdot f_{L} + C_{Lb,Lb=100\%} \cdot f_{Lb} + C_{R,R=100\%} \cdot f_{R} + k    (4)
For any given memory distribution, equation (4) can be used to predict its execution time from the components gathered for the three samples. This provides a mechanism to identify the memory distribution at which a given execution phase can run with minimal performance loss, so the prediction will be an input for the memory scheduler. Table 2 analyzes an example of prediction for the FFT benchmark, where the execution time of the memory distribution (50%, 50%, 0%) is obtained from the three samples. The estimated execution time is 2774807.8 cycles, while the real detailed cycle-by-cycle simulation execution time is 2774931 cycles, so the model obtains an estimation that deviates less than 0.005% from the target value.
Table 2. Performance predictor working example

                   C        f      C_pond
Sample 1           44687    0.5    22343.5
Sample 2           62236    0.5    31118
Sample 3           166757   0      0
k                                  2721346.3
T_ex,weighted                      2774807.8
Fig. 2. Model Validation. Detailed cycle-by-cycle simulation vs model.
4 Validating the Model

This section analyzes the prediction accuracy. We have proceeded by running experiments for the four benchmarks with eight memory distributions: i) (100%, 0%, 0%), ii) (50%, 50%, 0%), iii) (0%, 100%, 0%), iv) (75%, 0%, 25%), v) (50%, 25%, 25%), vi) (50%, 0%, 50%), vii) (25%, 0%, 75%), and viii) (0%, 0%, 100%). Then, we have taken the components of the three samples (i, iii, and viii) and applied the model to each benchmark to obtain the execution time for each of the remaining memory distributions. Finally, the Instructions Per Cycle (IPC) has been calculated for each case. Figure 2 shows the comparison of the simulated performance results (sim) against the values calculated by the performance predictor (model). The model and detailed cycle-by-cycle simulation curves overlap, since the model's deviation is lower than 5% in the worst case, and near 0% for some of the benchmarks, for instance, FFT.
360
M. Serrano et al.
5 Related Work

Previous research has addressed performance prediction to characterize and classify the memory behavior of applications. Zhuravlev et al. [13] estimated that contention for the memory controller, memory bus, and prefetching hardware contributes more to overall performance degradation than cache space contention. To alleviate these factors they minimize the total number of misses issued from each cache; to that end they developed scheduling algorithms that distribute threads so that the miss rate is evenly spread among the caches. In [14] the authors propose a classification algorithm for determining programs' cache-sharing behaviors. Their scheme can be implemented directly in hardware to provide dynamic classification of program behaviors, and they propose a very simple dynamic cache partitioning scheme that performs slightly better than the Utility-based Cache Partitioning scheme while incurring a lower implementation cost. In [15] a fast and accurate shared-cache-aware performance model for multi-core processors is proposed. The model estimates the performance degradation due to cache contention of processes running on CMPs. It uses reuse distance histograms, cache access frequencies, and the relationship between the throughput and cache miss rate of each process to predict its effective cache size when running concurrently and sharing the cache with other processes, allowing instruction throughput estimation; the average throughput prediction error of the model was 1.57%. In [16] the authors apply machine learning techniques to predict performance on multi-core processors. The main contribution of that study is the enumeration of solo-run program attributes that can be used to predict paired-run performance, where the paired run involves contention for shared resources between co-running programs.
The previous research papers focus on multicore or CMP processors, whereas the work proposed in this paper focuses on cluster computers, dealing with the problem of predicting application behavior when using remote memory, so that a scheduler can improve system performance. Other papers in the literature dealing with remote memory allocation are mainly focused on memory swapping. Liang et al. design a remote paging system for remote memory utilization in InfiniBand clusters [17]. In [18], the use of remote memory for virtual memory swapping in a cluster computer is described. Midorikawa et al. propose the distributed large memory system (DLM), a user-level software-only solution that provides very large virtual memory by using remote memory distributed over the nodes of a cluster [19]. These papers use remote memory for swapping over cluster nodes and present their systems as an improvement over disk swapping. In contrast, our research aims at predicting system performance under different configurations of remote memory assigned to applications. The predictions will be used by a memory scheduler to decide dynamically which configuration best enhances system performance.
6 Conclusions

This paper has presented a performance predictor that estimates the execution time of an application for a given memory distribution. We first carried out a study to
determine the events considered by our model, classifying them as memory-region dependent or independent. The model assumes that the number of cycles spent in each considered event is obtained from hardware counters of the target machine. The devised predictor has been used to estimate the performance of different memory distributions for four benchmarks. The accuracy of the prediction has been validated: the deviation of the model with respect to the real results is always lower than 5%, and very close to 0% in several of the studied cases. This study constitutes the first step of broader work on memory scheduling. The performance estimates produced by the predictor will feed a memory scheduler that dynamically chooses the optimal target memory distribution for each application concurrently running in the system, in order to achieve the best overall system performance.

Acknowledgements. This work was supported by Spanish CICYT under Grant TIN2009-14475-C04-01, and by Consolider-Ingenio under Grant CSD2006-00046.
References

1. Meuer, H.W.: The top500 project: Looking back over 15 years of supercomputing experience. Informatik-Spektrum 31, 203–222 (2008), doi:10.1007/s00287-008-0240-6
2. Nussle, M., Scherer, M., Bruning, U.: A Resource Optimized Remote-Memory-Access Architecture for Low-latency Communication. In: International Conference on Parallel Processing, pp. 220–227 (September 2009)
3. Blocksome, M., Archer, C., Inglett, T., McCarthy, P., Mundy, M., Ratterman, J., Sidelnik, A., Smith, B., Almási, G., Castaños, J., Lieber, D., Moreira, J., Krishnamoorthy, S., Tipparaju, V., Nieplocha, J.: Design and implementation of a one-sided communication interface for the IBM eServer Blue Gene supercomputer. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, p. 120. ACM, New York (2006)
4. Kumar, S., Dózsa, G., Almasi, G., Heidelberger, P., Chen, D., Giampapa, M., Blocksome, M., Faraj, A., Parker, J., Ratterman, J., Smith, B.E., Archer, C.: The deep computing messaging framework: generalized scalable message passing on the Blue Gene/P supercomputer. In: ICS, pp. 94–103 (2008)
5. Tipparaju, V., Kot, A., Nieplocha, J., Bruggencate, M.T., Chrisochoides, N.: Evaluation of Remote Memory Access Communication on the Cray XT3. In: IEEE International Parallel and Distributed Processing Symposium, pp. 1–7 (March 2007)
6. HyperTransport Technology Consortium: HyperTransport I/O Link Specification Revision (October 3, 2008)
7. Serrano, M., Sahuquillo, J., Hassan, H., Petit, S., Duato, J.: A scheduling heuristic to handle local and remote memory in cluster computers. In: High Performance Computing and Communications (2010) (accepted for publication)
8. Serrano, M., Sahuquillo, J., Petit, S., Hassan, H., Duato, J.: A cost-effective heuristic to schedule local and remote memory in cluster computers. The Journal of Supercomputing, 1–19 (2011), doi:10.1007/s11227-011-0566-8
9. Ubal, R., Sahuquillo, J., Petit, S., López, P.: Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors. In: Proceedings of the 19th International Symposium on Computer Architecture and High Performance Computing (2007)
10. Keltcher, C.N., McGrath, K.J., Ahmed, A., Conway, P.: The AMD Opteron Processor for Multiprocessor Servers. IEEE Micro 23(2), 66–76 (2003)
11. Duato, J., Silla, F., Yalamanchili, S.: Extending HyperTransport Protocol for Improved Scalability. In: First International Workshop on HyperTransport Research and Applications (2009)
12. Litz, H., Fröning, H., Nuessle, M., Brüning, U.: A HyperTransport Network Interface Controller for Ultra-low Latency Message Transfers. HyperTransport Consortium White Paper (2007)
13. Zhuravlev, S., Blagodurov, S., Fedorova, A.: Addressing shared resource contention in multicore processors via scheduling. In: Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 129–142 (2010)
14. Xie, Y., Loh, G.H.: Dynamic Classification of Program Memory Behaviors in CMPs. In: 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects, in conjunction with the 35th International Symposium on Computer Architecture (2008)
15. Xu, C., Chen, X., Dick, R.P., Mao, Z.M.: Cache contention and application performance prediction for multi-core systems. In: IEEE International Symposium on Performance Analysis of Systems and Software, pp. 76–86 (2010)
16. Rai, J.K., Negi, A., Wankar, R., Nayak, K.D.: Performance prediction on multi-core processors. In: International Conference on Computational Intelligence and Communication Networks (CICN), pp. 633–637 (November 2010)
17. Liang, S., Noronha, R., Panda, D.K.: Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device. In: CLUSTER, pp. 1–10. IEEE, Los Alamitos (2005)
18. Werstein, P., Jia, X., Huang, Z.: A Remote Memory Swapping System for Cluster Computers. In: Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 75–81 (2007)
19. Midorikawa, H., Kurokawa, M., Himeno, R., Sato, M.: DLM: A distributed Large Memory System using remote memory swapping over cluster nodes. In: IEEE International Conference on Cluster Computing, pp. 268–273 (October 2008)
Reconfigurable Hardware Computing for Accelerating Protein Folding Simulations Using the Harmony Search Algorithm and the 3D-HP-Side Chain Model César Manuel Vargas Benítez, Marlon Scalabrin, Heitor Silvério Lopes, and Carlos R. Erig Lima Bioinformatics Laboratory, Federal University of Technology - Paraná, Av. 7 de setembro, 3165 80230-901, Curitiba (PR), Brazil
[email protected],
[email protected], {hslopes,erig}@utfpr.edu.br
1 Introduction
Proteins are essential to life and have countless biological functions. They are synthesized in the ribosome of cells following a template given by the messenger RNA (mRNA). During synthesis, the protein folds into a unique three-dimensional structure, known as its native conformation; this process is called protein folding. Several diseases are believed to be the result of the accumulation of ill-formed proteins. Therefore, understanding the folding process can lead to important medical advances and the development of new drugs. Thanks to the several genome sequencing projects being conducted in the world, a large number of new proteins have been discovered. However, only a small number of these proteins have their three-dimensional structure known. For instance, the UniProtKB/TrEMBL repository of protein sequences currently holds around 16.5 million records (as of July 2011), while the Protein Data Bank (PDB) has the structure of only 74,800 proteins. This gap is due to the cost and difficulty of unveiling the structure of proteins from the biochemical point of view. Computer Science has an important role here, proposing models and computational approaches for studying the Protein Folding Problem (PFP). The PFP can be defined as finding the three-dimensional structure of a protein using only the information about its primary structure (i.e., the polypeptide chain or linear sequence of amino acids) [9]. The three-dimensional structure is the folding (or conformation) of a polypeptide that results from interactions between the side chains of amino acids in different regions of the primary structure. The simplest computational model for the PFP is known as the Hydrophobic-Polar (HP) model, in both two (2D-HP) and three (3D-HP) dimensions [5]. Although simple, the computational approach for searching a solution
This work is partially supported by the Brazilian National Research Council – CNPq, under grant no. 305669/2010-9 to H.S.Lopes and CAPES-DS scholarships to C.M.V. Benítez and M.H. Scalabrin.
Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 363–374, 2011. © Springer-Verlag Berlin Heidelberg 2011
364
C.M. Vargas Benítez et al.
for the PFP using the HP models has been proved NP-complete [3]. This fact emphasizes the need for heuristic and massively parallel approaches to the problem. In this scenario, reconfigurable computing is an interesting methodology because of the possibility of massively parallel processing. However, this methodology has been only sparsely explored in molecular biology applications. For instance, [7] presents a methodology for the design of a system based on reconfigurable hardware applied to the protein folding problem, where different strategies are devised to achieve a significant reduction of the search space of possible foldings. Also, [12] presents a methodology for the design of a reconfigurable computing system applied to the protein folding problem using Molecular Dynamics (MD). [13] proposes a complete fine-grained parallel hardware implementation on FPGA to accelerate the GOR-IV package for 2D protein structure prediction. [4] presents an FPGA-based approach for accelerating string-set matching for Bioinformatics research. A survey of FPGAs for acceleration of high-performance computing and their application to computational Molecular Biology is presented in [11]. The main focus of this work is to develop approaches for accelerating protein folding simulations using the Harmony Search algorithm and the 3D-HP-SC (three-dimensional Hydrophobic-Polar Side-Chain) model of proteins.
2 The 3D-HP Side-Chain Model (3D-HP-SC)
The HP model divides the 20 proteinogenic amino acids into only two classes, according to their affinity to water: hydrophilic (or polar) and hydrophobic. When a protein is folded into its native conformation, the hydrophobic amino acids tend to group in the inner part of the protein, so as to be protected from the solvent by the polar amino acids, which are preferentially positioned outwards. Hence, a hydrophobic core is usually formed, especially in globular proteins. In this model, the conformation of a protein (that is, a folding) is represented in a lattice, usually square (for the 2D-HP) or cubic (for the 3D-HP). Both 2D-HP and 3D-HP models have been frequently explored in the recent literature [9]. Since the expressiveness of the HP models is very poor from the biological point of view, a further improvement of the model is to include a bead that represents the side-chain (SC) of each amino acid [8]. A protein is thus modeled by a backbone (common to all amino acids) and a side-chain, either hydrophobic (H) or polar (P). The side-chain is responsible for the main chemical and physical properties of specific amino acids. The energy of a conformation is an inverse function of the number of amino acids that are adjacent in the structure but non-adjacent in the sequence. To compute the energy of a conformation, the HP model considers that the interactions between hydrophobic amino acids represent the most important contribution to the energy of the protein. Li et al. [8] proposed an equation that considers only three types of interactions (making no distinction between types of side-chains). In this work we use a more realistic approach, proposed by [2], to compute the
Reconfigurable Computing for Protein Folding Using Harmony Search
365
energy of a folding, observing all possible types of interactions, as shown in Equation 1:

H = ε_HH · Σ_{i=1, j>i}^{n} δ(r_ij^HH) + ε_BB · Σ_{i=1, j>i+1}^{n} δ(r_ij^BB) + ε_BH · Σ_{i=1, j≠i}^{n} δ(r_ij^BH) + ε_BP · Σ_{i=1, j≠i}^{n} δ(r_ij^BP) + ε_HP · Σ_{i=1, j>i}^{n} δ(r_ij^HP) + ε_PP · Σ_{i=1, j>i}^{n} δ(r_ij^PP)    (1)
In this equation, ε_HH, ε_BB, ε_BH, ε_BP, ε_HP, and ε_PP are the energy weights for each type of interaction, respectively: hydrophobic side-chains (HH), backbone-backbone (BB), backbone-hydrophobic side-chain (BH), backbone-polar side-chain (BP), hydrophobic-polar side-chains (HP), and polar side-chains (PP). In a chain of n amino acids, the distance (in three-dimensional space) between the i-th and j-th elements interacting with each other is represented by r_ij^**. For the sake of simplification, in this work we use unit distance between amino acids (r_ij^** = 1). Accordingly, δ is an operator that returns 1 when the distance between the i-th and j-th elements (either backbone or side-chain) for the given type of interaction is the unity, and 0 otherwise. We also use an optimized set of weights for each type of interaction, defined by [2]. During the folding process, interactions between amino acids take place and the energy of the conformation tends to decrease; consequently, the conformation tends to converge to its native state, in accordance with Anfinsen's thermodynamic hypothesis [1]. In this work we consider the negative of H, so that the PFP is treated as a maximization problem.
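A minimal sketch of how Equation 1 can be evaluated (our own Python illustration, not the paper's VHDL circuit). The weight values in `EPS` are placeholders; the paper uses the optimized set from [2].

```python
from itertools import combinations

# Placeholder weights for each interaction class (the paper uses the set from [2]).
EPS = {'HH': 2.0, 'BB': 0.1, 'BH': 0.5, 'BP': 0.5, 'HP': 0.5, 'PP': 0.5}

def delta(p, q):
    # The δ operator of Eq. (1): 1 iff the two elements sit at unit distance.
    return 1 if sum(abs(a - b) for a, b in zip(p, q)) == 1 else 0

def energy(elements):
    """elements: (kind, amino_acid_index, (x, y, z)) triples, where kind is
    'B' (backbone), 'H' (hydrophobic SC) or 'P' (polar SC)."""
    H = 0.0
    for (k1, i1, p1), (k2, i2, p2) in combinations(elements, 2):
        pair = ''.join(sorted(k1 + k2))
        if pair == 'BB' and abs(i1 - i2) <= 1:
            continue  # bonded backbone neighbours (the j > i+1 condition in Eq. 1)
        if pair in ('BH', 'BP') and i1 == i2:
            continue  # a backbone and its own side-chain (the j ≠ i condition)
        H += EPS[pair] * delta(p1, p2)
    return H

# Two amino acids side by side, both with hydrophobic side-chains:
elems = [('B', 0, (0, 0, 0)), ('H', 0, (0, -1, 0)),
         ('B', 1, (1, 0, 0)), ('H', 1, (1, -1, 0))]
# Only the two side-chains are at unit distance, giving one HH interaction.
```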
3 Harmony Search Algorithm
The Harmony Search (HS) meta-heuristic is inspired by the skills musicians use in composition, memorization, and improvisation. Musicians use these skills to pursue a perfect composition with a perfect harmony; similarly, the HS algorithm uses its search strategies to pursue the optimal solution to an optimization problem. The pseudo-code of the HS algorithm is presented in Algorithm 1 [6]. The algorithm starts with a Harmony Memory of size HMS, where each memory position is occupied by a harmony of size N (musicians). Each new harmony is improvised from the harmonies already present in the Harmony Memory. If the new harmony is better than the worst harmony in the memory, the worst one is replaced. The improvisation and memory-update steps are repeated until the maximum number of improvisations (MI) is reached. The HS algorithm can be described by five main steps, detailed below [6]:

1. Initialization and Setting of Algorithm Parameters: In the first step, as in any optimization problem, the problem is defined as an objective function
For more information see the HS repository: http://www.hydroteq.com
Algorithm 1. Pseudo-code of the Harmony Search algorithm
 1: Parameters: HMS, HMCR, PAR, MI, FW
 2: Start
 3: Objective function f(x), x = [x1, x2, ..., xN]
 4: Initialize Harmony Memory xi, i = 1, 2, ..., HMS
 5: Evaluate each harmony in HM: f(xi)
 6: cycle ← 1
 7: while cycle < MI do
 8:   for j ← 1 to N do
 9:     if random ≤ HMCR then {rate of memory consideration}
10:       x'j ← xij, with i ∈ [1, HMS] {chosen randomly}
11:       if random ≤ PAR then {pitch adjusting rate}
12:         x'j ← x'j ± r × FW {with r random}
13:       end if
14:     else {random selection}
15:       Generate x'j randomly
16:     end if
17:   end for
18:   Evaluate the new harmony: f(x')
19:   if f(x') is better than the worst harmony in HM then
20:     Update Harmony Memory
21:   end if
22:   cycle ← cycle + 1
23: end while
24: Results and views
25: End
to be optimized (line 3), which may or may not be constrained. Originally, Harmony Search was designed for solving minimization problems [6]. The four main parameters of the algorithm are also defined here: the Harmony Memory size (HMS), the Harmony Memory Consideration Rate (HMCR), the Pitch Adjusting Rate (PAR), and the Maximum number of Improvisations (MI).

2. Harmony Memory Initialization: The second step is the initialization of the Harmony Memory (HM) with a number of randomly generated harmonies (line 4). The Harmony Memory is the vector in which the best harmonies found during execution are stored. Each harmony is a vector representing a possible solution to the problem.

3. Improvise a New Harmony: In the third step, a new harmony is improvised based on a combination of several other harmonies found in HM (lines 8–17). For each variable of the new harmony, a harmony of HM is arbitrarily selected according to the probability (HMCR) of using or not using a value from memory. If a value from another harmony is used, it may receive a small adjustment (Fret Width, FW) with probability PAR. Otherwise, a random value within the range of allowed values is assigned. Thus, the parameters HMCR and PAR are responsible for establishing a balance between exploration and exploitation in the search space.
Reconfigurable Computing for Protein Folding Using Harmony Search
367
4. Update Harmony Memory: In the fourth step, each newly improvised harmony is checked to see whether it is better than the worst harmony in HM (lines 19–21). If so, the new harmony replaces the worst one in HM.

5. Verification of the Stopping Criterion: In the fifth step, at the end of each iteration, the algorithm checks whether the stopping criterion, usually the maximum number of improvisations (MI), has been met. If so, execution finishes; otherwise, the algorithm returns to the improvisation step until the stopping criterion is reached.
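The five steps above can be sketched compactly. This is a minimal continuous-domain HS in Python for illustration only (the paper's implementations are in ANSI-C and VHDL); the parameter names mirror Algorithm 1, while the default values here are chosen for the toy problem, not taken from the paper.

```python
import random

def harmony_search(f, n, bounds, hms=20, hmcr=0.9, par=0.3, fw=0.05, mi=10000):
    """Minimise f over n variables, each confined to bounds = (lo, hi)."""
    lo, hi = bounds
    # Step 2: initialise the Harmony Memory with random harmonies.
    hm = [[random.uniform(lo, hi) for _ in range(n)] for _ in range(hms)]
    cost = [f(h) for h in hm]
    for _ in range(mi):                        # Step 5: stop after MI improvisations
        new = []
        for j in range(n):                     # Step 3: improvise a new harmony
            if random.random() <= hmcr:        # memory consideration
                x = hm[random.randrange(hms)][j]
                if random.random() <= par:     # pitch adjustment by +/- r*FW
                    x += random.uniform(-1, 1) * fw
            else:                              # random selection
                x = random.uniform(lo, hi)
            new.append(min(max(x, lo), hi))
        c = f(new)
        worst = max(range(hms), key=cost.__getitem__)
        if c < cost[worst]:                    # Step 4: update the memory
            hm[worst], cost[worst] = new, c
    best = min(range(hms), key=cost.__getitem__)
    return hm[best], cost[best]

# Toy check: minimise the 3-variable sphere function.
sol, val = harmony_search(lambda x: sum(v * v for v in x), 3, (-5, 5))
```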
4 Methodology
This section describes in detail the implementation of the Harmony Search algorithm for the PFP using the 3D-HP-SC model of proteins. Four versions were developed: a desktop computer version and three different FPGA-based implementations. The FPGA-based versions were developed in VHDL (Very High Speed Integrated Circuit Hardware Description Language) and implemented on an FPGA (Field-Programmable Gate Array) device. Two of these versions also used an embedded processor (Altera's NIOS II) as part of the hardware design, while the software implementations (for both the NIOS II and the desktop computer) were developed in the ANSI-C programming language. The first hardware-based approach is a version for the 32-bit NIOS II embedded processor, and simply reproduces the software implemented on the desktop computer. The second hardware-based approach is a version for the NIOS II with a dedicated hardware block, specifically developed for computing the fitness function, as shown in Figure 1. The HS algorithm runs on the NIOS II processor, and the block called "Fitness Calculation System" works as a slave of the NIOS II. The processor is responsible for initializing the Harmony Memory, improvising new harmonies, updating the Harmony Memory and, finally, distributing the individuals (also called harmonies) to the slave block. The slave, in turn, is responsible for computing the fitness function for each individual received. The internal structure of this block is described later.
Fig. 1. Functional block diagram of the folding system with the NIOS II embedded processor
The third hardware-based approach is fully implemented in hardware and does not use an embedded processor, as shown in Figure 2. The block called "Harmony Search Core" performs the HS algorithm. The Harmony Memory initialization produces a new harmony for each position of the Harmony Memory. Each variable of each new harmony is independent of the others; therefore, each new harmony is generated in one clock pulse using a set of N random number generators, where N is the number of variables in the harmony. Once the Harmony Memory is loaded with the initial harmonies, the iterative optimization process of the HS algorithm starts. At each iteration, four individuals (harmonies) are evaluated simultaneously (in parallel), with the expectation of improved performance. In the improvisation step of the algorithm, each variable of the new harmony is selected independently; this is done in only N clock pulses, as before. After that, the Harmony Memory is updated by inserting the new harmonies in their proper positions. The following positions are shifted, discarding the four worst harmonies. To find the insertion position, the position of the worst harmony in the Harmony Memory is always maintained in a latch. Each variable to be replaced is treated simultaneously. Once the optimization process is completed, the best harmony found is transferred from the Harmony Memory to the "Fitness Calculation System" block in order to display all relevant information about the conformation represented by this harmony. The chronometer block measures the total elapsed processing time of the system. The multiplexer block selects the output data among the obtained results (energy of each interaction, number of collisions, fitness, and processing time) to be shown on a display interface. The random number generator is implemented using the Maximum Length Sequence (MLS) pseudo-random number approach.
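The MLS generator mentioned above can be sketched as a Fibonacci linear feedback shift register. This is a Python illustration, not the VHDL implementation; the tap choice (4, 3), i.e. the polynomial x^4 + x^3 + 1, is a known maximal configuration for n = 4, giving the full period 2^4 − 1 = 15.

```python
def mls_sequence(n, taps, seed=1):
    """Generate one full period (2**n - 1 states) of an n-stage
    maximal-length shift register with the given feedback taps."""
    state = seed                   # any non-zero n-bit value
    period = (1 << n) - 1
    states = []
    for _ in range(period):
        states.append(state)
        fb = 0
        for t in taps:             # feedback = XOR of the tapped stages
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & period
    return states

# n = 4 with taps (4, 3) is maximal: every non-zero 4-bit state appears once.
states = mls_sequence(4, (4, 3))
```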
MLS is an n-stage linear shift register that can generate binary periodic sequences of maximal period length L = 2^n − 1. In this work, we used n = 7 or n = 4 for all probability values mentioned in Algorithm 1, and n = 5 for generating the variables of the new harmonies in the improvisation process.

Figure 3 shows a functional block diagram of the "Fitness Calculation System", which has three main elements: a three-dimensional conformation decoder, a coordinates memory, and a fitness computation block. By calculating the energy of each type of interaction and the number of collisions between the elements (side-chains and backbone), the fitness of the conformation is obtained. The blocks that perform these operations are described as follows.

Harmony Representation: The encoding of the candidate solutions (harmonies of the HS algorithm) is an important issue and must be carefully implemented. The encoding can have a strong influence not only on the size of the search space, but also on the hardness of the problem, due to the establishment of unpredictable cross-influence between the musicians of a harmony. There are several ways of representing a folding in an individual, as pointed out in [9]: a distance matrix, Cartesian (absolute) coordinates, or relative internal coordinates. In this work we used relative internal coordinates, because it is the
Fig. 2. Functional blocks of the proposed folding system without the NIOS II embedded processor
Fig. 3. Fitness computing system
most efficient representation for the PFP using lattice models of proteins. In this coordinate system, a given conformation of the protein is represented as a set of movements in a three-dimensional cubic lattice, where the position of each amino acid of the chain is described relative to its predecessor. As mentioned in Section 2, in the 3D-HP-SC model each amino acid of the protein is represented by a backbone (BB) and a side-chain, either hydrophobic (H) or polar (P). Using relative internal coordinates in three-dimensional space, there are five possible relative movements for the backbone (Left, Front, Right, Down, Up) and five for each side-chain (left, front, right, down, up); note that the side-chain movement is relative to the backbone. The combination of these movements gives 25 possibilities. Each possible movement is represented by a symbol which, in turn, is encoded in a 5-bit binary format (the number of bits needed to represent the alphabet of 25 possible movements, between 0 and 24). Invalid values (value ≥ 25) are replaced by the largest possible value (24). Consequently, for an n-amino-acid protein, a harmony of n − 1 musicians represents the set of movements of the backbone and side-chains in the three-dimensional lattice, and the resulting search space has 25^(n−1) possible foldings/conformations.

Three-Dimensional Conformation Decoder: The harmony, representing a given conformation, has to be converted into the Cartesian coordinates that embed the conformation in the cubic lattice. Therefore, a progressive sequential procedure is necessary, starting from the first amino acid. The coordinates are generated by a combinational circuit for the whole conformation and stored in the "Coordinates Memory", which, in turn, provides the coordinates of all elements (backbone and side-chains) on a parallel output bus.
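The 5-bit symbol encoding described above can be sketched as follows (our own illustration): each musician's value selects one of the 25 backbone/side-chain movement pairs, and out-of-range values are clamped to 24, as in the text.

```python
BACKBONE = ['L', 'F', 'R', 'D', 'U']    # relative backbone moves
SIDECHAIN = ['l', 'f', 'r', 'd', 'u']   # side-chain moves (relative to the backbone)

def decode_symbol(value):
    """Map a 5-bit harmony value (0..31) onto one of the 25 movement pairs;
    invalid values (>= 25) are replaced by the largest valid symbol (24)."""
    v = min(value & 0b11111, 24)
    return BACKBONE[v // 5] + SIDECHAIN[v % 5]

def decode_harmony(values):
    # n-1 musicians describe the folding of an n-amino-acid chain.
    return [decode_symbol(v) for v in values]

print(decode_harmony([0, 24, 31]))   # ['Ll', 'Uu', 'Uu'] (31 is clamped to 24)
```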
The decoding process (harmony → conformation) works as follows. The harmony is read and decoded into a vector using the set of possible movements. Next, the elements of the first amino acid are placed in three-dimensional space. For each movement, four steps are carried out. First, the direction of the movement is obtained from the next movement symbol and the direction of the movement of the predecessor amino acid. The backbone coordinates are obtained similarly from the predecessor amino acid. The next step determines the coordinates of the side-chain of the amino acid from the movement and the coordinates of the backbone. Finally, the coordinates obtained in this process are stored in the "Coordinates Memory". Figure 4 (left) shows a conformation of a hypothetical 4-amino-acid protein, where the Cartesian coordinates of each element are represented as xi (row), yi (column), zi (depth), obtained from the relative movement of the current amino acid and the position of its predecessor. Blue balls represent polar residues and red ones hydrophobic residues; the backbone and the connections between elements are shown in gray. The search space for the protein represented in this figure has 25^(n−1) = 25^3 = 15625 possible conformations. Here, the folding is formed by three movements: Ul→Dl→Dl. In this figure, the backbone and the side-chain of the first amino acid of the chain are also
Fig. 4. Left: Example of relative 3D movements of a folding. Right: Diagram representing the possible interactions between the elements of a protein chain.
indicated: they are set to the origin of the coordinate system, (0, 0, 0) and (0, −1, 0), respectively.

Fitness Function: In this work, we used a simplified fitness function based on the one formerly proposed by [2]. Basically, this function has two terms: fitness = H − (NC · PenaltyValue). The first term is the free energy of the folding (H, see Equation 1) and the second is a penalty that decreases the fitness value according to the number of collisions in the lattice. The energy term takes into account the number of hydrophobic bonds, hydrophilic interactions, and interactions with the backbone. The penalty is composed of the number of points in the three-dimensional lattice occupied by more than one element (NC, the number of collisions), multiplied by the penalty weight (PenaltyValue). The blocks named "Interactions calculation", "Collisions detection" and "Energy calculation" compute the energy of each type of interaction (see Figure 4 (right) for a visual representation), the number of collisions between elements, and the free energy (H), respectively. Finally, the block called "Fitness Calculation" computes the fitness function. It is important to note that, in the current version of the system, due to hardware limitations, all energies are computed using a sequential procedure that compares the coordinates of all elements of the protein. As the length of the sequences increases, the demand for hardware resources increases accordingly.
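The penalty term can be sketched directly from the formula above (our own Python illustration; the default `penalty_value` is an arbitrary placeholder, since the paper does not state the weight used).

```python
from collections import Counter

def num_collisions(coords):
    """NC: number of lattice points occupied by more than one element."""
    occupancy = Counter(coords)
    return sum(1 for c in occupancy.values() if c > 1)

def fitness(H, coords, penalty_value=10.0):
    # fitness = H - (NC * PenaltyValue), maximised by the HS algorithm
    return H - num_collisions(coords) * penalty_value

# Two elements sharing (0, 0, 0): one colliding lattice point.
coords = [(0, 0, 0), (1, 0, 0), (0, 0, 0), (0, 1, 0)]
print(fitness(5.0, coords))   # 5.0 - 1 * 10.0 = -5.0
```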
5 Experiments and Results
All hardware experiments in this work were run on a NIOS II Development Kit with an Altera Stratix II EP2S60F672C5ES FPGA device, using a 50 MHz internal clock. The experiments for the software version were run on a desktop computer with an Intel Core2 Quad processor at 2.8 GHz, running Linux. In the experiments reported below, the following synthetic sequences were used [2], with 20, 27, 31, 36 and 48 amino acids, respectively: (HP)2PH2PHP2HP
372
C.M. Vargas Benítez et al.
Table 1. Comparative performance of the several approaches (processing times tp, in seconds)

 n  | tpNIOS  | tpNIOS-HW | tpSW  | tpHW
----+---------+-----------+-------+------
 20 |  557.3  |   54.0    |  6.5  |  1.6
 27 |  912.8  |   75.0    |  7.7  |  3.0
 31 | 1186.8  |   87.3    |  7.9  |  4.0
 36 | 1460.5  |  107.7    |  9.4  |  5.0
 48 | 2414.9  |  174.8    | 13.44 | 10.0
H2P(PH)2; H3P2H4P3(HP)2PH2P2HP3H2; (HHP)3H(HHHHHPP)2H7; PH(PPH)11P; HPH2P2H4PH3P2H2P2HPH3(PH)2HP2H2P3HP8H2. In this work, no specific procedure was used to adjust the running parameters of the HS algorithm. Factorial experiments and self-adjusting algorithm parameters [10] are frequently used in the literature, but these issues fall outside the focus of this work. Instead, we used the default parameters suggested in the literature: MI = 100000, HMS = 20, PAR = 30%, FW = 5 and HMCR = 90%. It is important to recall that the main objective of this work is to decrease the processing time of protein folding simulations using the 3D-HP-SC model. Each developed approach was applied to the sequences mentioned before, and the results are shown in Table 1. In this table, the first column gives the sequence length; columns tpNIOS, tpNIOS-HW, tpSW and tpHW show the total elapsed processing time for, respectively, the NIOS II approach, the NIOS II with the “Fitness Calculation System” block, the software approach, and the hardware-based system without an embedded processor. Overall, the processing time of any approach is a function of the length of the sequence, possibly growing exponentially as the number of amino acids increases. This fact, by itself, strongly suggests the need for highly parallel approaches for dealing with the PFP.
In order to facilitate the comparison of performance between the approaches, Figure 5 presents the speedups obtained (each speedup is the slower time divided by the faster time), where:

– Spa = tpNIOS/tpNIOS-HW: speedup of the NIOS II with the “Fitness Calculation System” block relative to the plain NIOS II approach;
– Spb = tpNIOS-HW/tpSW: speedup of the software relative to the NIOS II with the “Fitness Calculation System” block;
– Spc = tpNIOS-HW/tpHW: speedup of the hardware-based system without embedded processor relative to the NIOS II with the “Fitness Calculation System” block;
– Spd = tpSW/tpHW: speedup of the hardware-based system without embedded processor relative to the software for desktop computers.
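The reported speedup ranges can be reproduced directly from Table 1. A small C++ sketch, with the times copied from Table 1 for the smallest and largest sequences (taking each speedup as the slower time divided by the faster time):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Processing times (seconds) from Table 1 for n = 20 and n = 48.
struct Row { int n; double tpNIOS, tpNIOS_HW, tpSW, tpHW; };

inline std::vector<Row> table1() {
    return { {20,  557.3,  54.0,   6.5,  1.6},
             {48, 2414.9, 174.8, 13.44, 10.0} };
}

// Speedup of the faster approach over the slower one.
inline double speedup(double slower, double faster) { return slower / faster; }
```

For example, for n = 20 the hardware system is about 34x faster than the NIOS II with the fitness block (54.0/1.6), while for n = 48 its advantage over the software shrinks to about 1.3x (13.44/10.0), consistent with the trends discussed below.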
Reconfigurable Computing for Protein Folding Using Harmony Search
373

[Figure 5 plot: speedup (y-axis, 0–35) versus sequence length n (x-axis, 20–48), one curve each for Spa, Spb, Spc and Spd.]

Fig. 5. Comparison of speedups between the approaches
The NIOS II version presented the worst performance (i.e., the highest processing time) among all implementations. Its processing time was larger than that of the software approach due to the low frequency of its internal clock (compared with the desktop processor). It is also observed that the NIOS II with the “Fitness Calculation System” block achieved a significant speedup over the plain NIOS II approach, ranging from 10x to 13x depending on the length of the sequence, mainly because of the number of clock cycles needed to execute each instruction in the NIOS II processor. The hardware-based system without the embedded processor showed the best performance, mainly due to its several levels of parallelism, namely in the Harmony Memory initialization, in the improvisation step, and in the parallel evaluation of several fitness functions. This approach was significantly better than the remaining hardware-based approaches, achieving a speedup ranging from 17x to 34x, again depending on the length of the sequence. When compared with the software approach, it achieved speedups ranging from 1.5x to 4.1x. The speedup decreases as the length of the sequences grows, due to the sequential procedure used to compute the energy of each type of interaction (as mentioned in Section 4).
6 Conclusions and Future Work
The PFP is still an open problem for which there is no closed computational solution. As mentioned before, even the simplest discrete model of the PFP leads to an NP-complete problem [3], thus justifying the use of metaheuristic methods and parallel computing. While most works have used the 2D- and 3D-HP models, the 3D-HP-SC model is still poorly explored (see [2]), despite being more expressive from the biological point of view. Improvements will be made in future versions of the hardware-based system without the embedded processor, such as the full parallelization of
the energy computation. Also, future work will investigate hardware versions of other evolutionary computation approaches, such as Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO) or the traditional Genetic Algorithm (GA), applied to the PFP, so as to develop parallel hybrid versions and different parallel topologies. Regarding the growth in hardware resource usage, future work will consider larger devices or multi-FPGA boards. Overall, the results lead to interesting insights and suggest the continuation of this work. We believe that the use of reconfigurable computing for the PFP with the 3D-HP-SC model is very promising for this area of research.
References

1. Anfinsen, C.B.: Principles that govern the folding of protein chains. Science 181(96), 223–230 (1973)
2. Benítez, C.M.V., Lopes, H.S.: Hierarchical parallel genetic algorithm applied to the three-dimensional HP side-chain protein folding problem. In: Proc. of the IEEE Int. Conf. on Systems, Man and Cybernetics, pp. 2669–2676 (2010)
3. Berger, B., Leighton, F.T.: Protein folding in the hydrophobic-hydrophilic HP model is NP-complete. Journal of Computational Biology 5(1), 27–40 (1998)
4. Dandass, Y.S., Burgess, S.C., Lawrence, M., Bridges, S.M.: Accelerating string set matching in FPGA hardware for bioinformatics research. BMC Bioinformatics 9(197) (2008)
5. Dill, K.A., Bromberg, S., Yue, K., Fiebig, K.M., et al.: Principles of protein folding – a perspective from simple exact models. Protein Science 4(4), 561–602 (1995)
6. Geem, Z.W., Kim, J.-H., Loganathan, G.V.: A new heuristic optimization algorithm: Harmony search. Simulation 76(2), 60–68 (2001)
7. Armstrong Junior, N.B., Lopes, H.S., Lima, C.R.E.: Preliminary steps towards protein folding prediction using reconfigurable computing. In: Proc. 3rd Int. Conf. on Reconfigurable Computing and FPGAs, pp. 92–98 (2006)
8. Li, M.S., Klimov, D.K., Thirumalai, D.: Folding in lattice models with side chains. Computer Physics Communications 147(1), 625–628 (2002)
9. Lopes, H.S.: Evolutionary algorithms for the protein folding problem: A review and current trends. In: Smolinski, T.G., Milanova, M.G., Hassanien, A.-E. (eds.) Computational Intelligence in Biomedicine and Bioinformatics. SCI, vol. 151, pp. 297–315. Springer, Heidelberg (2008)
10. Maruo, M.H., Lopes, H.S., Delgado, M.R.B.: Self-adapting evolutionary parameters: Encoding aspects for combinatorial optimization problems. In: Raidl, G.R., Gottlieb, J. (eds.) EvoCOP 2005. LNCS, vol. 3448, pp. 154–165. Springer, Heidelberg (2005)
11. Ramdas, T., Egan, G.: A survey of FPGAs for acceleration of high performance computing and their application to computational molecular biology. In: Proc. of the IEEE TENCON, pp. 1–6 (2005)
12. Sung, W.-T.: Efficiency enhancement of protein folding for complete molecular simulation via hardware computing. In: Proc. 9th IEEE Int. Conf. on Bioinformatics and Bioengineering, pp. 307–312 (2009)
13. Xia, F., Dou, Y., Lei, G., Tan, Y.: FPGA accelerator for protein secondary structure prediction based on the GOR algorithm. BMC Bioinformatics 12, S5 (2011)
Clustering Nodes in Large-Scale Biological Networks Using External Memory Algorithms

Ahmed Shamsul Arefin1, Mario Inostroza-Ponta2, Luke Mathieson3, Regina Berretta1,4, and Pablo Moscato1,4,5,*

1 Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
2 Departamento de Ingeniería Informática, Universidad de Santiago de Chile, Chile
3 Department of Computing, Faculty of Science, Macquarie University, Sydney, Australia
4 Hunter Medical Research Institute, Information Based Medicine Program, Australia
5 ARC Centre of Excellence in Bioinformatics, Callaghan, NSW, Australia
{Ahmed.Arefin,Regina.Berretta,Pablo.Moscato}@newcastle.edu.au,
[email protected],
[email protected] Abstract. Novel analytical techniques have dramatically enhanced our understanding of many application domains including biological networks inferred from gene expression studies. However, there are clear computational challenges associated to the large datasets generated from these studies. The algorithmic solution of some NP-hard combinatorial optimization problems that naturally arise on the analysis of large networks is difficult without specialized computer facilities (i.e. supercomputers). In this work, we address the data clustering problem of large-scale biological networks with a polynomial-time algorithm that uses reasonable computing resources and is limited by the available memory. We have adapted and improved the MSTkNN graph partitioning algorithm and redesigned it to take advantage of external memory (EM) algorithms. We evaluate the scalability and performance of our proposed algorithm on a well-known breast cancer microarray study and its associated dataset. Keywords: Data clustering, external memory algorithms, graph algorithms, gene expression data analysis.
1 Introduction
The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. A number of algorithms, techniques and applications have been proposed to obtain useful information from various types of biological networks. Data clustering is perhaps the most common and widely used approach for global network analysis; it helps to uncover important functional modules in the network. Numerous clustering algorithms for analyzing biological networks have been developed. These traditional algorithms/tools work well on moderate-size networks and can produce
Corresponding author.
Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 375–386, 2011. © Springer-Verlag Berlin Heidelberg 2011
376
A.S. Arefin et al.
informative results. However, the size and number of biological networks are continuously growing, due to extensive data integration from newly discovered biological processes and to novel microarray techniques that also consider ncRNAs. To handle large-scale networks, existing algorithms need to scale well and must be re-implemented using cutting-edge software and hardware technologies. In this work, we have enhanced and re-implemented a graph-based clustering algorithm known as MSTkNN, proposed by Inostroza-Ponta et al. [1], to tackle the task of clustering large-scale biological networks. Given a weighted undirected graph G (or, in its special case, a non-negative square matrix of distances among a set of objects, i.e. a complete weighted graph), the MSTkNN algorithm starts by building a proximity graph. It is defined as having the same set of nodes as the original graph and, as its edge set, the intersection of the edges of the minimum spanning tree (MST(G)) and the k-nearest-neighbor graph (kNN(G)). Gonzáles et al. [2] also used this proximity graph, with k = ⌊ln(n)⌋, where n is the number of nodes. In the MSTkNN algorithm, the value of k is determined automatically, and a recursive procedure partitions the graph until a stopping criterion halts the recursive partition of a cluster [3]. MSTkNN does not require any fixed parameter (e.g., a predetermined number of clusters) and performs better than some other classical clustering algorithms (e.g., K-Means and SOMs) in terms of homogeneity and separation [3], despite not using an explicitly defined objective function. In addition, it performs well even if the dataset has clusters of different mixed types (i.e., MSTkNN is not biased to “prefer” convex clusters). We propose here a different approach that allows the basic idea of MSTkNN to be practically applicable to large datasets.
In the worst-case situation, the input is a similarity/dissimilarity matrix at the start of the computation and, for a very large data set, this matrix may not fit in the computer's internal memory (in-memory) or even in the computer's external memory (EM). In order to overcome this problem, given G, we compute and store only a qNN graph (with q = k+1) of the similarity matrix and compute its MST (i.e., MST(qNN)). Additionally, we annotate each edge of MST(qNN) with a non-negative integer value which is a function of the relative distance between the two nodes of that edge and their nearest neighbors. Finally, we recursively partition the MST(qNN) using this set of edge annotations to produce the clusters. Unlike the MSTkNN in [1], we compute the MST only once, instead of at each recursive step, and we show that our clustering result is identical to that of the previously proposed algorithm. We have implemented our proposed algorithm by adapting the EM algorithmic approaches presented in [4-6], which give us an excellent performance improvement over the previous implementation. EM algorithms are very efficient when most of the data needs to be accessed from external memory; they improve the running time by reducing the number of I/Os between internal and external memory. Further details on EM algorithms can be found in [7]. Additionally, we now have the benefit of employing parallel and distributed computing to calculate the similarity/distance matrix and to compute the qNN graph, which has made our data pre-processing reasonably fast on large data sets.
2 Related Work
Several graph-based clustering algorithms/tools have been developed in the past years, and their advantages for analysing biological networks are clearly demonstrated in several publications [1, 8-9]. We can see graph-based clustering as a general domain of problems in which the task is often posed as an optimization problem (generally defined on a weighted graph). Given the graph, it is partitioned using certain predefined conditions. Each partition, representing a subgraph/component of the graph, is either further partitioned or reported as a cluster, based on certain stopping criteria and guided by an objective function. In Table A.1, we present a brief list of known graph-based clustering algorithms/tools for biological data sets, along with the maximum test data set sizes reported in the relevant published literature. It is clear from the table that traditional graph-based clustering algorithms can serve as a primary/first tool for analyzing biological networks. However, new algorithms, designed with more advanced technologies, are necessary to deal with larger data sets. Surprisingly, EM algorithms, which are very convenient for handling massive data sets, have not yet been applied to clustering biological networks. We have found only a few attempts in the published literature that exploit EM algorithms in bioinformatics, all of which seem to be related to sequence searching [10-11]. There exist several graph-based EM algorithms [12-13] that could be further investigated for their applicability to biological networks. In this work, we have adapted the EM computation of minimum spanning trees (EM MST) [4] and connected components (EM CC) [5-6]. These algorithms are capable of handling sparse graphs with up to billions of nodes.
3 Methods

3.1 The Original MSTkNN Algorithm
The original MSTkNN algorithm, presented in [3], takes an undirected complete graph G and computes two proximity graphs: a minimum spanning tree (GMST) and a k-nearest-neighbor graph (GkNN), where the value of k is determined by:

k = min{ ⌊ln(n)⌋ ; min k such that GkNN is connected }   (1)
Subsequently, the algorithm inspects all edges in GMST. If, for a given edge (x,y), neither x is one of the k nearest neighbors of y nor y is one of the k nearest neighbors of x, the edge is eliminated from GMST. This results in a new graph G′ = GMST − {(x,y)}. Since GMST is a tree, after the first edge is deleted G′ becomes a forest. The algorithm continues applying the same procedure to each subtree in G′ (with the value of k re-adjusted to k = ⌊ln(n)⌋, where n is now the number of nodes in each subtree), until no further partition is possible. The final partition of the nodes of G′ induced by the forest is the result of the clustering algorithm.
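The edge-elimination test above can be sketched in a few lines of C++. This is an illustrative in-memory sketch only: it materialises the full distance matrix, which is exactly what the EM variant described later avoids; `kNearest` and `keepEdge` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Returns the k nearest neighbours of node x under the full symmetric
// distance matrix D (illustrative only; not how the EM version works).
std::set<int> kNearest(const std::vector<std::vector<double>>& D, int x, int k) {
    std::vector<int> idx;
    for (int i = 0; i < (int)D.size(); ++i)
        if (i != x) idx.push_back(i);
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return D[x][a] < D[x][b]; });
    int m = std::min(k, (int)idx.size());
    return std::set<int>(idx.begin(), idx.begin() + m);
}

// MSTkNN rule: keep an MST edge (x,y) iff y is among x's k nearest
// neighbours, or x is among y's.
bool keepEdge(const std::vector<std::vector<double>>& D, int x, int y, int k) {
    return kNearest(D, x, k).count(y) || kNearest(D, y, k).count(x);
}
```

With k = ⌊ln(n)⌋ for the current component, every MST edge failing `keepEdge` is deleted, splitting the tree into a forest.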
3.2 MSTkNN+: The Modified MSTkNN Algorithm
The original MSTkNN algorithm requires n × (n − 1)/2 distance values (between all pairs of the n elements) as input. For a large data set, this may be too large to fit in the computer's internal memory and, for even larger values of n, it may not even fit in external memory. Even if we could store the distance matrix in external memory, the computation would slow down dramatically because of the increased number of I/O operations. Therefore, we modified this step: instead of creating the complete graph from the distance matrix, we compute a q-nearest-neighbor graph (GqNN), where q = ⌊ln(n)⌋ + 1. This procedure reduces the input graph size, but still captures a reasonable clustering structure of the data set. The value of q is determined from the inclusion relationship [2] between GMST and the family of nested graphs GkNN with k > ln(n). Then, we compute the MST of the GqNN graph, which we call GMSTp. We first annotate each edge in GMSTp according to the following procedure. For each edge (a,b) in E(GMSTp) we assign an integer value p as follows: let f(a,b) be the index of b in the sorted list of nearest neighbors of a in GqNN. The value of p is given by:

p = min{f(a,b), f(b,a)}   (2)
We define the maximum value of p in GMSTp (or any of its components) as pmax, and then we partition GMSTp with the following criteria:

C1. If p > ⌊ln(n)⌋, we remove the edge;
C2. If pmax < ⌊ln(n)⌋, we remove the edges with weight pmax − 1; and
C3. If pmax = 1 or pmax = ⌊ln(n)⌋, we do not remove any edge: the component is a “cluster”.
The final output of our algorithm is a set of partitions, or clusters, of the input data. The algorithm does not require any pre-determined value for q, but it is obviously possible to change the threshold from ⌊ln(n)⌋ to any other user-defined parameter. The algorithm can be understood as a recursive procedure (see below):

Algorithm 1. PRE-MSTkNN+ (D: distance matrix)
1: Compute GqNN.
2: Compute GMSTp = MST(GqNN).

Algorithm 2. PRUNE-MSTkNN+ (GMSTp)
1: G′ = partition GMSTp using the criteria C1, C2 and C3 described above.
2: c = connectedComponent(G′)
3: If c > 1 then
4:   Gcluster = ∪_{i=1..c} PRUNE-MSTkNN+(components(G′_i))
5: End if
6: Return Gcluster
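The pruning step (line 1 of Algorithm 2) can be sketched in-memory as follows. This is a literal reading of criteria C1–C3 for a single component; the actual implementation operates on external-memory structures, and the `Edge` struct and `edgesToCut` name are illustrative, not the authors' code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// One MST edge of a component, annotated with p = min{f(a,b), f(b,a)}.
struct Edge { int a, b, p; };

// Returns the indices of the edges to delete from a component with n nodes,
// applying C1-C3 as stated in the text.
std::vector<int> edgesToCut(const std::vector<Edge>& mst, int n) {
    int limit = (int)std::floor(std::log((double)n));   // floor(ln(n))
    int pmax = 0;
    for (const auto& e : mst) pmax = std::max(pmax, e.p);
    std::vector<int> cut;
    if (pmax == 1 || pmax == limit) return cut;         // C3: keep as a cluster
    for (int i = 0; i < (int)mst.size(); ++i) {
        if (mst[i].p > limit) cut.push_back(i);         // C1
        else if (pmax < limit && mst[i].p == pmax - 1)  // C2 (as stated)
            cut.push_back(i);
    }
    return cut;
}
```

Each deletion splits the component, and the recursion of Algorithm 2 re-applies the test with n reset to the new component's size.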
The function connectedComponent() gives the number of components in G′, and the function components() identifies and returns each of the components. Unlike the original algorithm in [1], we compute the MST only once (at the beginning), instead of at each recursive step. This change also gives a significant speed-up in run-time performance over the previous algorithm. The following lemma proves that this approach is sound (i.e., a partitioned subtree is also an exact MST of the corresponding component of the complete graph):

Lemma 1. Let T be a minimum spanning tree for a weighted graph G. If we select an edge e from T and partition the graph according to the subgraphs induced by the subtrees obtained by excluding e from T, these subtrees are also minimum spanning trees for the respective subgraphs.

Proof. Let T be a minimum spanning tree for a graph G. Let T be partitioned into two subtrees A and B with vertex and edge sets V(A), V(B), E(A) and E(B), respectively. Furthermore, let V(A) ∩ V(B) = ∅ and V(A) ∪ V(B) = V(G), and let A and B be connected by a single edge e in T. Now consider the graph G[V(A)] and let T′ be a minimum spanning tree for G[V(A)]. We define the weight w of a spanning tree to be the sum of the weights of its edges, and extend this in the natural way to any subtree. Then w(T) = w(A) + w(B) + w(e). Now, assume that w(T′) < w(A). Then we could replace the subtree A with T′ and join it to B using e. As V(A) and V(B) are disjoint we cannot introduce any cycles; therefore T′ joined with B via e must be a tree and, further, a spanning tree for G. However, this new tree would have weight less than w(T), contradicting the minimality of T. Therefore, no such T′ can exist. □
The main advantage of this algorithm over all other MST-based graph clustering algorithms (for example, [8-9]) is that it prunes the MST edges using local connectivity, instead of using the exact distance between the two nodes of an edge (e.g., deleting the longest edge). Our algorithm can produce better results in terms of local connectivity (i.e., homogeneity), which is a desirable characteristic when clustering biological networks.

3.3 Implementation
The Test Environment. The computational tests were performed on a 16-node cluster computer (Intel Xeon 5550 processors, 2.67 GHz, 8 cores each), and the programs were written in C++ with the support of the STL, STXXL1 and Boost2 libraries and compiled using the g++ 4.4.4 compiler on a Linux OS, kernel ver. 2.6.9.

Parallel/Distributed NN graph computation. To compute the distance matrix, we use a message-passing interface (MPI) to distribute the data set (row-wise) onto P parallel processors and then initiate the parallel computation of the distance metric in each of them using OpenMP (Open Multi-Processing). The method for efficiently distributing the computation of the upper/lower triangle of the symmetric similarity matrix will be discussed later.

1 http://stxxl.sourceforge.net/
2 http://www.boost.org/
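Since the paper defers its exact distribution scheme, the following is only a hypothetical sketch of one common way to balance a row-wise split of the upper triangle of a symmetric n × n matrix across P workers; `assignRows` is an illustrative name, not the authors' method.

```cpp
#include <cassert>
#include <vector>

// Row i of the upper triangle holds n-1-i entries, so pairing row i with
// row n-1-i gives every pair a near-constant workload of about n-1 entries.
// Returns, for each of the P workers, the list of rows it should compute.
std::vector<std::vector<int>> assignRows(int n, int P) {
    std::vector<std::vector<int>> owner(P);
    int lo = 0, hi = n - 1, w = 0;
    while (lo < hi) {                          // hand out (short, long) row pairs
        owner[w % P].push_back(lo++);
        owner[w % P].push_back(hi--);
        ++w;
    }
    if (lo == hi) owner[w % P].push_back(lo);  // middle row for odd n
    return owner;
}
```

Each worker would then compute its assigned rows with OpenMP threads, while MPI handles the inter-node distribution.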
The EM MST and CC computation. We compute the MST using the EM MST algorithm in [4]. The I/O complexity of this algorithm is O(sort(m)·log(n/M)), where n is the number of nodes of the original graph, m is the number of edges, M is the number of nodes that fit into the computer's internal memory, and sort(m) is the I/O complexity of sorting m edges. After partitioning the MST, we identify the connected components using the EM connected-components algorithm in [5-6], whose I/O complexity is O(m·log(log(n))). Unlike other clustering tools, we store the connected components/clusters in external memory and keep only the list of components in the computer's internal memory. This avoids excessive use of in-memory even when there is a large number of components or clusters. Additionally, we tuned the implementations of the adapted algorithms [4-6] for better performance on denser graphs.
4 Results

4.1 Data Description
We used two different data sets to demonstrate the performance of our proposed EM algorithm, MSTkNN+. The first data set is used to illustrate the algorithm and contains a distance matrix between 10 Australian cities. The second is a breast cancer gene-expression data set from a study by van de Vijver et al. [14]. This microarray dataset contains the expression of 24,158 probe sets in 295 primary breast cancer patients. The data set also contains the clinical metastasis information (in terms of years to relapse) for all patients. We also create a third, larger dataset from van de Vijver et al. [14] as follows. First, we filter the probe sets using Fayyad and Irani's algorithm [15]. This step is supervised and aims at finding probe sets differentially expressed between the samples labeled “metastasis” and those labeled “non-metastasis”. The latter label does not mean that these patients had no relapse; rather, “non-metastasis” indicates that the patient had no relapse within five years of the initial diagnosis, although a metastasis may still have appeared during the study, up to 14 years later in one case. Next, we use a feature selection algorithm to refine the selection of probe sets using the (alpha-beta)-k-Feature Set methodology [16], obtaining a set of 876 probe sets. Finally, we produce a new large data set by subtracting the expression values of each possible pair of probes. These unique probe pairs are termed metafeatures, as in Rocha de Paula et al. [17]. The result is an artificial data set with 384,126 elements, comprising all the filtered probes and all metafeatures.

4.2 Application on the City Distance Data Set
Our first application is on a distance matrix that we created from the distances among 10 Australian cities (given in Table A.2). We first create a qNN graph from the data set (see Table A.3) for q = 3 and an MSTp, in which we annotate each edge with an integer value p as described in Equation (2). For example (see Figure 1(a) and Table A.3), Adelaide is the third nearest neighbor of Melbourne
Fig. 1. (a) The MSTp created from 10 Australian cities (actual locations of the cities on the map are schematic). The edge between “Albany” and “Adelaide” is the candidate for deletion, as its neighborhood value p > ⌊ln(10)⌋ = 2. (b) In the first iteration of MSTkNN+, the edge between “Katherine” and “Adelaide” is the next candidate for deletion, as p > ⌊ln(7)⌋ = 1, where 7 is the number of elements in that component. (c) Final clustering result.
and Melbourne is the first nearest neighbor of Adelaide. Therefore, we give a weight of 1 (the minimum) to the edge that connects Adelaide and Melbourne. Finally, we prune the MST edges using the criteria C1, C2 and C3 on each of the components. The result of our algorithm is presented in Figure 1(c).

4.3 Application on the Breast Cancer Data Set
Our second application is on the breast cancer dataset, which contains the gene expression values measured on 24,158 probe sets for 295 primary breast cancer patients [14]. We first compute a similarity matrix using Pearson's correlation and create a qNN graph that contains 24,158 vertices and 265,738 edges. Next, we create the MSTp. Finally, we apply our proposed algorithm to partition the MSTp and obtain the clusters (see Figure 2).
Fig. 2. Visualization of the clusters from the breast cancer data set in [14]. Some genes of interest are highlighted.
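The pairwise similarity used above, Pearson's correlation between two probes' expression profiles across patients, can be sketched as follows (a minimal serial version; the paper computes the full matrix in parallel with MPI/OpenMP):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Pearson's correlation coefficient between two equal-length profiles.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    int n = (int)x.size();
    double mx = 0, my = 0;
    for (int i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    double sxy = 0, sxx = 0, syy = 0;
    for (int i = 0; i < n; ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return sxy / std::sqrt(sxx * syy);   // in [-1, 1]
}
```

For each probe, only the q = ⌊ln(n)⌋ + 1 most similar neighbours are retained when building the qNN graph.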
Additionally, we used iHOP3 to assess the importance of the genes occupying central regulatory positions in some of the clusters (see Figure 3). Our results show that many of the genes in central positions have already been discussed in connection with breast cancer and its progression (see Table 1). The genes with fewer published papers can also be further investigated, based on their conspicuous position in the clustering and their adjacency to genes that have already been implicated in breast cancer.

Table 1. The number of publications associated with some of the observed most central genes, obtained using iHOP and PubMed to search for the name of each gene and its aliases together with the words “breast” and “cancer” (ordered by gene symbol; highly referenced genes are in bold face)

Gene Symbol | Gene Name                          | Breast | Cancer
------------+------------------------------------+--------+-------
COPS8       | COP9 constitutive photomorphogen   | 10     | 57
CPNE1       | copine I                           | 1      | 3
ESR1        | estrogen receptor 1                | 17,352 | 28,250
EST         | mitogen-activated protein kinase 8 | 165    | 879
FGF8        | fibroblast growth factor 8         | 27     | 156
FOXA1       | forkhead box A1                    | 60     | 120
GATA3       | GATA binding protein 3             | 219    | 1399
GPR35       | G protein-coupled receptor         | 0      | 2
HAPLN3      | hyaluronan and proteoglycan link 3 | 1      | 1
HIC2        | hypermethylated in cancer          | 13     | 122
LOC729589   | hypothetical LOC729589             | 0      | 0
MTNR1A      | melatonin receptor 1A              | 194    | 1193
NCOA7       | nuclear receptor coactivator 7     | 1      | 3
PLEKHA3     | pleckstrin homology domain 3       | 0      | 2
PLK1        | polo-like kinase 1                 | 49     | 458
SPAST       | spastic paraplegia 4               | 0      | 3

4.4 Application on an Expanded Breast Cancer Data Set with 384,126 Vertices and 4,993,638 Edges
Finally, we apply our proposed algorithm (MSTkNN+) to a large-scale “artificial” data set that is an expanded version of the breast cancer data set [14]. This data set has 384,126 elements (the values of 383,250 metafeatures together with the values of the 876 probe sets obtained by filtering the original data set). We also include the clinical metastasis information as a “phenotypical dummy probe set”. As previously described, we first create the qNN graph, containing 384,126 vertices and 4,993,638 edges. Next, we apply MSTkNN+ to find the clusters. Due to the limitations of existing visualization tools, it is impossible to provide a picture of the complete clustering. Instead, we present the group of metafeatures that cluster closely with the “phenotypical dummy probe set” (years to relapse), zooming in on a part that many would find very interesting (see Figure 3). We find one metafeature (BCAR1-SLC40A1) that correlates better with the metastasis information than either of the individual probe sets alone (e.g., genes BCAR1 or SLC40A1; see Figure 4).

3 http://www.ihop-net.org/UniPub/iHOP/
Fig. 3. Partial visualization of the cluster that contains the clinical metastasis information as a phenotypical gene. The rectangular nodes indicate that the genes in these metafeatures share a common biological pathway (identified using GATHER4).
Fig. 4. The metafeature (BCAR1-SLC40A1) shows a better correlation with the clinical metastasis values of each patient than either feature alone (i.e., BCAR1, Breast Cancer Anti-estrogen Resistance 1, or SLC40A1, Ferroportin-1)
It is also interesting to note the presence of SLC40A1 in three of the metafeatures co-expressed with the time-to-relapse values (the clinical metastasis “dummy probe set”). Jiang et al. suggested that “breast cancer cells up-regulate the expression of iron importer genes and down-regulate the expression of iron exporter SLC40A1 to satisfy their increased demand for iron” [18]. This indicates that, for those tumors that may relapse (and for which a different genetic signature may need to be found), the joint expression of BCAR1 and Ferroportin may be associated with time to relapse. Similarly, other identified metafeatures could be further investigated.
Performance Comparisons
We have compared the solutions of our clustering approach against K-Means, SOM, CLICK and the original MSTkNN [1], using the homogeneity and separation indexes that give us an idea of how similar the elements in a cluster and dissimilar among the clusters, respectively (See Table 5). We used the implementation of the K-Means, SOM and CLICK available in the Expander tool5 and the implementation of the MSTkNN in [1] is obtained from http://cibm.newcastle.edu.au. The averages of homogeneity (Havg) and separation (Savg) were computed as in [19] and the Pearson’s correlation is used as the metric for computing the similarity matrix.
4 5
http://gather.genome.duke.edu/ http://www.cs.tau.ac.il/~rshamir/expander/
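A hedged sketch of these two indexes in their common centroid-based form (the paper computes Havg and Savg "as in [19]", whose exact definitions may differ in detail; `homogeneityAvg` and `separationAvg` are illustrative names):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Cluster = std::vector<Vec>;

// Pearson correlation, the similarity measure used throughout this section.
double pearson(const Vec& x, const Vec& y) {
    int n = (int)x.size();
    double mx = 0, my = 0;
    for (int i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    double sxy = 0, sxx = 0, syy = 0;
    for (int i = 0; i < n; ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return sxy / std::sqrt(sxx * syy);
}

Vec centroid(const Cluster& c) {
    Vec m(c[0].size(), 0.0);
    for (const Vec& e : c)
        for (size_t j = 0; j < m.size(); ++j) m[j] += e[j];
    for (double& v : m) v /= (double)c.size();
    return m;
}

// Havg: mean similarity of each element to its cluster centroid (higher = tighter).
double homogeneityAvg(const std::vector<Cluster>& clusters) {
    double s = 0; int cnt = 0;
    for (const Cluster& c : clusters) {
        Vec ctr = centroid(c);
        for (const Vec& e : c) { s += pearson(e, ctr); ++cnt; }
    }
    return s / cnt;
}

// Savg: size-weighted mean similarity between cluster centroids (lower = better separated).
double separationAvg(const std::vector<Cluster>& clusters) {
    double s = 0, w = 0;
    for (size_t i = 0; i < clusters.size(); ++i)
        for (size_t j = i + 1; j < clusters.size(); ++j) {
            double wt = (double)(clusters[i].size() * clusters[j].size());
            s += wt * pearson(centroid(clusters[i]), centroid(clusters[j]));
            w += wt;
        }
    return s / w;
}
```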
Table 2. Performance comparisons with K-Means, SOM, CLICK and the original MSTkNN approach in terms of homogeneity and separation

Data               | Algorithm                    | Param. | Havg  | Savg   | #Clust. | Time (min) | Mem. (MB)
-------------------+------------------------------+--------+-------+--------+---------+------------+----------
Breast Cancer      | K-Means                      | K=41   | 0.521 | -0.186 | 41      | ~1         | ~250
Filtered n=876     | SOM                          | 3×3    | 0.501 | -0.015 | 9       | ~0.2       | ~200
                   | CLICK                        | -      | 0.538 | -0.281 | 8       | ~0.5       | ~250
                   | MSTkNN                       | -      | 0.287 | 0.386  | 41      | ~0.5       | ~250
                   | MSTkNN+                      | -      | 0.288 | 0.389  | 45      | ~0.3       | ~156
Complete n=24,158  | K-Means, SOM, CLICK          | -      | -     | -      | -       | -          | -
                   | MSTkNN                       | -      | 0.429 | 0.390  | 732     | ~12        | ~8,100
                   | MSTkNN+                      | -      | 0.430 | 0.398  | 745     | ~5^        | ~650†
Expanded n=384,126 | K-Means, SOM, CLICK, MSTkNN  | -      | -     | -      | -       | -          | -
                   | MSTkNN+ (ours)               | -      | 0.630 | 0.410  | 2,587   | ~15^       | ~1,500†

^ Does not include the time for computing the similarity matrix.
† Internal memory consumption can be pre-defined with EM environment parameters.
From Table 2, we can clearly see that MSTkNN succeeds in producing small, precise clusters from the filtered expression data (n=876). Even for the same number of clusters it gives better performance (i.e., higher homogeneity and lower separation values) than K-Means (even when we intentionally set K=41 in K-Means), SOM and CLICK. The proposed MSTkNN+ showed better performance in terms of homogeneity, time and memory usage, although the separation value increased slightly. For the complete breast cancer data set (n=24,158), only MSTkNN and our proposed algorithm were able to cluster the data set, with high and low in-memory usage, respectively; the other algorithms ran indefinitely on the test machine without completing. Finally, for the expanded breast cancer data set (n=384,126), only our proposed implementation MSTkNN+ could successfully cluster the whole data set, in 15 minutes and using a reasonable amount of main memory.
5 Conclusion and Future Work

In this paper, we have proposed a significant improvement to the existing MSTkNN-based clustering approach. Our implementation is faster (due to parallel/distributed pre-processing and algorithmic enhancements) and more memory-efficient and scalable (due to the EM implementation) than the one in [1]. The clusters identified by our approach are meaningful, precise and comparable with those of other state-of-the-art algorithms. Our future work includes the design and implementation of a nearest neighbor-based MST algorithm, so that we can eliminate the prohibitive computation of the similarity matrix when the data set is extremely large. Finding the nearest neighbors of a point in space is widely researched, and one way to do so is to build a kd-tree. Other approaches, such as GPU-based similarity matrix computation, can also help to accelerate the clustering process.
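To illustrate the kd-tree idea mentioned above, here is a minimal, toy 2-D kd-tree with a nearest-neighbor query. It is only a sketch of the data structure, not the nearest-neighbor-based MST algorithm the authors plan:

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree by splitting on the median along each axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Return the stored point closest to target, pruning far subtrees."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Only search the far side if the splitting plane is closer than the best hit
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best
```

The pruning step is what avoids the all-pairs distance computation: subtrees whose splitting plane lies farther than the current best candidate are never visited.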
Clustering Nodes in Large-Scale Biological Networks
385
References
1. Inostroza-Ponta, M.: An Integrated and Scalable Approach Based on Combinatorial Optimization Techniques for the Analysis of Microarray Data. PhD thesis, The University of Newcastle, Australia (2008)
2. Gonzalez-Barrios, J.M., Quiroz, A.J.: A clustering procedure based on the comparison between the k nearest neighbors graph and the minimal spanning tree. Statistics and Probability Letters 62(3), 23–34 (2003)
3. Inostroza-Ponta, M., Mendes, A., Berretta, R., Moscato, P.: An integrated QAP-based approach to visualize patterns of gene expression similarity. In: Randall, M., Abbass, H.A., Wiles, J. (eds.) ACAL 2007. LNCS (LNAI), vol. 4828, pp. 156–167. Springer, Heidelberg (2007)
4. Dementiev, R., Sanders, P., Schultes, D., Sibeyn, J.: Engineering an external memory minimum spanning tree algorithm. In: 3rd IFIP Intl. Conf. on Theoretical Computer Science, pp. 195–208 (2004)
5. Sibeyn, J.: External Connected Components. In: Hagerup, T., Katajainen, J. (eds.) SWAT 2004. LNCS, vol. 3111, pp. 468–479. Springer, Heidelberg (2004)
6. Schultes, D.: External memory spanning forests and connected components. Technical report (2004), http://algo2.iti.kit.edu/dementiev/files/cc.pdf
7. Vitter, J.S.: External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys 33 (2001)
8. Xu, Y., Olman, V., Xu, D.: Clustering Gene Expression Data Using a Graph-Theoretic Approach: An Application of Minimum Spanning Tree. Bioinformatics 18(4), 526–535 (2002)
9. Grygorash, O., Zhou, Y., Jorgensen, Z.: Minimum Spanning Tree Based Clustering Algorithms. In: Proc. of the 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2006), pp. 73–81. IEEE Computer Society, Washington, DC, USA (2006)
10. Doowang, J.: An external memory approach to computing the maximal repeats across classes of DNA sequences. Asian Journal of Health and Information Sciences 1(3), 276–295 (2006)
11. Choi, J.H., Cho, H.G.: Analysis of common k-mers for whole genome sequences using SSB-tree. Japanese Society for Bioinformatics 13, 30–41 (2002)
12. Chiang, Y., Goodrich, M.T., Grove, E.F., Tamassia, R., Vengroff, D.E., et al.: External-memory graph algorithms. In: SODA 1995: Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 139–149. Society for Industrial and Applied Mathematics, Philadelphia (1995)
13. Abello, J., Buchsbaum, A.L., Westbrook, J.R.: A functional approach to external graph algorithms. Algorithmica, 332–343 (1998)
14. van de Vijver, M.J., He, Y.D., van't Veer, L.J., Dai, H., et al.: A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med. 347(25) (2002)
15. Fayyad, U.M., Irani, K.B.: Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In: IJCAI, pp. 1022–1029 (1993)
16. Cotta, C., Sloper, C., Moscato, P.: Evolutionary Search of Thresholds for Robust Feature Set Selection: Application to the Analysis of Microarray Data. In: Raidl, G.R., Cagnoni, S., Branke, J., Corne, D.W., Drechsler, R., Jin, Y., Johnson, C.G., Machado, P., Marchiori, E., Rothlauf, F., Smith, G.D., Squillero, G. (eds.) EvoWorkshops 2004. LNCS, vol. 3005, pp. 21–30. Springer, Heidelberg (2004)
17. Rocha de Paula, M., Ravetti, M.G., Rosso, O.A., Berretta, R., Moscato, P.: Differences in abundances of cell-signalling proteins in blood reveal novel biomarkers for early detection of clinical Alzheimer's disease. PLoS ONE 6(e17481) (2011)
18. Jiang, X.P., Elliott, R.L., Head, J.F.: Manipulation of iron transporter genes results in the suppression of human and mouse mammary adenocarcinomas. Anticancer Res. 30(3), 759–765 (2010)
19. Shamir, R., Sharan, R.: CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. In: Proc. of ISMB, pp. 307–316 (2000)
Appendix

Table A.1. A list of known graph-based clustering algorithms/tools for biological networks⁶

Name       Approaches           Language      Max. test data (n)
cMonkey    Bi-clustering        R             2,993
GTOM       Topological overlap  R             4,000
SAMBA      Neighborhood search  C/C++         4,177
CAST       Affinity search      Matlab        6,000
NNN        Mutual NN search     Java          6,162
EXCAVATOR  MST                  C, Java       6,178
HCS        Minimum cut          Matlab, LEDA  7,800
MSTkNN     Intersect MST-kNN    Java          14,772
CLICK      Mincut               C/C++         29,600
Ncut-KL    Mincut               C/C++         40,703
TribeMCL   MCL                  (n/a)         80,000
MPI-MCL    MCL, dist. comp.     Fortran, MPI  125,008
Table A.2. A distance matrix in KMs for 10 Australian cities7 Canb. 0 240 473 967 3102 3141 1962 865 2838 3080
Canberra Sydney Melb. Adelaide Perth Darwin Katherine Hobart Albany Bunbury
Syd. 240 0 713 1163 3297 3153 2030 1060 3046 3282
Melb 473 713 0 654 2727 3151 1892 601 2436 2690
Adel. 967 1163 654 0 2136 2620 1330 1165 1885 2118
Perth 3102 3297 2727 2136 0 2654 1995 3017 392 156
Darwin 3141 3153 3151 2620 2654 0 1291 3743 2828 2788
Kath. 2870 2882 2885 2364 2562 271 0 3478 2702 2688
Hobart 865 1060 601 1165 3017 3743 2470 0 2678 2951
Albany 2838 3046 2436 1885 392 2828 1993 2678 0 279
Bunb. 3080 3282 2690 2118 156 2788 2688 2951 279 0
Table A.3. Three nearest neighbors (q = 3) for 10 Australian cities

City       q = 1            q = 2            q = 3
Canberra   Sydney (240)     Melbourne (473)  Hobart (865)
Sydney     Canberra (240)   Melbourne (713)  Hobart (1060)
Melbourne  Canberra (473)   Hobart (601)     Adelaide (654)
Adelaide   Melbourne (654)  Canberra (967)   Sydney (1163)
Perth      Bunbury (156)    Albany (392)     Adelaide (2136)
Darwin     Katherine (271)  Adelaide (2620)  Perth (2654)
Katherine  Darwin (1291)    Adelaide (1330)  Melbourne (1892)
Hobart     Melbourne (601)  Canberra (865)   Sydney (1060)
Albany     Bunbury (279)    Perth (392)      Adelaide (1885)
Bunbury    Perth (156)      Albany (279)     Adelaide (2118)
⁶ Details about the methods and test environments can be found in the relevant publications.
⁷ Computed using the distance tool at http://www.geobytes.com/citydistancetool.htm
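The neighbor lists of Table A.3 follow directly from the rows of Table A.2: for each city, sort the other cities by distance and keep the first q. A small sketch, using the Canberra row of Table A.2 (the helper name is ours):

```python
def k_nearest(dists, q=3):
    """Return the q nearest (city, distance) pairs from a distance dict."""
    return sorted(dists.items(), key=lambda kv: kv[1])[:q]

# Canberra's row of Table A.2 (distances in km)
canberra = {"Sydney": 240, "Melbourne": 473, "Adelaide": 967, "Perth": 3102,
            "Darwin": 3141, "Katherine": 2870, "Hobart": 865,
            "Albany": 2838, "Bunbury": 3080}
```

Calling `k_nearest(canberra)` yields Sydney (240), Melbourne (473) and Hobart (865), matching the Canberra row of Table A.3.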
Reconfigurable Hardware to Radionuclide Identification Using Subtractive Clustering

Marcos Santana Farias¹, Nadia Nedjah², and Luiza de Macedo Mourelle³

¹ Department of Instrumentation, Nuclear Engineering Institute, Brazil
[email protected]
² Department of Electronics Engineering and Telecommunications, State University of Rio de Janeiro, Brazil
[email protected]
³ Department of Systems Engineering and Computation, State University of Rio de Janeiro, Brazil
[email protected]

Abstract. Radioactivity is the spontaneous emission of energy from unstable atoms. Radioactive sources contain radionuclides. A radionuclide undergoes radioactive decay and emits gamma rays and subatomic particles, constituting the ionizing radiation. The gamma-ray energies of a radionuclide are used to determine the identity of the gamma emitters present in the source. This paper describes a hardware implementation of the subtractive clustering algorithm to perform radionuclide identification.

Keywords: Radionuclides, data classification, reconfigurable hardware, subtractive clustering.
1 Introduction

Radioactive sources contain radionuclides. A radionuclide is an atom with an unstable nucleus, i.e. a nucleus characterized by excess energy, which is available to be imparted. In this process, the radionuclide undergoes radioactive decay and emits gamma rays and subatomic particles, constituting the ionizing radiation. Radionuclides may occur naturally but can also be artificially produced [10]. So, radioactivity is the spontaneous emission of energy from unstable atoms. Correct radionuclide identification can be crucial for planning protective measures, especially in emergency situations, by defining the type of radiation source and its radiological hazard [6]. The gamma-ray energy of a radionuclide is a characteristic of the atomic structure of the material. This paper introduces a classification method for radioactive elements that allows rapid and efficient identification and can be implemented in portable systems. Our intention is to run a clustering algorithm on portable equipment to perform identification of radionuclides. Clustering algorithms consume considerable processing time when implemented in software, especially on processors for portable use, such as micro-controllers. Thus, a custom implementation in reconfigurable hardware is a good choice for embedded systems, which require real-time execution as well as low power consumption.

Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 387–398, 2011.
© Springer-Verlag Berlin Heidelberg 2011
The rest of this paper is organized as follows: first, in Section 2, we present the principles of nuclear radiation detection. In Section 3, we briefly review existing clustering algorithms and concentrate on the subtractive clustering algorithm. In Section 4, we describe the proposed architecture for the cluster-center calculator using the subtractive clustering algorithm. Thereafter, in Section 5, we present some performance figures to assess the efficiency of the proposed implementation. Last but not least, in Section 6, we draw some conclusions and point out directions for future work.
2 Radiation Detection

Radioactivity and ionizing radiation are not naturally perceived by the sense organs of human beings and cannot be measured directly. Therefore, detection is performed by analyzing the effects produced by radiation when it interacts with a material. There are three main types of ionizing radiation emitted by radioactive atoms: alpha, beta and gamma. Alpha and beta are particles that have mass and are electrically charged, while gamma rays, like x-rays, are electromagnetic waves. The emission of alpha and beta radiation is always accompanied by the emission of gamma radiation, so most detectors target gamma radiation. The gamma energy emitted by a radionuclide is a characteristic of the atomic structure of the material. The energy is measured in electronvolts (eV); one electronvolt is an extremely small amount of energy, so it is common to use kiloelectronvolts (keV) and megaelectronvolts (MeV). Consider, for instance, Cesium-137 (137Cs) and Cobalt-60 (60Co), which are two common gamma-ray sources. These radionuclides emit radiation at one or two discrete energies: Cesium-137 emits 0.662 MeV gamma rays, and Cobalt-60 emits 1.33 MeV and 1.17 MeV gamma rays. These energies are known as decay energies and define the decay scheme of the radionuclide. Each radionuclide, among many others, has a unique decay scheme by which it is identified [10]. When these emissions are collected and analyzed with a gamma-ray spectroscopy system, a gamma-ray energy spectrum can be produced. A detailed analysis of this spectrum is typically used to determine the identity of the gamma emitters present in the source; the gamma spectrum is characteristic of the gamma-emitting radionuclides contained in the source [11]. A typical gamma-ray spectrometry system (Fig. 1) consists of a scintillator detector device and a measurement system. The interaction of radiation with the system occurs in the scintillator detector, and the measurement system interprets this interaction. The scintillator detector emits light when gamma radiation transfers to it all or part of its energy. This light is detected by a photomultiplier optically coupled to the scintillator, which outputs an electrical signal whose amplitude is proportional to the deposited energy. Since these detectors provide an electrical signal proportional to the deposited energy, they allow the generation of the gamma energy spectrum of a radioactive element (a histogram). To obtain this spectrum, one uses a
Fig. 1. Gamma Spectrometry System - main components
multichannel analyzer, or MCA. The MCA consists of an ADC (Analog-to-Digital Converter), which converts the amplitude of the analog input into a number, or channel. Each channel is associated with a counter that accumulates the number of pulses with a given amplitude, forming a histogram. These data form the energy spectrum of the gamma radiation. Since different radionuclides emit radiation with different energy distributions, analysis of the spectrum can provide information on the composition of the radioactive source and allow its identification. Figure 2 shows a spectrum generated by simulation for a radioactive source with 137Cs and 60Co. The x-axis represents the channels of a 12-bit ADC. In this representation, the 4096 channels of the MCA correspond to 2.048 MeV in the energy spectrum. The first peak, in channel 1324, is characteristic of 137Cs (0.662 MeV). The second and third peaks are energies of 60Co.
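The MCA's binning step can be mimicked in a few lines of software: each pulse energy is mapped to one of 4096 channels and a counter is incremented. The energy scale below (2.048 MeV full scale, i.e. 0.5 keV per channel) follows the text; the function name and input format are our assumptions.

```python
def mca_histogram(energies_mev, full_scale_mev=2.048, n_channels=4096):
    """Accumulate pulse energies into MCA channels, forming a histogram."""
    counts = [0] * n_channels
    for e in energies_mev:
        # Map the energy linearly onto the channel range, clamping the top end
        ch = min(int(e / full_scale_mev * n_channels), n_channels - 1)
        counts[ch] += 1
    return counts
```

With this scale, the 0.662 MeV line of 137Cs maps to channel 1324, consistent with Fig. 2.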
Fig. 2. Energy spectrum simulated for a source with 137Cs and 60Co (x-axis: channels, 0 to 4500; y-axis: counts, 0 to 50)
The components and characteristics of a gamma spectrometry system (the type of detector, the detection time, the noise of the high-voltage source, the number of channels, the stability of the ADC, temperature changes) can affect the formation of the spectrum and the quality of the result. For this reason it is difficult to establish a system for automatic identification of radionuclides, especially for a wide variety of them. Equipment on the market, which differs in identification algorithms and in the number of identifiable radionuclides, does not perform well [6].
3 Clustering Algorithms

Clustering algorithms partition a collection of data into a certain number of clusters, groups or subsets. The aim of the clustering task is to group the data into clusters in such a way that the similarity between members of the same cluster is higher than that between members of different clusters. Clustering of numerical data forms the basis of many classification algorithms. Various clustering algorithms have been developed. One of the first and most commonly used is based on the Fuzzy C-means method (FCM). Fuzzy C-means is a clustering method that allows one piece of data to belong to two or more clusters. It was developed by Dunn [1] and improved by Hathaway [7], and is commonly used in pattern recognition. Yager and Filev [2] introduced the so-called mountain function as a measure of spatial density around the vertices of a grid, shown in (1):

M(v_i) = Σ_{j=1}^{n} e^{−α‖x_j − v_i‖²}    (1)

where α > 0 and M is the mountain function, calculated for the i-th vertex v_i during the first step; n is the total number of data points or samples, which are assumed to be available before the algorithm is initiated. The norm ‖·‖ denotes the Euclidean distance between the points used as arguments, and x_j is the current data point or sample. It is ensured that a vertex surrounded by many data points or samples will have a high value for this function and, conversely, a vertex with no neighboring data point or sample will have a low value for the same function. It should be noted that this function is used only during the first step, with the whole set of available data. During the subsequent steps, the function is redefined by subtracting a value proportional to the peak value of the mountain function. A very similar approach is the subtractive clustering (SC) proposed by Chiu in [3]. It uses the so-called potential value defined as in (2):

P_i = Σ_{j=1}^{n} e^{−α‖x_j − x_i‖²}, where α = 4/r_a²    (2)

wherein P_i is the potential value of the i-th data point as a cluster center, x_i is the data point, and r_a is a positive constant called the cluster radius.
The potential value associated with each data point depends on its distance to all its neighbors. Considering (2), a data point or sample that has many points or samples in its neighborhood will have a high potential value, while a remote data point or sample will have a low potential value. After calculating the potential of each point or sample, the one, say x_i^*, with the highest potential value, say P_i^*, is selected as the first cluster center. Then the potential of each point is reduced as defined in (3); this avoids closely spaced clusters. Until the stopping criterion is satisfied, the algorithm continues selecting centers and revising potentials iteratively:

P_i = P_i − P_i^* e^{−β‖x_i − x_i^*‖²}    (3)

In (3), β = 4/r_b², where r_b represents the radius of the neighborhood within which significant potential revision will occur. The data points or samples that are near the first cluster center x_i^* will have significantly reduced density measures, making them unlikely to be selected as the next cluster center. The subtractive clustering algorithm can be briefly described by the following 4 main steps:

– Step 1: Using (2), compute the potential P_i for each point or sample, 1 ≤ i ≤ n;
– Step 2: Select the data point or sample x_i^* with the highest potential value P_i^*;
– Step 3: Revise the potential value of each data point or sample according to (3). Find the new maximum value maxP_i;
– Step 4: If maxP_i ≤ ε P_i^*, wherein ε is the reject ratio, terminate the algorithm; otherwise, take the data point or sample with the highest potential value as the next center and return to Step 3.

The main advantage of this method is that the number of clusters or groups is not predefined, as it is in the Fuzzy C-means method, for instance. Therefore, this method is suitable for applications where one does not know, or does not want to assign, an expected number of clusters a priori. This is the main reason for choosing this method for the identification of radionuclides.
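A minimal software sketch of the four steps above, for one-dimensional data in plain Python. The parameter choices (r_b = 1.5 r_a and reject ratio ε = 0.15) are common defaults for subtractive clustering, not values taken from the paper:

```python
import math

def subtractive_clustering(points, ra=1.0, eps=0.15):
    """Select cluster centers by the 4-step subtractive clustering scheme."""
    alpha = 4.0 / ra ** 2          # as in (2)
    rb = 1.5 * ra                  # revision radius, a common choice
    beta = 4.0 / rb ** 2           # as in (3)
    # Step 1: potential of every point
    P = [sum(math.exp(-alpha * (x - y) ** 2) for y in points) for x in points]
    centers = []
    p_first = max(P)               # potential of the first center
    while True:
        # Step 2: pick the point with the highest (revised) potential
        i_star = max(range(len(P)), key=P.__getitem__)
        p_star = P[i_star]
        # Step 4: reject-ratio stopping test
        if p_star <= eps * p_first:
            break
        c = points[i_star]
        centers.append(c)
        # Step 3: revise potentials around the newly selected center
        P = [p - p_star * math.exp(-beta * (x - c) ** 2)
             for p, x in zip(P, points)]
    return centers
```

Note that the number of centers is never specified in advance; the reject ratio alone decides when to stop, which is exactly the property the text highlights.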
4 Proposed Architecture

This section provides an overview of the macro-architecture and the broad objectives of the proposed hardware, which implements the subtractive clustering algorithm briefly explained in Section 3. The implementation of this algorithm in hardware is the key step towards a classification system for radioactive elements. For reference, this hardware will be called hsc (hardware for subtractive clustering). This component performs all the arithmetic computation, described in Section 3, to calculate the potential of each point in the subtractive clustering
algorithm. It has two components (exp1 and exp2) to compute the exponential value e^{−α‖x_i − x_j‖²} and one component to sum (adder). The other component of this macro-architecture is called slc (storage, loading and control), which provides the hsc with the set of samples for the selection of cluster centers and stores the computed potential of each sample. This component also hosts the controller of the hsc. Figure 3 shows the components of the described macro-architecture.
Fig. 3. Macro-architecture components - SLC and HSC
The slc is a controller based on a state machine. It includes a dual-port memory md, which provides the data to be clustered, and a memory mp, which keeps track of the potential associated with each clustered data point. The data in this case could be provided by an ADC belonging to a typical gamma-ray spectrometry system. The registers xmax, xi and xIndex maintain the required data until components exp1 and exp2 have completed the related computation. We assume the xmax value is available in memory md at address 0. xmax is the biggest value found within the data stored in md; this register is used for data normalization. The two exp components inside hsc receive, at the same time, different x_j values from the dual-port memory md, so the two modules start at the same time and thus run in parallel. The samples for the two exp components are two distinct values x_j read from two subsequent memory addresses. After the computation of e^{−α‖x_i − x_j‖²} by exp1 and exp2, component adder sums and accumulates the values provided at its input ports. This process is repeated until all data x_j, 1 ≤ j ≤ N, are handled. Thus, this calculation determines the first P_i value to be stored in memory mp. After that, the process
is repeated to compute the potential values of all data points in memory md. At this point the first cluster center, i.e. the sample with the maximum potential value, has been found. The slc component works as the main controller of the process; the trigger for initiating the components exp1 and exp2 is the signal StartExp sent by slc. The proposed architecture allows the hardware for subtractive clustering, hsc, to be scaled by adding more of these components in parallel for the computation of the factors e^{−α‖x_j − x_i‖²}. This provides greater flexibility in implementing the hardware. Figure 4 shows how new hsc components are assembled in parallel. Each hsc component calculates in parallel the potential of a point i, i.e. the value P_i of (3). For this reason, each hsc module must receive and register a value x_i to work on during the calculation of the potential of a point. Since these values are at different addresses of the memory, the loading of each register x_i has to occur at a different time, because the memory cannot have its number of ports increased as the number of hsc components grows. So that it is not necessary to increase the number of control signals provided by the slc component when new hsc components are added, each hsc component itself sends some control signals to the subsequent one.
Fig. 4. Macro architecture with hsc components in parallel
These signals load the value x_i (LEXi) and start the potential reduction of each point (StartPot), as shown in (3). Moreover, each hsc component receives the signal EndAdd, which indicates the end of the operation in the Adder component of the subsequent hsc. This ensures that the main control (slc) only receives these signals after all parallel hsc components have completed their operations at each stage, allowing the hardware to be scaled without changes to the main control. Figure 5 shows the effect of this scaling, simulating different processing times in the hsc modules. The n hsc components, implemented in parallel, compute the potentials of n points of the sample set. As explained before, the loading of the value x_i has to be done in a different period for each component, so that it can be used in the calculation of the potential.
Fig. 5. Control signals with scaled architecture
Figure 5 shows that the first hsc component receives the signal LEXi to load x_i from the slc control and, after this, sends the signal LEXi to the subsequent hsc. Only after all of the hsc components have registered their values x_i is the signal to start the exp components (StartExp) sent, together with the first pair of values x_j on the dual bus BD. The internal architecture of the modules exp1 and exp2 permits the calculation of the exponential value e^{−α‖x_i − x_j‖²}. The exponential was approximated by a second-order polynomial obtained with the least-squares method [8]. The architecture computes these polynomials with all values represented as fractions, as in (4):

e^{−αx} = (N_a/D_a)(N_v/D_v)² + (N_b/D_b)(N_v/D_v) + N_c/D_c    (4)
wherein the factors N_a/D_a, N_b/D_b and N_c/D_c are pre-determined coefficients, and N_v/D_v is the fractional representation of the variable (αx). For high precision, separate coefficients were calculated for the ranges [0,1[, [1,2[, [2,4[ and [4,8]. These coefficients are shown in the quadratic polynomials of (5):
e^{−(αx)} ≅
  P_[0,1[(N_v/D_v) = (773/2500)(N_v/D_v)² − (372/400)(N_v/D_v) + 9953/10000
  P_[1,2[(N_v/D_v) = (569/5000)(N_v/D_v)² − (2853/5000)(N_v/D_v) + 823/1000
  P_[2,4[(N_v/D_v) = (67/2500)(N_v/D_v)² − (2161/10000)(N_v/D_v) + 4565/10000
  P_[4,8](N_v/D_v) = (16/10000)(N_v/D_v)² − (234/10000)(N_v/D_v) + 835/10000
  P_[8,∞[(N_v/D_v) = 0
    (5)
The accuracy of these calculated values, i.e. an introduced error no bigger than 0.005, is adequate to properly distinguish the potential values among the data provided during the subtractive clustering process. The absolute error introduced is shown in Fig. 6. Depending on the data, this requires the number of bits used to represent the numerator and denominator to be at least twice the maximum found in the data points provided.
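The piecewise polynomials of (5) can be checked directly in software. The sketch below transcribes the coefficients and verifies that the absolute error against e^{−x} stays within the order reported in Fig. 6 (below 0.01 on [0, 8[); the code layout is ours.

```python
import math

# (lower bound, upper bound, a, b, c) for a*x^2 + b*x + c, transcribed from (5)
SEGMENTS = [
    (0.0, 1.0, 773 / 2500, -372 / 400, 9953 / 10000),
    (1.0, 2.0, 569 / 5000, -2853 / 5000, 823 / 1000),
    (2.0, 4.0, 67 / 2500, -2161 / 10000, 4565 / 10000),
    (4.0, 8.0, 16 / 10000, -234 / 10000, 835 / 10000),
]

def exp_approx(x):
    """Piecewise second-order approximation of e^-x from (5)."""
    for lo, hi, a, b, c in SEGMENTS:
        if lo <= x < hi:
            return a * x * x + b * x + c
    return 0.0  # P_[8,inf[ = 0

# Scan the approximation range on a fine grid
max_err = max(abs(exp_approx(x / 100) - math.exp(-x / 100))
              for x in range(0, 800))
```

Evaluating the segment boundaries (e.g. P(1) ≈ 0.366 against e^{−1} ≈ 0.368) confirms that the transcribed coefficients join up smoothly with the true exponential.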
6
x 10
5
Absolute Error
4
3
2
1
0
0
2
6
4
8
10
X
Fig. 6. Absolute error introduced by the approximation
The architecture in Fig. 7 presents the micro-architecture of components exp1 and exp2. It uses four multipliers, one adder/subtracter and some registers. These registers are all right-shifters: the controller adjusts the binary numbers with right shifts in these registers in order to maintain the framing of the binary numbers after each operation. This is necessary to keep the results of multiplication within the bit frame used without much loss of precision; the closest fraction is used instead of a simple truncation of the higher bits of the product. In this architecture, multipliers mult1, mult2, mult3 and mult4 operate in parallel to accelerate the computation. The state machine in the controller triggers these operations and controls the various multiplexers of the architecture. The computation defined in (4) is performed as described hereafter:

– Step 1: Compute N_v × N_v, N_b × N_v, D_v × D_v and D_b × D_v;
– Step 2: Right-shift the registers to return the bit frame to its original size and, in parallel, compute A = N_a × N_v × N_v, C = N_b × N_v × D_c, D = D_b × D_v × N_c and E = D_b × D_v × D_c;
– Step 3: Compute C + D and, in parallel, B = D_a × D_v × D_v;
– Step 4: Compute A/B + (C + D)/E.
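The four steps can be mirrored with exact fraction arithmetic to see why A/B + (C + D)/E reproduces (4). The sketch below uses Python's Fraction in place of the hardware's shift-register framing; the coefficient values are the [0,1[ segment of (5), with the sign of the middle term carried by a negative N_b (an assumption of this sketch).

```python
from fractions import Fraction

def exp_poly_fraction(Na, Da, Nb, Db, Nc, Dc, Nv, Dv):
    """Evaluate (4) as in Steps 1-4: integer products, then A/B + (C + D)/E."""
    # Steps 1-2: integer products
    A = Na * Nv * Nv
    B = Da * Dv * Dv
    C = Nb * Nv * Dc
    D = Db * Dv * Nc
    E = Db * Dv * Dc
    # Steps 3-4: the two exact divisions and the final sum
    return Fraction(A, B) + Fraction(C + D, E)

# [0,1[ coefficients of (5), evaluated at Nv/Dv = 1/2
result = exp_poly_fraction(773, 2500, -372, 400, 9953, 10000, 1, 2)
```

Expanding (C + D)/E gives (N_b N_v)/(D_b D_v) + N_c/D_c, so the grouping saves one divider compared with computing the three fractional terms of (4) separately, which is the point of Steps 3 and 4.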
Fig. 7. Architecture of EXP Modules to compute the exponential
5 Results

The data shown in Figure 2 were obtained using a simulation program called Real Gamma-Spectrum Emulator. These data come in a two-column spreadsheet format, where the first column corresponds to the channel and the second to the number of counts accumulated in that channel. To validate the chosen method (subtractive clustering), the algorithm was implemented in Matlab using the simulated data. As seen in the introduction, these data simulate a radioactive source consisting of 137Cs and 60Co. To apply the subtractive clustering algorithm in Matlab, the data provided by the simulation program have to be converted into one-dimensional data in a single column. For example, if channel 1324 accumulates 100 counts, the value 1324 should appear 100 times in the input; only in this way can the clustering algorithm split the data into subgroups by frequency of appearance. In a real application these data would be equivalent to the output of the AD converter of a gamma spectrometry system, as shown in the introduction. In the spectrum of Fig. 2, one can see three peaks. The first one, in channel 1324, is characteristic of 137Cs (0.662 MeV). The second and third peaks correspond to the energies of 60Co. The circular black marks near the first and second peaks show the result of applying the subtractive clustering algorithm on the available data with the Matlab software. These circular marks are the centers of the found clusters, which lie very near (one channel to the left of) the signal peaks, the expected result. With this configuration of the algorithm in Matlab, the third peak was not found; this result can change with an adjustment of the radius r_a in (2). This is enough to conclude that the data provided belong to a radioactive source with 137Cs and 60Co and that the subtractive clustering method can be used to identify these radionuclides.
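The conversion described above, turning (channel, counts) rows into a one-dimensional sample list, is a one-liner in software; the helper name is ours.

```python
def expand_histogram(rows):
    """Repeat each channel number as many times as its accumulated counts."""
    return [channel for channel, counts in rows for _ in range(counts)]
```

Feeding the expanded list to the clustering step preserves the frequency information of the spectrum, so dense channels act as high-potential regions.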
Since the proposed architecture is based on the same algorithm, it is expected to find the same result. The initial results show that the expected cluster center can be identified as in the Matlab specification. The hardware takes about 12,660 clock cycles to yield one sum of exponential values (Σ_{j=1}^{n} e^{−α‖x_i − x_j‖²}). Considering n points in the available data set, the identification of the first cluster center would take n times that amount. Finding the center of the second cluster is faster: it should take about 13,000 clock cycles. This result can change with the data and depends on the amount of right-shift adjustment required in the shift registers during the process.
6 Conclusions

This paper describes the implementation of the subtractive clustering algorithm for radionuclide identification. The results show that the expected cluster centers can be identified with good efficiency. In data from the simulation of radioactive source signals, after conformation of the signal and its conversion into digital form, the cluster centers represent the points that characterize the energies emitted by the simulated radionuclides. The identification of these points can classify the radioactive elements present in a sample. With this method it was possible to identify more than one cluster center, which would allow recognizing more than one radionuclide in radioactive sources. These results reveal that the proposed hardware for subtractive clustering can be used to develop a portable system for radionuclide identification. This system can be developed and enhanced by integrating the proposed hardware with software executed by a processor inside the FPGA, bringing reliability and faster identification, which are important characteristics for these systems. Following this work, we intend to develop the portable system and also a software-only implementation using an embedded processor or a micro-controller, to compare it with the hardware-only solution developed here.
References
1. Dunn, J.C.: A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics 3, 32–57 (1973)
2. Yager, R.R., Filev, D.P.: Learning of Fuzzy Rules by Mountain Clustering. In: Proc. of SPIE Conf. on Application of Fuzzy Logic Technology, Boston, pp. 246–254 (1993)
3. Chiu, S.L.: A Cluster Estimation Method with Extension to Fuzzy Model Identification. In: Proc. IEEE Internat. Conf. on Fuzzy Systems, pp. 1240–1245 (1994)
4. Navabi, Z.: VHDL - Analysis and Modeling of Digital Systems, 2nd edn. McGraw-Hill, New York (1998)
5. The MathWorks, Inc.: Fuzzy Logic Toolbox - For Use With MATLAB. The MathWorks, Inc. (1999)
6. ANSI Standard N42.34: Performance Criteria for Hand-held Instruments for the Detection and Identification of Radionuclides (2003)
398
M.S. Farias, N. Nedjah, and L. de Macedo Mourelle
A Parallel Architecture for DNA Matching

Edgar J. Garcia Neto Segundo¹, Nadia Nedjah¹, and Luiza de Macedo Mourelle²

¹ Department of Electronics Engineering and Telecommunications, Faculty of Engineering, State University of Rio de Janeiro, Brazil
² Department of Systems Engineering and Computation, Faculty of Engineering, State University of Rio de Janeiro, Brazil
Abstract. DNA sequences often appear as fragments, little pieces found at a crime scene or in a hair sample for a paternity exam. In order to compare these fragments with a subject, or target, sequence of a suspect, we need an efficient tool to analyze DNA sequence alignment and matching. DNA matching is thus a bioinformatics field that seeks functional relationships between sequences through their alignments, and then tries to understand them. Usually done in software through database cluster analysis, DNA matching requires a lot of computational resources, which may increase a bioinformatics project's budget. We propose a parallel hardware architecture, based on a heuristic method, capable of reducing the time spent on the matching process.
1 Introduction

Despite the discoveries about DNA made years ago [13], computers were unable to provide enough performance for some specific tasks. In fact, the features of biological applications imply a prohibitive computational cost. Advances in computation allow scientists to make use of informatics techniques to solve biological problems, or to improve existing methods. The field that combines computational knowledge with biological questions is called bioinformatics, or computational biology, and it involves finding the genes in the DNA sequences of various organisms, developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences, clustering protein sequences into families of related sequences along with the development of protein models, and aligning similar proteins and generating phylogeny trees to examine evolutionary relationships [13]. One of the main challenges in bioinformatics consists of aligning DNA strings and understanding any functional relationships that may exist between them. For this purpose, algorithms are specifically developed to reduce the time spent in the DNA matching process, evaluating the similarity degree between sequences. These algorithms are usually based on dynamic programming and may work well, in a fair time and at a fair cost, for short sequences, but commonly take more time as the strings get bigger. Massively implemented in software, algorithms for DNA alignment compare a query sequence with a subject sequence, often stored in a public database, running a global or local search in the subject string to find the optimal alignment of the two sequences. The Needleman-Wunsch [9] and Smith-Waterman [16] algorithms are well known algorithms for DNA

Y. Xiang et al. (Eds.): ICA3PP 2011 Workshops, Part II, LNCS 7017, pp. 399–407, 2011. © Springer-Verlag Berlin Heidelberg 2011
alignment. The former is based on a global search strategy and the latter uses local search. While global search based methods work on all of the search space, local search based methods attempt to reduce this space, finding small similarities that are expanded in the next stages. Consequently, local search based techniques are more appropriate to locate sections wherein global search based alignment algorithms usually fail. The major advantage of the methods based on dynamic programming is the commitment to discover the best match. However, that commitment requires huge computational resources [2, 4]. DNA matching algorithms based on heuristics [19] emerged as an alternative to dynamic programming, in order to remedy the high computational cost and time requirements. Instead of aiming at the optimal alignment, heuristics based methods attempt to find a set of acceptable, or pseudo-optimal, possible solutions. By ignoring unlikely alignments, these techniques have improved the performance of DNA matching [3]. Among heuristics based methods, BLAST [1, 7] and FASTA [5, 10, 11] stand out. Both of them have well defined procedures for the three main stages of aligning algorithms: seeding, extending and evaluating. BLAST is the fastest algorithm known so far [12, 14]. In this paper, we focus on this algorithm and propose a massively parallel architecture suited for hardware implementation of DNA matching using the BLAST algorithm. The main objective of this work is the acceleration of the aligning procedure. The rest of this paper is organized as follows: First, in Section 2, we briefly describe how the BLAST algorithm operates and report on its main characteristics. Then, in Section 3, we focus on the description of the proposed architecture. Subsequently, in Section 4, we draw some conclusions and point out some new directions for future work.
2 The BLAST Algorithm

The BLAST (Basic Local Alignment Search Tool) [1] algorithm is a heuristic search based method that seeks words of length w that score at least t, called the threshold, when aligned with the query. The scoring process is performed according to predefined criteria that are usually prescribed by geneticists. This task is called seeding, wherein BLAST attempts to find regions of similarity to start its matching procedure. This step has a very powerful heuristic advantage, because it only keeps pairs whose matching score is larger than the pre-defined threshold t. Of course, there is some risk of leaving out some worthy alignments. Nonetheless, using this strategy, the search space decreases drastically, hence accelerating the convergence of the matching process. After identifying all possible alignment locations, or seeds, and leaving out those pairs that do not score at least the prescribed threshold, the algorithm proceeds with the extension stage. It consists of extending the alignment words to the right and to the left within both the subject and query sequences, in an attempt to find a locally optimal alignment. Some versions of BLAST introduce the use of a wildcard symbol, called the gap, which can be used to replace any mismatch [1]. Here, we do not allow gaps. Finally, BLAST tries to improve the score of the high scoring pairs (HSPs) through a second extension process, and a pair is dismissed when the corresponding score does not reach a new pre-defined threshold. HSPs that meet this criterion are reported by BLAST as final results, provided that they do not exceed the cutoff prescribed value, which specifies the number of descriptions and/or alignments that should be reported. This last step is called evaluating. BLAST employs a measure based on well-defined mutation scores. It directly approximates the results that would be obtained by any dynamic programming algorithm optimizing this measure. The method allows for the detection of weak but biologically significant sequence similarities. The algorithm is more than one order of magnitude faster than existing heuristic algorithms. Compared to other heuristics-based methods, such as FASTA [5], BLAST performs DNA and protein sequence similarity alignment much faster, but it is considered to be equally sensitive. BLAST is very popular due to the availability of the program online at the National Center for Biotechnology Information (NCBI), among other sites.
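The three stages can be condensed into a small software model. The sketch below is for intuition only, not the authors' hardware; the match/mismatch scoring and the thresholds w, t and s_min are illustrative assumptions:

```python
def ungapped_blast(query, subject, w=4, t=4, match=1, mismatch=-1, s_min=6):
    """Toy ungapped BLAST: seed with w-mers scoring at least t, extend each
    hit left and right without gaps, keep HSPs whose score reaches s_min."""
    score = lambda a, b: match if a == b else mismatch
    hsps = []
    for i in range(len(query) - w + 1):
        word = query[i:i + w]
        for j in range(len(subject) - w + 1):
            s = sum(score(a, b) for a, b in zip(word, subject[j:j + w]))
            if s < t:                                # seeding: discard low-scoring pairs
                continue
            qi, qj, si, sj = i, i + w, j, j + w      # extension: grow both ends
            while qi > 0 and si > 0 and query[qi - 1] == subject[si - 1]:
                qi, si, s = qi - 1, si - 1, s + match
            while qj < len(query) and sj < len(subject) and query[qj] == subject[sj]:
                qj, sj, s = qj + 1, sj + 1, s + match
            if s >= s_min:                           # evaluating: second threshold
                hsps.append((qi, si, query[qi:qj], s))
    return hsps
```

Overlapping seeds of the same region extend to the same HSP, so duplicates can be collapsed with a set; the hardware described next parallelizes exactly these per-seed tasks.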
3 The Proposed Macro-architecture

Although well known, BLAST implementations are usually done in software [15]. While software implementations are of low cost, they often yield a low throughput. On the other hand, dedicated hardware implementations usually impose a much higher cost, but they provide better performance. The main motivation of this work is to propose hardware that implements the steps of the BLAST algorithm so as to achieve a reduced response time and thus a high throughput. For this purpose, we explore some important features of BLAST to massively parallelize the execution of all independent tasks. The parallel architecture presented in this section is designed to execute the ungapped alignment using the BLAST procedure [1]. This is done for nucleotides of DNA sequences. A nucleotide can be one of four possibilities: A (Adenine), T (Thymine), C (Cytosine) and G (Guanine). Thus, a nucleotide may be represented using two bits. In order to improve the nucleotide comparison, we use two identical matching components, one for the most significant bits and the other for the least significant bits. These components operate synchronously and in parallel. This should accelerate the comparison process to up to twice the speed of a simple bit-at-a-time comparison. The macro-architecture of the aligning hardware in Fig. 1 shows that the query and subject sequences (QS and SS) are stored into four incoming registers, wherein LSW and MSW stand for Least and Most Significant Word, respectively. In this figure and throughout the figures of this paper, the components that appear in the background in gray are the ones that operate on the LSW of the query and subject sequences. We will use this architecture to show the computational steps of BLAST.

3.1 Seeding

Intuitively, an alignment of two sequences consists of some comparisons followed by evaluations, using a kind of pointers that point at the start and end positions in the query and subject sequences.
Our parallel hardware takes advantage of this idea and performs these tasks in parallel.
Fig. 1. The macro-architecture of the aligning hardware
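The two-bit encoding and the MSW/LSW register split can be modelled as follows; the particular bit assignment for A, T, C and G is an assumption for illustration, as the paper does not fix one:

```python
# Hypothetical 2-bit nucleotide codes; any one-to-one assignment works,
# the paper does not specify which one the hardware uses.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}

def split_words(seq):
    """Split a nucleotide sequence into the two bit-planes handled by the
    twin matching components: MSW holds the high bit of each code, LSW
    the low bit. Both planes are compared synchronously in hardware."""
    msw = [(CODE[n] >> 1) & 1 for n in seq]
    lsw = [CODE[n] & 1 for n in seq]
    return msw, lsw
```

Two nucleotides match exactly when both their MSW and LSW bits match, which is why the two matching components can run independently and in parallel.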
The hardware architecture for this step depends on a parameter that sets the required speed and sensitivity of the alignment process. The query sequence is divided into words, as illustrated in Fig. 2. The words are logic mappings of the bits of the query sequence. Let w be the size of the words to be formed, and n and m the total sizes of the query sequence QS and the subject sequence SS, respectively. The query sequence is then subdivided into n−w words, where the ith word is formed by (QSi, QSi+1, QSi+2, …, QSi+w−1). Similarly, the subject sequence is subdivided into m words, where the jth word is formed by (SSj, SSj+1, SSj+2, …, SSj+w−1). The size of the considered words is determined by the parameter w. Each cycle, the subject sequence is shifted by one position and compared to the query sequence accordingly. The sensitivity of the algorithm depends on the value of w: for small values of w, many words are generated, so the hardware becomes more sensitive but slower than for larger values of w.

Fig. 2. Illustration of the seeds identification process
Finally, the words are compared with the subject sequence. This comparison grades the matching process based on a predefined score table. Words that score below the threshold t are discarded. The remaining words are called seeds. For each seed, we create a block, so that the next steps of the algorithm proceed in parallel. As is usual for DNA strings, a seed is only an identical string fragment shared by the subject and query sequences, so our hardware finds identical strings and discards everything else.
Fig. 3. Illustration of the comparison during the seeding process
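The word formation and shift-and-compare behaviour just described can be sketched in software (a sequential model for intuition; in hardware all word comparisons of one shift happen in parallel, and the n−w word count follows the text):

```python
def make_words(qs, w):
    """Form the n-w query words described in the text: the ith word is
    QS[i..i+w-1]."""
    return [qs[i:i + w] for i in range(len(qs) - w)]

def find_seeds(qs, ss, w):
    """Model of the shift-and-compare seeding: each cycle the subject
    sequence is shifted by one position and every query word is compared
    against the aligned subject word; only identical fragments are kept,
    as in the hardware."""
    words = make_words(qs, w)
    seeds = []
    for j in range(len(ss) - w + 1):      # one shift of the subject per cycle
        for i, word in enumerate(words):  # all comparators act at once in hardware
            if word == ss[j:j + w]:       # exact match only
                seeds.append((i, j, word))
    return seeds
```

Each returned (i, j) pair identifies where an extension block will later be generated.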
Some VHDL [8] features, such as the generate construction, enable the description of repetitive and conditional structures in circuit specifications. Generic declarations allow us to parameterize the size and structure of a circuit. Thus, for each seed, we generate an extension block, which is described in the next section, so that all the blocks operate in parallel on all the found seeds.

3.2 Extension

In this step, each seed is analyzed again in an attempt to improve its score. For that to happen, we stretch the alignment between the query-seed and the subject sequence, stored in a register. The extension is done in both the left and right directions, starting from the position where the exact match occurred. In the current extension step, we look either for exact matches, or for whatever matches meet the threshold constraints. The positions of the extended words that generated a hit are bookkept using a tag. This tag is formed from two pieces of data: register position and offset, as shown in Fig. 4, wherein the part ss of the tag indicates a position in the subject sequence register and wf indicates the relative position of the current word.
Fig. 4. Tags formation for the found seeds
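The tag layout can be illustrated with a simple bit-field packing; the field widths SS_BITS and WF_BITS below are assumed values for illustration, since the paper does not specify how many bits each field occupies:

```python
# Hypothetical field widths for the two tag fields: the subject-sequence
# register position (ss) and the relative word offset (wf).
SS_BITS, WF_BITS = 10, 6

def pack_tag(ss, wf):
    """Pack a (register position, offset) pair into a single tag word."""
    assert 0 <= ss < (1 << SS_BITS) and 0 <= wf < (1 << WF_BITS)
    return (ss << WF_BITS) | wf

def unpack_tag(tag):
    """Recover the (ss, wf) pair an extension processor needs in order to
    compute the absolute position of the subsequence."""
    return tag >> WF_BITS, tag & ((1 << WF_BITS) - 1)
```

A fixed-width tag like this is what determines the width of the FIFOs between the seeding stage and the extension processors.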
For further processing, these tags are stored into a FIFO and then sent to a processor, which performs the comparison and scoring tasks. For each word generated in the seeding step, we have one comparison block, which creates one tag that inputs a dedicated FIFO. Therefore, the required work is done in a totally parallel manner until it reaches the load balancer. The extension does not stop until the accumulated total score of the high scoring pairs (HSPs) begins to decrease. When the extension should stop depends on a predefined parameter, called the drop-off. In our implementation, though, the extension stops when a mismatch is found.
Fig. 5. Extension, comparison and scoring dynamics
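The mismatch-bounded extension just described can be modelled as follows; this is a sequential sketch of one extension processor, whereas in the hardware the left and right extensions are led separately and in parallel:

```python
def extend_seed(qs, ss, qi, sj, w, match=1):
    """Starting from the exact-match seed of length w at query position qi
    and subject position sj, extend left and right until the first
    mismatch on each side (no gaps, no drop-off, as in the described
    implementation). Returns the extended alignment and its score."""
    left = 0
    while qi - left > 0 and sj - left > 0 \
            and qs[qi - left - 1] == ss[sj - left - 1]:
        left += 1
    right = 0
    while qi + w + right < len(qs) and sj + w + right < len(ss) \
            and qs[qi + w + right] == ss[sj + w + right]:
        right += 1
    score = (w + left + right) * match
    return qi - left, sj - left, w + left + right, score
```

The processor then updates the tag with the new position and score, or discards it if the score does not reach the second threshold.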
A tag is treated by one of the extension processors, which first computes the absolute position of the subsequence corresponding to this tag. After that, it fetches from the subject and query registers the contents of the next position, which is either to the left or to the right of the subsequence being processed. Subsequently, it compares the subsequences of bits while scoring them. Thereafter, the processor updates or discards the tag. The updated tags are stored into the memory for final evaluation and result output. So, the extension processor, whose architecture is shown in Fig. 6, performs several very simple tasks. As explained, these tasks need to be done sequentially. The right and the left extensions are started immediately when a tag for a seed is generated, assuming that there exists an idle processor.
Fig. 6. The extension processor architecture
In order to process several tags in parallel, we opted to include several processors that operate in parallel. We presume that the seed generation process yields tags faster than a processor can consume them, as the processor has to extend the subsequence to the left and to the right, which can be time consuming. For this purpose, we decided to include a FIFO between the seeding stage and the extension processors. This controls the fast input of tags and their slower processing. Note that the left and right extensions are led separately and in parallel. The width of the FIFO is determined by the size of the tags, while its depth is derived from the number of included processors. As there are more FIFOs than processors, we use a load balancer that dispatches tags to processors. This component monitors the content of the FIFOs and selects the next tag to be processed. It always withdraws tags from the FIFO that has the fewest available entries. The main purpose of the load balancer is to avoid a full-FIFO state, because when this happens the seeding process associated with the full FIFO must halt until a new entry becomes available.
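The FIFO-and-load-balancer arrangement can be sketched as below; the class and method names are hypothetical, and the model serializes what the hardware does concurrently:

```python
from collections import deque

class LoadBalancer:
    """Software model of the load balancer: among the per-seed FIFOs, it
    always drains the fullest one (the FIFO with the fewest free entries),
    so that no seeding block stalls on a full FIFO."""

    def __init__(self, n_fifos, depth):
        self.depth = depth
        self.fifos = [deque() for _ in range(n_fifos)]

    def push(self, fifo_id, tag):
        """A seeding block deposits a tag; a full FIFO would stall it."""
        if len(self.fifos[fifo_id]) >= self.depth:
            raise RuntimeError("seeding block must halt: FIFO full")
        self.fifos[fifo_id].append(tag)

    def next_tag(self):
        """Withdraw the next tag from the fullest non-empty FIFO, or
        return None when all FIFOs are empty."""
        fullest = max(self.fifos, key=len)
        return fullest.popleft() if fullest else None
```

Draining the fullest FIFO first keeps the occupancy of all FIFOs balanced, which is exactly the stall-avoidance goal stated above.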
3.3 Evaluating

Once a tag has been processed and updated by the extension processor, it is evaluated by comparing the obtained score against the second threshold. The final results of the DNA alignment process are those subsequences whose associated tags scored above this predefined threshold. The final outputs are presented in the form of tags. Note that this stage is implemented by a simple binary comparator of two signed integers: the score associated with the considered tag and the threshold.
4 Conclusion

In this paper, we presented a reconfigurable parallel hardware architecture for DNA alignment. It exploits the inherent advantages of reconfigurable hardware, such as availability and low cost. The proposed architecture is easily scalable for different query, subject and word sizes. Moreover, the overall architecture is inherently parallel, resulting in reduced signal propagation delay. A parameterized VHDL description was written and simulated on ModelSim XE III 6.4 [6]. Future work consists of evaluating the characteristics of such an implementation on an FPGA [17] and how it performs in a real-case DNA alignment.
References

[1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
[2] Baldi, P., Brunak, S.: Bioinformatics: The Machine Learning Approach, 1st edn. MIT Press, Cambridge (2001)
[3] Baxevanis, A.D., Francis Ouellette, B.F.: Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 1st edn. Wiley Interscience, Hoboken (1998)
[4] Giegerich, R.: A systematic approach to dynamic programming in bioinformatics. Bioinformatics 16(8), 665–677 (2000)
[5] Lipman, D.J., Pearson, W.R.: Rapid and sensitive protein similarity searches. Science 227(4693), 1435–1441 (1985)
[6] ModelSim: High performance and capacity mixed HDL simulation. Mentor Graphics (2011), http://model.com
[7] Mount, D.W.: Bioinformatics: Sequence and Genome Analysis, 2nd edn. Cold Spring Harbor Laboratory Press (2004)
[8] Navabi, Z.: VHDL: Analysis and Modeling of Digital Systems, 2nd edn. McGraw Hill, New York (1998)
[9] Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970)
[10] Pearson, W.R., Lipman, D.J.: Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America 85(8), 2444–2448 (1988)
[11] Pearson, W.: Searching protein sequence libraries: Comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics 11(3), 635–650 (1991)
[12] Pearson, W.: Comparison of methods for searching protein sequence databases. Protein Science 4(6), 1145 (1995)
[13] Searls, D.: The language of genes. Nature 420, 211–217 (2002)
[14] Shpaer, E.G., Robinson, M., Yee, D., Candlin, J.D., Mines, R., Hunkapiller, T.: Sensitivity and selectivity in protein similarity searches: A comparison of Smith-Waterman in hardware to BLAST and FASTA. Genomics 38(2), 179–191 (1996)
[15] Oehmen, C., Nieplocha, J.: ScalaBLAST: A scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis. IEEE Transactions on Parallel & Distributed Systems 17(8), 740–749 (2006)
[16] Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)
[17] Wolf, W.: FPGA-Based System Design. Prentice-Hall, Englewood Cliffs (2004)
Author Index
Aalsalem, Mohammed Y. II-153 Abawajy, Jemal II-165, II-235, II-245, II-266 Abdelgadir, Abdelgadir Tageldin II-225 Abramson, David I-1 Adorna, Henry II-99 A. Hamid, Isredza Rahmi II-266 Ahmed, Mohiuddin II-225 Albaladejo, José II-343 Anjo, Ivo II-1 Araújo, Guido I-144 Arefin, Ahmed Shamsul II-375 Arshad, Quratulain II-153 Athauda, Rukshan II-175 Atif, Muhammad I-129 Aziz, Izzatdin A. I-433 Backes, Werner I-27 Bahig, Hatem M. II-321 Bahig, Hazem M. II-321 Baldassin, Alexandro I-144 Bardino, Jonas I-409 Based, Md. Abdul II-141 Bellatreche, Ladjel I-158 Benítez, César Manuel Vargas II-363 Benkrid, Soumia I-158 Berretta, Regina II-375 Berthold, Jost I-409 Bichhawat, Abhishek I-218 Brezany, Peter I-206 Buyya, Rajkumar I-371, I-395, I-419 Byun, Heejung II-205 Cabarle, Francis George II-99 Cachopo, João I-326, II-1 Carmo, Renato I-258 Carvalho, Fernando Miguel I-326 Chang, Hsi-Ya I-282 Chang, Rong-Guey I-93 Chen, Chia-Jung I-93 Chen, Xu I-294 Chen, Yi II-54 Chu, Wanming I-54, I-117 Chung, Chung-Ping I-80
Chung, Tai-Myoung II-74 Cohen, Jaime I-258 Colin, Jean-Yves II-89 Crain, Tyler I-244 Crespo, Alfons II-343 Crolotte, Alain I-158 Cuzzocrea, Alfredo I-40, I-158 da Silva Barreto, Raimundo I-349 David, Vincent I-385 de Macedo Mourelle, Luiza II-387, II-399 de Sousa, Leandro P. II-215 Dias, Wanderson Roger Azevedo I-349 Dinh, Thuy Duong I-106 Domínguez, Carlos II-343 Duan, Hai-xin I-182, I-453 Duarte Jr., Elias P. I-258, II-215 Duato, José II-353 Duggal, Abhinav I-66 El-Mahdy, Ahmed I-270 Ewing, Gregory II-33 Faldella, Eugenio II-331 Fathy, Khaled A. II-321 Fernando, Harinda II-245 Folkman, Lukas II-64 França, Felipe M.G. II-14 Fürlinger, Karl II-121 Gao, Fan II-131 Garcia Neto Segundo, Edgar J. II-399 Garg, Saurabh Kumar I-371, I-395 Ghazal, Ahmad I-158 Gomaa, Walid I-270 Gopalaiyengar, Srinivasa K. I-371 Goscinski, Andrzej M. I-206, I-433 Goswami, Diganta I-338 Goubier, Thierry I-385 Gu, Di-Syuan I-282 Guedes, André L.P. I-258 Hackenberg, Daniel I-170 Han, Yuzhang I-206
Haque, Asrar Ul II-24 Haque, Mofassir II-33 Hassan, Houcine II-343, II-353 Hassan, Mohammad Mehedi I-194 He, Haohu II-54 Hobbs, Michael M. I-433 Hou, Kaixi I-460 Huang, Jiumei I-460 Huang, Kuo-Chan I-282 Huh, Eui-Nam I-194 Hussin, Masnida I-443 Imbs, Damien I-244 Inostroza-Ponta, Mario II-375 Izu, Cruz II-276
Jannesari, Ali I-14 Javadi, Bahman I-419 Jiang, He-Jhan I-282 Jozwiak, Lech II-14 Kaneko, Keiichi I-106 Kaosar, Md. Golam I-360 Katoch, Samriti I-66 Khan, Javed I. II-24 Khan, Wazir Zada II-153 Khorasani, Elahe I-318 Khreishah, Abdallah II-109 Kim, Cheol Min II-196 Kim, Hye-Jin II-186, II-196 Kozielski, Stanislaw I-230 Kranzlmüller, Dieter II-121 Kwak, Ho-Young II-196 Lee, Cheng-Yu I-93 Lee, Junghoon II-186, II-196 Lee, Young Choon I-443 Lei, Songsong II-43 Leung, Carson K. I-40 Li, Hongjuan I-2 Li, Keqiu I-2 Li, Shigang II-54 Li, Xiuqiao II-43 Li, Yamin I-54, I-117 Li, Yongnan II-43 Liljeberg, Pasi II-287 Lim, Hun-Jung II-74 Lima, Carlos R. Erig II-363 Lin, Tzong-Yen I-93
Liu, Wu I-453 Lopes, Heitor Silvério II-363 Louise, Stéphane I-385 Majumder, Soumyadip I-338 Malysiak-Mrozek, Bożena I-230 Marco, Maria II-343 Martínez–del–Amor, Miguel A. II-99 Mathieson, Luke II-375 McNickle, Don II-33 Md Fudzee, Mohd Farhan II-235 Mjølsnes, Stig Fr. II-141 Molka, Daniel I-170 Moreno, Edward David I-349 Moscato, Pablo II-375 Mrozek, Dariusz I-230 Müller, Matthias S. I-170 Nakechbandi, Moustafa II-89 Nedjah, Nadia II-14, II-387, II-399 Nery, Alexandre Solon II-14 Nguyen, Man I-481 Nicácio, Daniel I-144 Ninggal, Mohd Izuan Hafez II-165 Park, Gyung-Leen II-186, II-196 Pathan, Al-Sakib Khan II-225, II-255 Paulet, Russell I-360 Paulovicks, Brent D. I-318 Pawlikowski, Krzysztof II-33 Pawlowski, Robert I-230 Peng, Shietung I-54, I-117 Peng, Yunfeng II-54 Pérez–Jiménez, Mario J. II-99 Petit, Salvador II-353 Phan, Hien I-481 Pranata, Ilung II-175 Pullan, Wayne II-64 Qin, Guangjun II-43 Qu, Wenyu I-2 Radhakrishnan, Prabakar I-66 Ragb, A.A. II-321 Rahman, Mohammed Ziaur I-306 Ramírez-Pacheco, Julio C. II-255 Raynal, Michel I-244 Ren, Ping I-453 Rivera, Orlando II-121 Rodrigues, Luiz A. I-258
Author Index Sahuquillo, Julio II-353 Salehi, Mohsen Amini I-419 Samra, Sameh I-270 Santana Farias, Marcos II-387 Scalabrin, Marlon II-363 Sch¨ one, Robert I-170 Serrano, M´ onica II-353 Seyster, Justin I-66 Sham, Chiu-Wing I-294 Sheinin, Vadim I-318 Shi, Justin Y. II-109 Shih, Po-Jen I-282 Shoukry, Amin I-270 Silva, Fabiano I-258 Sirdey, Renaud I-385 Skinner, Geoff II-175 So, Jungmin II-205 Soh, Ben I-481 Song, Biao I-194 Song, Bin II-312 Stantic, Bela II-64 Stojmenovic, Ivan I-2 Stoller, Scott D. I-66 Strazdins, Peter I-129 Sun, Lili II-54 Taifi, Moussa II-109 Tam, Wai M. I-294 Tan, Jefferson II-131 Tenhunen, Hannu II-287 Tichy, Walter F. I-14 Toral-Cruz, Homero II-255 Tucci, Primiano II-331 Tupakula, Udaya I-218
Varadharajan, Vijay I-218 Vinter, Brian I-409 Voorsluys, William I-395 Wada, Yasutaka I-270 Wang, Pingli II-312 Wang, Yini I-470 Wang, Yi-Ting I-80 Wen, Sheng I-470 Weng, Tsung-Hsi I-80 Westphal-Furuya, Markus I-14 Wetzel, Susanne I-27 Wu, Jianping I-182
Xiang, Yang I-470, II-153 Xiao, Limin II-43 Xu, Thomas Canhao II-287 Yao, Shucai II-54 Yeo, Hangu I-318 Yi, Xun I-360 Yoo, Jung Ho II-300 Zadok, Erez I-66 Zhang, Gongxuan II-312 Zhang, Lingjie I-460 Zhao, Ying I-460 Zhao, Yue I-294 Zheng, Ming I-182 Zhou, Wanlei I-470 Zhou, Wei I-470 Zhu, Zhaomeng II-312 Zomaya, Albert Y. I-443