
Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany

6522

Marcos K. Aguilera Haifeng Yu Nitin H. Vaidya Vikram Srinivasan Romit Roy Choudhury (Eds.)

Distributed Computing and Networking 12th International Conference, ICDCN 2011 Bangalore, India, January 2-5, 2011 Proceedings


Volume Editors

Marcos K. Aguilera
Microsoft Research Silicon Valley
1065 La Avenida – bldg. 6, Mountain View, CA 94043, USA
E-mail: [email protected]

Haifeng Yu
National University of Singapore, School of Computing
COM2-04-25, 15 Computing Drive, Republic of Singapore 117418
E-mail: [email protected]

Nitin H. Vaidya
University of Illinois at Urbana-Champaign
458 Coordinated Science Laboratory, MC-228, 1308 West Main Street, Urbana, IL 61801, USA
E-mail: [email protected]

Vikram Srinivasan
Alcatel-Lucent Technologies
Manyata Technology Park, Nagawara, Bangalore 560045, India
E-mail: [email protected]

Romit Roy Choudhury
Duke University, ECE Department
130 Hudson Hall, Box 90291, Durham, NC 27708, USA
E-mail: [email protected]

Library of Congress Control Number: 2010940620
CR Subject Classification (1998): C.2, D.1.3, D.2.12, D.4, F.2, F.1.2, H.4
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN: 0302-9743
ISBN-10: 3-642-17678-X Springer Berlin Heidelberg New York
ISBN-13: 978-3-642-17678-4 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2011 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180

Message from the General Chairs

On behalf of the Conference Committee for ICDCN 2011, it is our pleasure to welcome you to Bangalore, India for the 12th International Conference on Distributed Computing and Networking. ICDCN is a premier international forum for distributed computing and networking researchers, vendors, practitioners, application developers, and users, organized every year with support from industry and academic sponsors. Since the first conference on distributed computing, held in 2000, ICDCN has become a leading forum for researchers and practitioners to exchange ideas and share best practices in the field of distributed computing and networking. In addition, ICDCN serves as a forum for PhD students to share their research ideas and get quality feedback from renowned experts in the field. This reputation rests on the quality of the work submitted, the standard of the tutorials and workshops organized, the dedication and sincerity of the Technical Program Committee members, the quality of the keynote speakers, the ability of the Steering Committee to react to change, and a student-friendly policy of sponsoring a large number of travel grants while keeping the registration cost among the lowest of any international conference.

This 12th ICDCN illustrates the intense productivity and cutting-edge research of the members of the distributed computing and networking community across the globe. This is the first time ICDCN is hosted in Bangalore, the Silicon Valley of India. It is jointly hosted by Infosys Technologies, a leading global information technology company headquartered in Bangalore, and the International Institute of Information Technology-Bangalore (IIIT-B), a renowned research and academic institution. Bangalore is the hub of information technology companies and the seat of innovation in India's high-tech industry. The richness of its culture and history, blended with a modern lifestyle, the vibrancy of its young professional population, and its position at the heart of southern India, makes Bangalore one of the major Indian tourist destinations.

We are grateful for the generous support of our numerous sponsors: Infosys, Google, Microsoft Research, HP, IBM, Alcatel-Lucent, NetApp, and NIIT University. Their sponsorship is critical to the success of this conference. The success of the conference also depended on the help of many other people, and our thanks go to each and every one of them: the Steering Committee, which helped us at all stages of the conference; the Technical Program Committee, which meticulously evaluated each and every paper submitted to the conference; the Workshop and Tutorial Committee, which put together top-notch and topical workshops and tutorials; the Local Arrangements and Finance Committee, which worked day in and day out to make sure that every attendee felt at home before and during the conference; and the other Chairs, who toiled hard to maintain the high standards of the conference and make it a great success.

Welcome to ICDCN 2011, Bangalore, and India.

January 2011

Sanjoy Paul
Lorenzo Alvisi

Message from the Technical Program Chairs

The 12th International Conference on Distributed Computing and Networking (ICDCN 2011) continues to grow as a leading forum for disseminating the latest research results in distributed computing and networking. It is our great pleasure to present the proceedings of the technical program of ICDCN 2011.

This year we received 140 submissions from all over the world, including Austria, Canada, China, Finland, France, Germany, India, Iran, Israel, The Netherlands, Portugal, Singapore, Spain, Sri Lanka, Switzerland, and the USA. These submissions were carefully reviewed and evaluated by the Program Committee, which consisted of 36 members for the Distributed Computing track and 48 members for the Networking track. For some submissions, the Program Committee solicited additional help from external reviewers. The Program Committee eventually selected 31 regular papers and 3 short papers for inclusion in the proceedings and presentation at the conference.

It is our distinct honor to recognize the paper "Generating Fast Indulgent Algorithms" by Dan Alistarh, Seth Gilbert, Rachid Guerraoui, and Corentin Travers as the Best Paper in the Distributed Computing track, and the paper "GoDisco: Selective Gossip-Based Dissemination of Information in Social Community-Based Overlays" by Anwitaman Datta and Rajesh Sharma as the Best Paper in the Networking track. In both the reviewing and the best-paper selection processes, PC members and PC Chairs who had a conflict of interest with a given paper were excluded from the decision making related to that paper.

Besides the core technical program, ICDCN 2011 offers a number of other stimulating events. Before the main conference program, we have a full day of tutorials. During the main conference, we are fortunate to have several distinguished scientists as keynote speakers. The main conference is followed by several other exciting events, including the PhD forum.

We thank all the authors who submitted papers to ICDCN 2011; their submissions allowed us to select a strong technical program. We thank the Program Committee members and external reviewers for their diligence and commitment, both during the reviewing process and during the online discussion phase. We thank the conference General Chairs and the other Organizing Committee members for working with us to make ICDCN 2011 a success.

January 2011

Marcos K. Aguilera
Romit Roy Choudhury
Vikram Srinivasan
Nitin Vaidya
Haifeng Yu

Organization

General Chairs
Lorenzo Alvisi, University of Texas at Austin, USA (Distributed Computing Track)
Sanjoy Paul, Infosys Technologies, Bangalore, India (Networking Track)

Program Chairs

Networking Track
Vikram Srinivasan (Co-chair), Alcatel-Lucent, India
Nitin Vaidya (Co-chair), University of Illinois at Urbana-Champaign, USA
Romit Roy Choudhury (Vice Chair), Duke University, USA

Distributed Computing Track
Marcos K. Aguilera (Co-chair), Microsoft Research Silicon Valley, USA
Haifeng Yu (Co-chair), National University of Singapore, Singapore

Keynote Chairs
Sajal Das, University of Texas at Arlington and NSF, USA
Prasad Jayanti, Dartmouth College, USA

Tutorial Chairs
Vijay Garg, University of Texas at Austin, USA
Samir Das, Stony Brook University, USA

Publication Chairs
Marcos K. Aguilera, Microsoft Research Silicon Valley, USA
Haifeng Yu, National University of Singapore, Singapore
Vikram Srinivasan, Alcatel-Lucent, India


Publicity Chairs
Luciano Bononi, University of Bologna, Italy
Dipanjan Chakraborty, IBM Research Lab, India
Anwitaman Datta, NTU, Singapore
Rui Fan, Microsoft, USA

Industry Chair
Ajay Bakre, Intel, India

Finance Chair
Santonu Sarkar, Infosys Technologies, India

PhD Forum Chairs
Mainak Chatterjee, University of Central Florida, USA
Sriram Pemmaraju, University of Iowa, Iowa City, USA

Local Arrangements Chairs
Srinivas Padmanabhuni, Infosys Technologies, India
Amitabha Das, Infosys Technologies, India
Debabrata Das, International Institute of Information Technology, Bangalore, India

International Advisory Committee
Prith Banerjee, HP Labs, USA
Prasad Jayanti, Dartmouth College, USA
Krishna Kant, Intel and NSF, USA
Dipankar Raychaudhuri, Rutgers University, USA
S. Sadagopan, IIIT Bangalore, India
Rajeev Shorey, NIIT University, India
Nitin Vaidya, University of Illinois at Urbana-Champaign, USA
Roger Wattenhofer, ETH Zurich, Switzerland

Program Committee: Networking Track
Arup Acharya, IBM Research, USA
Habib M. Ammari, Hofstra University, USA
Vartika Bhandari, Google, USA
Bharat Bhargava, Purdue University, USA
Saad Biaz, Auburn University, USA
Luciano Bononi, University of Bologna, Italy
Mainak Chatterjee, University of Central Florida, USA
Mun Choon Chan, National University of Singapore, Singapore
Carla-Fabiana Chiasserini, Politecnico di Torino, Italy
Romit Roy Choudhury, Duke University, USA
Marco Conti, University of Bologna, Italy
Amitabha Das, Infosys, India
Samir Das, Stony Brook University, USA
Roy Friedman, Technion, Israel
Marco Gruteser, Rutgers University, USA
Katherine H. Guo, Bell Labs, USA
Mahbub Hassan, University of New South Wales, Australia
Gavin Holland, HRL Laboratories, USA
Sanjay Jha, University of New South Wales, Australia
Andreas Kassler, Karlstad University, Sweden
Salil Kanhere, University of New South Wales, Australia
Jai-Hoon Kim, Ajou University, South Korea
Myungchul Kim, Information and Communication University, South Korea
Young-Bae Ko, Ajou University, South Korea
Jerzy Konorski, Gdansk University of Technology, Poland
Bhaskar Krishnamachari, University of Southern California, USA
Mohan Kumar, University of Texas at Arlington, USA
Joy Kuri, IISc, Bangalore, India
Baochun Li, University of Toronto, Canada
Xiangyang Li, Illinois Institute of Technology, USA
Ben Liang, University of Toronto, Canada
Anutosh Maitra, Infosys, India
Archan Misra, Telcordia Lab, USA
Mehul Motani, National University of Singapore, Singapore
Asis Nasipuri, University of North Carolina at Charlotte, USA
Srihari Nelakuditi, University of South Carolina, USA
Sotiris Nikoletseas, Patras University, Greece
Kumar Padmanabh, Infosys, India
Chiara Petrioli, University of Rome La Sapienza, Italy
Bhaskaran Raman, IIT Bombay, India
Catherine Rosenberg, University of Waterloo, Canada
Rajashri Roy, IIT Kharagpur, India
Bahareh Sadeghi, Intel, USA
Moushumi Sen, Motorola, India
Srinivas Shakkottai, Texas A&M University, USA
Wang Wei, ZTE, China
Xue Yang, Intel, USA
Yanyong Zhang, Rutgers University, USA

Program Committee: Distributed Computing Track
Mustaque Ahamad, Georgia Institute of Technology, USA
Hagit Attiya, Technion, Israel
Rida A. Bazzi, Arizona State University, USA
Ken Birman, Cornell University, USA
Pei Cao, Stanford University, USA
Haowen Chan, Carnegie Mellon University, USA
Wei Chen, Microsoft Research Asia, China
Gregory Chockler, IBM Research Haifa Labs, Israel
Jeremy Elson, Microsoft Research, USA
Rui Fan, Technion, Israel
Christof Fetzer, Dresden University of Technology, Germany
Pierre Fraigniaud, CNRS and University of Paris Diderot, France
Seth Gilbert, National University of Singapore, Singapore
Rachid Guerraoui, EPFL, Switzerland
Tim Harris, Microsoft Research, UK
Maurice Herlihy, Brown University, USA
Prasad Jayanti, Dartmouth College, USA
Chip Killian, Purdue University, USA
Arvind Krishnamurthy, University of Washington, USA
Fabian Kuhn, University of Lugano, Switzerland
Zvi Lotker, Ben-Gurion University of the Negev, Israel
Victor Luchangco, Sun Labs, Oracle, USA
Petros Maniatis, Intel Labs Berkeley, USA
Alessia Milani, Université Pierre & Marie Curie, France
Yoram Moses, Technion, Israel
Gopal Pandurangan, Brown University and Nanyang Technological University, Singapore
Sergio Rajsbaum, Universidad Nacional Autónoma de México, Mexico
C. Pandu Rangan, Indian Institute of Technology Madras, India
Andre Schiper, EPFL, Switzerland
Stefan Schmid, T-Labs/TU Berlin, Germany
Neeraj Suri, TU Darmstadt, Germany
Srikanta Tirthapura, Iowa State University, USA
Sam Toueg, University of Toronto, Canada
Mark Tuttle, Intel Corporation, USA
Krishnamurthy Vidyasankar, Memorial University of Newfoundland, Canada
Hakim Weatherspoon, Cornell University, USA

Additional Referees: Networking Track
Rik Sarkar, Kangseok Kim, Maheswaran Sathiamoorthy, Karim El Defrawy, Sangho Oh, Michele Nati, Sung-Hwa Lim, Yi Gai, Tam Vu, Young-June Choi, Jaehyun Kim, Amitabha Ghosh, Giordano Fusco, Ge Zhang, Sanjoy Paul, Aditya Vashistha, Bo Yu, Vijayaraghavan Varadharajan, Ying Chen, Francesco Malandrino, Majed Alresaini, Pralhad Deshpande

Additional Referees: Distributed Computing Track
John Augustine, Ioannis Avramopoulos, Binbin Chen, Atish Das Sarma, Carole Delporte-Gallet, Michael Elkin, Hugues Fauconnier, Danny Hendler, Damien Imbs, Maleq Khan, Huijia Lin, Danupon Nanongkai, Noam Rinetzky, Nuno Santos, Andreas Tielmann, Amitabh Trehan, Maysam Yabandeh

Table of Contents

The Inherent Complexity of Transactional Memory and What to Do about It (Invited Talk)
Hagit Attiya

Sustainable Ecosystems: Enabled by Supply and Demand Management (Invited Talk)
Chandrakant D. Patel

Unclouded Vision (Invited Talk)
Jon Crowcroft, Anil Madhavapeddy, Malte Schwarzkopf, Theodore Hong, and Richard Mortier

Generating Fast Indulgent Algorithms
Dan Alistarh, Seth Gilbert, Rachid Guerraoui, and Corentin Travers

An Efficient Decentralized Algorithm for the Distributed Trigger Counting Problem
Venkatesan T. Chakaravarthy, Anamitra R. Choudhury, Vijay K. Garg, and Yogish Sabharwal

Deterministic Dominating Set Construction in Networks with Bounded Degree
Roy Friedman and Alex Kogan

PathFinder: Efficient Lookups and Efficient Search in Peer-to-Peer Networks
Dirk Bradler, Lachezar Krumov, Max Mühlhäuser, and Jussi Kangasharju

Single-Version STMs Can Be Multi-version Permissive (Extended Abstract)
Hagit Attiya and Eshcar Hillel

Correctness of Concurrent Executions of Closed Nested Transactions in Transactional Memory Systems
Sathya Peri and Krishnamurthy Vidyasankar

Locality-Conscious Lock-Free Linked Lists
Anastasia Braginsky and Erez Petrank

Specification and Constant RMR Algorithm for Phase-Fair Reader-Writer Lock
Vibhor Bhatt and Prasad Jayanti

On the Performance of Distributed Lock-Based Synchronization
Yuval Lubowich and Gadi Taubenfeld

Distributed Generalized Dynamic Barrier Synchronization
Shivali Agarwal, Saurabh Joshi, and Rudrapatna K. Shyamasundar

A High-Level Framework for Distributed Processing of Large-Scale Graphs
Elzbieta Krepska, Thilo Kielmann, Wan Fokkink, and Henri Bal

Affinity Driven Distributed Scheduling Algorithm for Parallel Computations
Ankur Narang, Abhinav Srivastava, Naga Praveen Kumar, and Rudrapatna K. Shyamasundar

Temporal Specifications for Services with Unboundedly Many Passive Clients
Shamimuddin Sheerazuddin

Relating L-Resilience and Wait-Freedom via Hitting Sets
Eli Gafni and Petr Kuznetsov

Load Balanced Scalable Byzantine Agreement through Quorum Building, with Full Information
Valerie King, Steven Lonargan, Jared Saia, and Amitabh Trehan

A Necessary and Sufficient Synchrony Condition for Solving Byzantine Consensus in Symmetric Networks
Olivier Baldellon, Achour Mostéfaoui, and Michel Raynal

GoDisco: Selective Gossip Based Dissemination of Information in Social Community Based Overlays
Anwitaman Datta and Rajesh Sharma

Mining Frequent Subgraphs to Extract Communication Patterns in Data-Centres
Maitreya Natu, Vaishali Sadaphal, Sangameshwar Patil, and Ankit Mehrotra

On the Hardness of Topology Inference
H.B. Acharya and M.G. Gouda

An Algorithm for Traffic Grooming in WDM Mesh Networks Using Dynamic Path Selection Strategy
Sukanta Bhattacharya, Tanmay De, and Ajit Pal

Analysis of a Simple Randomized Protocol to Establish Communication in Bounded Degree Sensor Networks
Bala Kalyanasundaram and Mahendran Velauthapillai

Reliable Networks with Unreliable Sensors
Srikanth Sastry, Tsvetomira Radeva, Jianer Chen, and Jennifer L. Welch

Energy Aware Fault Tolerant Routing in Two-Tiered Sensor Networks
Ataul Bari, Arunita Jaekel, and Subir Bandyopadhyay

Scheduling Randomly-Deployed Heterogeneous Video Sensor Nodes for Reduced Intrusion Detection Time
Congduc Pham

An Integrated Routing and Medium Access Control Framework for Surveillance Networks of Mobile Devices
Nicholas Martin, Yamin Al-Mousa, and Nirmala Shenoy

Security in the Cache and Forward Architecture for the Next Generation Internet
G.C. Hadjichristofi, C.N. Hadjicostis, and D. Raychaudhuri

Characterization of Asymmetry in Low-Power Wireless Links: An Empirical Study
Prasant Misra, Nadeem Ahmed, Diethelm Ostry, and Sanjay Jha

Model Based Bandwidth Scavenging for Device Coexistence in Wireless LANs
Anthony Plummer Jr., Mahmoud Taghizadeh, and Subir Biswas

Minimal Time Broadcasting in Cognitive Radio Networks
Chanaka J. Liyana Arachchige, S. Venkatesan, R. Chandrasekaran, and Neeraj Mittal

Traffic Congestion Estimation in VANETs and Its Application to Information Dissemination
Rayman Preet Singh and Arobinda Gupta

A Tiered Addressing Scheme Based on a Floating Cloud Internetworking Model
Yoshihiro Nozaki, Hasan Tuncer, and Nirmala Shenoy

DHCP Origin Traceback
Saugat Majumdar, Dhananjay Kulkarni, and Chinya V. Ravishankar

A Realistic Framework for Delay-Tolerant Network Routing in Open Terrains with Continuous Churn
Veeramani Mahendran, Sivaraman K. Anirudh, and C. Siva Ram Murthy

Author Index

Invited Paper: The Inherent Complexity of Transactional Memory and What to Do about It

Hagit Attiya
Department of Computer Science, Technion
[email protected]

Abstract. This paper overviews some of the lower bounds on the complexity of implementing software transactional memory, and explains their underlying assumptions. It discusses how these lower bounds align with experimental results and design choices made in existing implementations, indicating that the transactional approach to concurrent programming must compromise either programming simplicity or scalability. It concludes by pointing to several contemporary research avenues that address the challenge of concurrent programming, for example, optimizing coarse-grained techniques, and concurrent programming with mini-transactions, simple atomic operations on a small number of locations.

1 The TM Approach to Concurrent Programming

As anyone with a laptop or an Internet connection (that is, everyone) knows, the multicore revolution is here. Almost any computing appliance contains several processing cores, and the number of cores in servers is in the low teens. With the improved hardware comes the need to harness the power of concurrency, since the processing power of individual cores does not increase. Applications must be restructured in order to reap the benefits of multiple processing units, without paying a costly price for coordination among them.

It has been argued that writing concurrent applications is significantly more challenging than writing sequential ones. Surely, there is a longer history of creating and analyzing sequential code, and this is reflected in undergraduate education. Many programmers are mystified by the intricacies of interaction between multiple processes or threads, and by the need to coordinate and synchronize them.

Transactional memory (TM) has been suggested as a way to deal with this alleged difficulty of writing concurrent applications. In its simplest form, the programmer need only wrap code with operations denoting the beginning and end of a transaction. The transactional memory takes care of synchronizing the shared memory accesses so that each transaction seems to execute sequentially and in isolation.

Originally suggested as a hardware platform by Herlihy and Moss [29], TM resurfaced as a software mechanism a couple of years later. The first software implementation of transactional memory [43] provided, in essence, support for multi-word synchronization operations on a static set of data items, in terms of a unary operation (LL/SC), somewhat optimized over prior implementations, e.g., [46, 9]. Shavit and Touitou coined the term software transactional memory (STM) to describe this implementation.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 1–11, 2011.
© Springer-Verlag Berlin Heidelberg 2011
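To make the programmer-facing interface concrete, here is a minimal sketch of the "wrap code in a transaction" idiom the paragraph describes. The `atomic` class and the account variables are our invented names, not from the paper, and the sketch fakes the TM with a single global lock: it reproduces the interface and the isolation guarantee, but a real TM would synchronize at a much finer grain and could abort and retry transactions.

```python
import threading

# Hypothetical stand-in for a TM runtime: one global lock gives the
# illusion that each wrapped block executes sequentially and in isolation.
_global_lock = threading.RLock()

class atomic:
    def __enter__(self):
        _global_lock.acquire()    # denotes the beginning of a transaction
    def __exit__(self, *exc_info):
        _global_lock.release()    # denotes the end of a transaction
        return False              # do not swallow exceptions

account_a, account_b = [100], [0]

with atomic():                    # the wrapped code is one "transaction"
    account_a[0] -= 30
    account_b[0] += 30            # no thread observes the intermediate state
```

The point of the idiom is that the body contains only the sequential logic; all synchronization is delegated to the runtime behind `atomic`.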


Only when the termination condition was relaxed to obstruction freedom (see Section 2.2) was the first STM handling a dynamic set of data items presented [28]. Work by Rajwar et al., e.g., [37, 42], helped to popularize the TM approach in the programming languages and hardware communities.

2 Formalizing TM

This section outlines how transactional memory can be formally captured. A comprehensive, in-depth treatment can be found in [33]. The model encompasses at least two levels of abstraction: the high level has transactions, each of which is a sequence of operations accessing data items; at the low level, the operations are translated into executions in which a sequence of events applies primitive operations to base objects, containing the data and the meta-data needed for the implementation.

A transaction is a sequence of operations executed by a single process on a set of data items, shared with other transactions. Data items are accessed by read and write operations; some systems also support other operations. The interface also includes try-commit and try-abort operations, in which a transaction requests to commit or abort, respectively. Any of these operations, not just try-abort, may cause the transaction to abort; in this case, we say that the transaction is forcibly aborted. The collection of data items accessed by a transaction is its data set; the items written by the transaction are its write set, and the other items are its read set.

A software implementation of transactional memory (abbreviated STM) provides a data representation for transactions and data items using base objects, and algorithms, specified as primitive operations (abbreviated primitives) on the base objects. These procedures are followed by asynchronous processes in order to execute the operations of transactions. The primitives can be simple reads and writes, but also more sophisticated ones, like CAS or DCAS, typically applied to memory locations, which are the base objects of the implementation. When processes invoke these procedures in an interleaved manner, we obtain executions, in the standard sense of asynchronous distributed computing (cf. [8]).
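The data-set terminology can be illustrated with a small sketch (the function name is ours, not from the paper) that derives a transaction's data set, read set, and write set from its operation sequence:

```python
def classify(ops):
    """Split a transaction's accesses into the sets defined in the text.

    `ops` is the transaction's operation sequence, e.g. ('read', 'x') or
    ('write', 'y'). The write set contains the items written; the read set
    contains the remaining accessed items; the data set is their union.
    """
    data_set = {item for _, item in ops}
    write_set = {item for kind, item in ops if kind == 'write'}
    read_set = data_set - write_set
    return data_set, read_set, write_set

# A transaction that reads x, then writes y, then reads y back:
data, reads, writes = classify([('read', 'x'), ('write', 'y'), ('read', 'y')])
```

Note that an item which is both read and written, like y here, belongs to the write set only, since the read set is defined as the remaining accessed items.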
Executions consist of configurations, describing a complete state of the system, and events, describing a single step by an individual process, including an application of a single primitive to base objects (possibly several objects, e.g., in the case of DCAS).

The interval of a transaction T is the execution interval that starts at the first event of T and ends at the last event of T. If T does not have a last event in the execution, then the interval of T is the (possibly infinite) execution interval starting at the first event of T. Two transactions overlap if their intervals overlap.

2.1 Safety: Consistency Properties of TM

An STM is serializable if transactions appear to execute sequentially, one after the other [39]. An STM is strictly serializable if the serialization order preserves the order
of non-overlapping transactions [39]. This notion is called order-preserving serializability in [47], and is the analogue of linearizability [31] for transactions.¹ Opacity [23] further demands that even partially executed transactions, which may later abort, must be serializable (in an order-preserving manner). Opacity also accommodates operations beyond read and write.

While opacity is a stronger condition than serializability, snapshot isolation [10] is a consistency condition weaker than serializability. Roughly stated, snapshot isolation ensures that all read operations in a transaction return the most recent value as of the time the transaction starts; the write sets of concurrent transactions must be disjoint. (Cf. [47, Definition 10.3].)

2.2 Progress: Termination Guarantees for TM

One of the innovations of TM is in allowing transactions not to commit when they are faced with conflicting transactions. This, however, admits trivial implementations in which no progress is ever made. Finding the right balance between nontriviality and efficiency has led to several progress properties. They are first and foremost distinguished by whether locking is accommodated or not.

When locks are not allowed, the strongest requirement, rarely provided, is wait-freedom, namely, that each transaction has to eventually commit. A weaker property ensures that some transaction eventually commits, or that a transaction commits when running by itself. The last property is called obstruction-freedom [28] (see further discussion in [3]). A lock-based STM (e.g., TL2 [16]) is often required to be weakly progressive [24], namely, a transaction that does not encounter conflicts must commit.

Several lower bounds assume a minimal progress property, ensuring that a transaction terminates successfully if it runs alone, from a situation in which no other transaction is pending. This property encompasses both obstruction freedom and weak progressiveness.
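A toy commit procedure makes weak progressiveness tangible. The sketch below is ours, loosely in the spirit of a lock-based STM such as TL2 (all class and method names are invented): each item carries a version number and a per-item lock, writes are buffered, and the read set is validated at commit time. A transaction that encounters no conflicts commits; one that finds a locked item or a changed version is forcibly aborted. Reads are not validated at read time, so the sketch targets only serializability of committed transactions, not opacity.

```python
import threading

class Item:
    """A data item with a value, a version number, and a per-item lock."""
    def __init__(self, value=0):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()

class Aborted(Exception):
    """Raised when the transaction is forcibly aborted."""

class Transaction:
    def __init__(self):
        self.read_set = {}    # Item -> version observed when it was read
        self.write_set = {}   # Item -> tentative value, buffered until commit

    def read(self, item):
        if item in self.write_set:          # read-your-own-writes
            return self.write_set[item]
        self.read_set[item] = item.version
        return item.value

    def write(self, item, value):
        self.write_set[item] = value        # deferred update

    def try_commit(self):
        # Lock the write set in a canonical order to avoid deadlock.
        held = []
        try:
            for item in sorted(self.write_set, key=id):
                if not item.lock.acquire(blocking=False):
                    raise Aborted()         # conflict: item locked by another
                held.append(item)
            # Validate the read set: abort if any item changed since read.
            for item, version in self.read_set.items():
                if item.version != version:
                    raise Aborted()
            # No conflicts encountered: weak progressiveness says commit.
            for item, value in self.write_set.items():
                item.value = value
                item.version += 1
        finally:
            for item in held:
                item.lock.release()
```

A solo transaction always commits here, matching the minimal progress property; a transaction whose read set was overwritten by a concurrent commit aborts at validation.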
Related definitions [34, 20, 24] further attempt to capture the distinction between aborts that are necessary in order to maintain the safety properties (e.g., opacity) and spurious aborts that are not mandated by the consistency property, and to measure their ratio. Strong progressiveness [24] ensures that even when there are conflicts, some transaction commits. More specifically, an STM is strongly progressive if a transaction without nontrivial conflicts is not forcibly aborted, and if a set of transactions has nontrivial conflicts on a single item, then not all of them are forcibly aborted. (Recall that a transaction is forcibly aborted when the abort was not requested by a try-abort operation of the transaction, i.e., the abort is in response to try-commit, read, or write operations.) Another definition [40] says that an STM is multi-version (MV)-permissive if a transaction is forcibly aborted only if it is an update transaction that has a nontrivial conflict with another update transaction.

Linearizability, like sequential consistency [36], talks about implementing abstract data structures, and hence they involve one abstraction—from the high-level operations of the data structure to the low level primitives. It also provides the semantics of the operations, and their expected results at the high-level, on the data structure.


H. Attiya

Strong progressiveness and MV-permissiveness are incomparable: the former allows a read-only transaction to abort if it has a conflict with another update transaction, while the latter does not guarantee that at least one transaction is not forcibly aborted in case of a conflict. Strictly speaking, these properties are not liveness properties in the classical sense [35], since they can be checked in finite executions.

2.3 Predicting Performance

There have been some theoretical attempts to predict how well TM implementations will scale, resulting in definitions that postulate behaviors expected to yield superior performance.

Disjoint-access parallelism. The most accepted such notion is disjoint-access parallelism, capturing the requirement that unrelated transactions progress independently, even if they occur at the same time. That is, an implementation should not cause two transactions, which are unrelated at the high level, to simultaneously access the same low-level shared memory. We explain what it means for two transactions to be unrelated through a conflict graph that represents the relations between transactions. The conflict graph of an execution interval I is an undirected graph, where vertices represent transactions and edges connect transactions that share a data item. Two transactions T1 and T2 are disjoint access if there is no path between the vertices representing them in the conflict graph of their execution intervals; they are strictly disjoint access if there is no edge between these vertices. Two events contend on a base object o if they both access o, and at least one of them applies a non-trivial primitive to o.2 Transactions concurrently contend on a base object o if they have pending events at the same configuration that contend on o.

Property 1 (Disjoint-access parallelism). An STM implementation is disjoint-access parallel if two transactions concurrently contend on the same base object only if they are not disjoint access.
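The conflict-graph definitions above can be made concrete. A small sketch (our own construction, not from the cited papers) that builds the graph from the transactions' data sets and decides disjoint access by path search:

```python
# Sketch of the conflict graph used to define disjoint access: vertices are
# transactions, edges join transactions sharing a data item; two transactions
# are disjoint access iff no path connects them. All names here are ours.

from collections import defaultdict, deque

def conflict_graph(data_sets):
    """data_sets: dict transaction -> set of data items. Returns adjacency sets."""
    adj = defaultdict(set)
    txns = list(data_sets)
    for i, t1 in enumerate(txns):
        for t2 in txns[i + 1:]:
            if data_sets[t1] & data_sets[t2]:   # shared item => edge
                adj[t1].add(t2)
                adj[t2].add(t1)
    return adj

def disjoint_access(t1, t2, data_sets):
    """True iff no path joins t1 and t2 in the conflict graph (BFS)."""
    adj = conflict_graph(data_sets)
    seen, queue = {t1}, deque([t1])
    while queue:
        t = queue.popleft()
        if t == t2:
            return False
        for nxt in adj[t] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return True

ds = {"T1": {"x"}, "T2": {"y"}, "T3": {"x", "y"}}
print(disjoint_access("T1", "T2", ds))  # False: connected through T3
print(disjoint_access("T1", "T2", {"T1": {"x"}, "T2": {"y"}}))  # True
```

Strictly disjoint access would instead check only for a direct edge, i.e., `t2 in conflict_graph(data_sets)[t1]`.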
This definition captures the first condition of the disjoint-access parallelism property of Israeli and Rappoport [32], in accordance with most of the literature (cf. [30]). It is somewhat weaker, as it allows two processes to apply a trivial primitive on the same base object, e.g., read, even when executing disjoint-access transactions. Moreover, this definition only prohibits concurrent contending accesses, allowing transactions to contend on a base object o at different points of the execution. The original disjoint-access parallelism definition [32] also restricts the impact of concurrent transactions on the step complexity of a transaction. For more precise definitions and discussion, see [7].

2 A primitive is non-trivial if it may change the value of the object, e.g., a write or CAS; otherwise, it is trivial, e.g., a read.

The Inherent Complexity of Transactional Memory and What to Do about It


Invisible reads. It is expected that many typical applications will generate workloads that include a significant portion of read-only transactions. This includes, for example, transactions that search a data structure to find whether it contains a particular data item. Many STMs attempt to optimize read-only transactions and, more generally, the implementation of read operations inside a transaction. By their very nature, read operations, and even more so read-only transactions, need not leave a mark on the shared memory; it is therefore desirable to avoid writing in such transactions, i.e., to make sure that reads are invisible, and certainly that read-only transactions do not write at all.

Remark 1. Some authors [15] refer to a transaction as having invisible reads even if it writes, as long as the information written is not sufficiently detailed to supply the exact details of the transaction's data set. (In their words, "the STM does not know which, or even how many, readers are accessing a given memory location.") This behavior is captured by the stronger notion of an oblivious STM [5].

3 Some Lower Bound Results

This section overviews some of the recent work on showing the inherent complexity of TM. This includes a few impossibility results showing that certain properties simply cannot be achieved by a TM, and several worst-case lower bounds showing that other properties put a high price on the TM, often in terms of the number of steps that must be performed, or as bounds on the local computation involved. The rest of the section mentions some of these results.

3.1 The Cost of Validation

A very interesting result shows the additional cost of opacity over serializability, namely, of making sure that the values read by a transaction are consistent while it is in progress (and not just at commit time, as done in many database implementations). Guerraoui and Kapalka [23] showed that the number of steps in a read operation is linear in the size of the invoking transaction's read set, assuming that reads are invisible, the STM keeps only a single version of each data item, and the STM is progressive (i.e., it never aborts a transaction unless it conflicts with another pending transaction). In contrast, when only serializability is guaranteed, the values read need only be validated at commit time, leading to significant savings.

3.2 The Consensus Number of TM

It has been shown that lock-based and obstruction-free TMs can solve consensus for at most two processes [22], that is, their consensus number [26] is 2. An intermediate step shows that such TMs are equivalent to shared objects that fail in a very clean manner [3]. Roughly speaking, this is a consensus object providing a familiar propose operation, allowing a thread to provide an input and wait for a unanimous decision value; however, the propose operation may return a definite fail indication, which ensures that the proposed value will not be decided upon. Intuitively, an aborted transaction corresponds to a propose operation returning false.
To get the full result, further mechanisms are needed to handle the long-lived nature of a transactional memory.
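The validation cost discussed in Sect. 3.1 can be illustrated with a toy sketch: an invisible-read, single-version STM that guarantees a consistent view by re-validating the entire read set on every read, so the k-th read costs O(k) steps. This is an illustration, not the construction of [23]; all names are ours:

```python
# Illustrative sketch (not any specific STM): reads are invisible and each
# data item keeps a single (value, version) pair; every read re-validates the
# whole read set, making the k-th read linear in the read-set size.

class AbortException(Exception):
    pass

class Transaction:
    def __init__(self, memory):
        self.memory = memory          # shared dict: item -> (value, version)
        self.read_set = {}            # item -> version observed by this txn

    def read(self, item):
        value, version = self.memory[item]
        # Validate every previously read item: O(|read set|) steps per read.
        for prev, seen_version in self.read_set.items():
            if self.memory[prev][1] != seen_version:
                raise AbortException("inconsistent snapshot")
        self.read_set[item] = version
        return value

mem = {"x": (1, 0), "y": (2, 0)}
t = Transaction(mem)
assert t.read("x") == 1
mem["x"] = (5, 1)                    # a concurrent writer updates x
try:
    t.read("y")                      # validation detects the stale read of x
except AbortException:
    print("aborted")
```

Under serializability alone, the validation loop could instead run once, at commit time.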


3.3 Achieving Disjoint-Access Parallelism

Guerraoui and Kapalka [22] prove that obstruction-free implementations of software transactional memory cannot ensure strict disjoint-access parallelism. This property requires transactions with disjoint data sets not to access a common base object. This notion is stronger than disjoint-access parallelism (Property 1), which allows two transactions with disjoint data sets to access the same base objects, provided they are connected via other transactions. Note that the lower bound does not hold under this more standard notion, as Herlihy et al. [28] present an obstruction-free and disjoint-access parallel STM. For the stronger case of wait-free read-only transactions, the assumption of strict disjoint-access parallelism can be replaced with the assumption that read-only transactions are invisible. We have proved [7] that an STM cannot be disjoint-access parallel and have invisible read-only transactions that always terminate successfully. A read-only transaction not only has to write, but the number of writes is linear in the size of its read set. Both results hold for strict serializability, and hence also for opacity. With a slight modification of the notion of disjoint-access parallelism, they also hold for serializability and snapshot isolation.

3.4 Privatization

An important goal for STM is to allow certain items to be accessed by simple reads and writes, without paying the overhead of the transactional memory. It has been shown [21] that, in many cases, this cannot be achieved without prior privatization [45, 44], namely, invoking a privatization transaction, or some other kind of privatizing barrier [15]. We have recently proved [5] that, unless parallelism (in terms of progressiveness) is greatly compromised or detailed information about non-conflicting transactions is tracked (i.e., the STM is not oblivious), the privatization cost must be linear in the number of items that are privatized.
3.5 Avoiding Aborts

It has been shown [34] that an opaque, strongly progressive STM requires NP-complete local computation, while a weaker, online notion requires visible reads.

4 Interlude: How Well Does TM Work in Practice?

Collectively, the results described here demonstrate that TM faces significant limitations: it cannot provide clean semantics without weakening the consistency semantics or compromising the progress guarantees. The implementations are also significantly limited in their scalability. Finally, it is not clear how expressive the programming idiom they provide is (since their consensus number is only two). One might argue that these are just theoretical results, which anyway (mostly) describe only the worst case, so in practice we are just fine. However, while the results


are mostly stated for the worst case, these are often not corner cases, unlikely to happen in practice, but natural cases, representative of typical scenarios. Moreover, it is difficult to design an STM that behaves differently in different scenarios, or to expose these specific scenarios to the programmer through intricate guarantees. There is evidence that implementation-focused research has also been hitting a similar wall [11]. Design choices made in existing TMs, whether in hardware or in software, compromise either the claimed simplicity of the model (e.g., elastic transactions [19]) or its transparency and generality (e.g., transactional boosting [27]). Alternatively, they settle for reduced scalability, weaker progress guarantees, or lower performance.

5 Concurrent Programming in a Post-TM Era

The TM approach "infantilizes" programmers, telling them that the TM will take care of making sure their programs run correctly and efficiently, even in a concurrent setting. Given that this approach may not be able to deliver the promised combination of efficiency and programming simplicity, and that it must expose many of the complications of consistency or progress guarantees, perhaps we should stop sheltering the programmer from the reality of concurrency? It might be possible to expose a cleaner model of a multi-core system to programmers, while providing them with better methodologies, tools and programming patterns that simplify the design of concurrent code, without hiding its tradeoffs. It is my belief that a multitude of approaches should be proposed, besides TM, catering to different needs and setups. This section mentions two, somewhat complementary, approaches to alleviating the difficulty of designing concurrent applications.

5.1 Optimizing Coarse-Grain Programming

For small-scale applications, or with a moderate amount of contention for the data, the overhead of managing the memory might outweigh the cost of delays due to synchronization [17]. In such situations, it might be simpler to rely on coarse-grained synchronization, that is, to design applications in which shared data is mostly accessed "in exclusion". This does not mean a return to simplistic programming with critical sections and mutexes. Instead, it recommends the use of novel methods that have several processes compete for the lock and then, to avoid additional contention, have the lock holder carry out all (or many of) the pending operations on the data [25]. For non-locking algorithms, this can be seen as a return to Herlihy's universal construction [26], somewhat optimized to improve memory utilization [13].
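The lock-holder-serves-pending-operations pattern of Sect. 5.1, in the spirit of flat combining [25], can be sketched as follows. The class and field names are our own, and this toy omits the optimizations of the real algorithm:

```python
# A toy sketch in the spirit of flat combining [25]: threads publish requests,
# and whichever thread acquires the lock applies all pending requests to the
# sequential structure on their behalf. Structure and names are ours.

import threading

class CombiningCounter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()            # guards self.value
        self.pending = []                       # published (delta, done) records
        self.pending_lock = threading.Lock()    # guards the publication list

    def add(self, delta):
        done = threading.Event()
        with self.pending_lock:
            self.pending.append((delta, done))
        while not done.is_set():
            # Try to become the combiner; otherwise wait for one to serve us.
            if self.lock.acquire(blocking=False):
                try:
                    with self.pending_lock:
                        batch, self.pending = self.pending, []
                    for d, ev in batch:         # apply every pending op in one pass
                        self.value += d
                        ev.set()
                finally:
                    self.lock.release()
            else:
                done.wait(timeout=0.001)        # brief wait, then retry

c = CombiningCounter()
threads = [threading.Thread(target=c.add, args=(1,)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(c.value)  # 8
```

The point of the pattern is that contention moves from the data structure to a single lock, and the combiner amortizes lock traffic over a whole batch of operations.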
5.2 Programming with Mini-transactions A complementary approach is motivated by the observation that many of the lower bounds rely on the fact that TM must deal with large, unpredictable (dynamic) data sets,


accessed with an arbitrary set of operations, and interleaved with generic calculations (including I/O). What if the TM had to deal only with short transactions, with simple functionality and small, known-in-advance (static) data sets, to which only simple arithmetic, comparison, and memory access operations are applied? My claim is that such mini-transactions could greatly alleviate the burden of concurrent programming, while still allowing efficient implementation. It is obvious that mini-transactions avoid many of the costs indicated by the lower bounds and complexity results for TM, because they are so restricted. Looking from an implementation perspective, mini-transactions are a design choice that simplifies and improves the performance of TM. Indeed, they are already almost provided by recent hardware TM proposals from AMD [2] and Sun [12]. The support is best-effort in nature, since, in addition to data conflicts, transactions can be aborted for other reasons, for example, TLB misses, interrupts, certain function-call sequences and instructions like division [38].

Mini-transactions. Mini-transactions are a simple extension of DCAS, or its extension to k-CAS with small values of k, e.g., 3-CAS or 4-CAS. In fact, mini-transactions are a natural, multi-location variant of the LL/SC pair supported in IBM's PowerPC [1] and DEC Alpha [18]. A mini-transaction works on a small and, if possible, static data set, and applies simple functionality, without I/O, out-of-core memory accesses, etc. It is supposed to be short, in order to ensure success. Yet, even if all these conditions are satisfied, the application should be prepared to deal with spurious failures without violating the integrity of the data. An important issue is to allow "native" (uninstrumented) access to the locations accessed by mini-transactions, through a clean, implicit mechanism. Thus, they are subject to concerns similar to those arising when privatizing transactional data.

Algorithmic challenges.
Mini-transactions can provide a significant handle on the difficult task of writing concurrent applications, based on our experience of leveraging even the fairly restricted DCAS [6, 4], and others' experience in utilizing recent hardware TM support [14]. Nevertheless, using mini-transactions still leaves several algorithmic challenges. The first, already discussed above, is the design of algorithms accommodating the best-effort nature of mini-transactions. Another is to deal with their limited arity, i.e., the small data set, in a systematic manner. An interesting approach could be to understand how mini-transactions can support the needs of amorphous data parallelism [41]. Finally, even with convenient synchronization of accesses to several locations, it is still necessary to find ways to exploit the parallelism, by having threads make progress on their individual tasks without interfering with each other, while helping each other as necessary. Part of the challenge is to span the full spectrum: from virtually sequential situations, in which threads operate almost in isolation from each other, all the way to highly parallel situations, where many concurrent threads should be harnessed to perform work efficiently, rather than slowing progress due to high contention.
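As a toy illustration, a mini-transaction in the k-CAS style can be simulated with a single lock, with spurious failures injected to mimic best-effort hardware; everything here (names, the failure rate, the retry pattern) is our own illustrative choice, not a hardware interface:

```python
# Illustrative sketch of a mini-transaction as a k-word compare-and-swap,
# simulated with one lock. It may fail spuriously (like best-effort hardware
# TM), so callers must be written as retry loops that preserve data integrity.

import random
import threading

class Memory:
    def __init__(self, size):
        self.cells = [0] * size
        self.lock = threading.Lock()

    def k_cas(self, addresses, expected, new):
        """Atomically: if all cells hold their expected values, install the new
        values and return True. May also return False spuriously."""
        if random.random() < 0.01:          # simulated spurious abort
            return False
        with self.lock:
            if any(self.cells[a] != e for a, e in zip(addresses, expected)):
                return False
            for a, n in zip(addresses, new):
                self.cells[a] = n
            return True

def transfer(mem, src, dst, amount):
    """Move `amount` from cell src to cell dst, retrying on any failure."""
    while True:
        s, d = mem.cells[src], mem.cells[dst]
        if mem.k_cas([src, dst], [s, d], [s - amount, d + amount]):
            return

mem = Memory(2)
mem.cells = [100, 0]
transfer(mem, 0, 1, 30)
print(mem.cells)  # [70, 30]
```

The `transfer` loop shows the programming discipline the text calls for: snapshot, attempt the multi-word update, and retry on either a real conflict or a spurious failure.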


6 Summary

This paper describes recent research on formalizing transactional memory and exploring its inherent limitations. It suggests ways to facilitate the design of efficient and correct concurrent applications in the post-TM era, while still capitalizing on the lessons learned in designing TM, and on the wide interest it generated. The least explored of them is the design of algorithms and programming patterns that accommodate best-effort mini-transactions in a way that does not compromise safety, and guarantees liveness in an eventual sense.

Acknowledgements. I have benefited from discussions about concurrent programming and transactional memory with many people, but would like to especially thank my Ph.D. student Eshcar Hillel for many illuminating discussions and comments on this paper. Part of this work was done while the author was on sabbatical at EPFL. The author is supported in part by the Israel Science Foundation (grants 953/06 and 1227/10).

References

1. PowerPC Microprocessor Family: The Programming Environment (1991)
2. Advanced Micro Devices, Inc.: Advanced Synchronization Facility - Proposed Architectural Specification, 2.1 edn. (March 2009)
3. Attiya, H., Guerraoui, R., Hendler, D., Kuznetsov, P.: The complexity of obstruction-free implementations. J. ACM 56(4) (2009)
4. Attiya, H., Hillel, E.: Built-in coloring for highly-concurrent doubly-linked lists. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 31–45. Springer, Heidelberg (2006)
5. Attiya, H., Hillel, E.: The cost of privatization. In: Lynch, N.A., Shvartsman, A.A. (eds.) Distributed Computing. LNCS, vol. 6343, pp. 35–49. Springer, Heidelberg (2010)
6. Attiya, H., Hillel, E.: Highly-concurrent multi-word synchronization. In: Rao, S., Chatterjee, M., Jayanti, P., Murthy, C.S.R., Saha, S.K. (eds.) ICDCN 2008. LNCS, vol. 4904, pp. 112–123. Springer, Heidelberg (2008)
7. Attiya, H., Hillel, E., Milani, A.: Inherent limitations on disjoint-access parallel implementations of transactional memory. In: SPAA 2009 (2009)
8. Attiya, H., Welch, J.L.: Distributed Computing: Fundamentals, Simulations and Advanced Topics, 2nd edn. Wiley, Chichester (2004)
9. Barnes, G.: A method for implementing lock-free shared-data structures. In: SPAA 1993, pp. 261–270 (1993)
10. Berenson, H., Bernstein, P., Gray, J., Melton, J., O'Neil, E., O'Neil, P.: A critique of ANSI SQL isolation levels. In: SIGMOD 1995, pp. 1–10 (1995)
11. Cascaval, C., Blundell, C., Michael, M., Cain, H.W., Wu, P., Chiras, S., Chatterjee, S.: Software transactional memory: why is it only a research toy? Commun. ACM 51(11), 40–46 (2008)
12. Chaudhry, S., Cypher, R., Ekman, M., Karlsson, M., Landin, A., Yip, S., Zeffer, H., Tremblay, M.: Rock: A high-performance SPARC CMT processor. IEEE Micro 29(2), 6–16 (2009)
13. Chuong, P., Ellen, F., Ramachandran, V.: A universal construction for wait-free transaction friendly data structures. In: SPAA 2010, pp. 335–344 (2010)


14. Dice, D., Lev, Y., Marathe, V., Moir, M., Olszewski, M., Nussbaum, D.: Simplifying concurrent algorithms by exploiting hardware TM. In: SPAA 2010, pp. 325–334 (2010)
15. Dice, D., Matveev, A., Shavit, N.: Implicit privatization using private transactions. In: Transact 2010 (2010)
16. Dice, D., Shalev, O., Shavit, N.: Transactional locking II. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 194–208. Springer, Heidelberg (2006)
17. Dice, D., Shavit, N.: What really makes transactions fast? In: Transact 2006 (2006)
18. Digital Equipment Corporation: Alpha Architecture Handbook (1992)
19. Felber, P., Gramoli, V., Guerraoui, R.: Elastic transactions. In: Lynch, N.A., Shvartsman, A.A. (eds.) Distributed Computing. LNCS, vol. 6343, pp. 93–107. Springer, Heidelberg (2010)
20. Gramoli, V., Harmanci, D., Felber, P.: Towards a theory of input acceptance for transactional memories. In: Baker, T.P., Bui, A., Tixeuil, S. (eds.) OPODIS 2008. LNCS, vol. 5401, pp. 527–533. Springer, Heidelberg (2008)
21. Guerraoui, R., Henzinger, T., Kapalka, M., Singh, V.: Transactions in the jungle. In: SPAA 2010, pp. 263–272 (2010)
22. Guerraoui, R., Kapalka, M.: On obstruction-free transactions. In: SPAA 2008, pp. 304–313 (2008)
23. Guerraoui, R., Kapalka, M.: On the correctness of transactional memory. In: PPoPP 2008, pp. 175–184 (2008)
24. Guerraoui, R., Kapalka, M.: The semantics of progress in lock-based transactional memory. In: POPL 2009, pp. 404–415 (2009)
25. Hendler, D., Incze, I., Shavit, N., Tzafrir, M.: Flat combining and the synchronization-parallelism tradeoff. In: SPAA 2010, pp. 355–364 (2010)
26. Herlihy, M.: Wait-free synchronization. ACM Trans. Program. Lang. Syst. 13(1), 124–149 (1991)
27. Herlihy, M., Koskinen, E.: Transactional boosting: a methodology for highly-concurrent transactional objects. In: PPoPP 2008, pp. 207–216 (2008)
28. Herlihy, M., Luchangco, V., Moir, M., Scherer III, W.N.: Software transactional memory for dynamic-sized data structures. In: PODC 2003, pp. 92–101 (2003)
29. Herlihy, M., Moss, J.E.B.: Transactional memory: Architectural support for lock-free data structures. In: ISCA 1993 (1993)
30. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann, San Francisco (2008)
31. Herlihy, M.P., Wing, J.M.: Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12(3), 463–492 (1990)
32. Israeli, A., Rappoport, L.: Disjoint-access-parallel implementations of strong shared memory primitives. In: PODC 1994, pp. 151–160 (1994)
33. Kapalka, M.: Theory of Transactional Memory. Ph.D. thesis, Nr. 4664, EPFL (2010)
34. Keidar, I., Perelman, D.: On avoiding spare aborts in transactional memory. In: SPAA 2009, pp. 59–68 (2009)
35. Lamport, L.: Proving the correctness of multiprocess programs. IEEE Transactions on Software Engineering SE-3(2), 125–143 (1977)
36. Lamport, L.: How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers C-28(9), 690–691 (1979)
37. Larus, J.R., Rajwar, R.: Transactional Memory. Morgan and Claypool, San Francisco (2007)
38. Moir, M., Moore, K., Nussbaum, D.: The adaptive transactional memory test platform: A tool for experimenting with transactional code for Rock. In: Transact 2008 (2008)
39. Papadimitriou, C.H.: The serializability of concurrent database updates. J. ACM 26(4), 631–653 (1979)


40. Perelman, D., Fan, R., Keidar, I.: On maintaining multiple versions in STM. In: PODC 2010, pp. 16–25 (2010)
41. Pingali, K., Kulkarni, M., Nguyen, D., Burtscher, M., Mendez-Lojo, M., Prountzos, D., Sui, X., Zhong, Z.: Amorphous data-parallelism in irregular algorithms. Technical Report TR-0905, The University of Texas at Austin, Department of Computer Sciences (2009)
42. Rajwar, R., Goodman, J.R.: Transactional lock-free execution of lock-based programs. In: ASPLOS 2002, pp. 5–17 (2002)
43. Shavit, N., Touitou, D.: Software transactional memory. In: PODC 1995, pp. 204–213 (1995)
44. Shpeisman, T., Menon, V., Adl-Tabatabai, A.-R., Balensiefer, S., Grossman, D., Hudson, R.L., Moore, K.F., Saha, B.: Enforcing isolation and ordering in STM. SIGPLAN Not. 42(6), 78–88 (2007)
45. Spear, M.F., Marathe, V.J., Dalessandro, L., Scott, M.L.: Privatization techniques for software transactional memory. Technical Report TR 915, Dept. of Computer Science, Univ. of Rochester (2007)
46. Turek, J., Shasha, D., Prakash, S.: Locking without blocking: making lock based concurrent data structure algorithms nonblocking. In: PODS 1992, pp. 212–222 (1992)
47. Weikum, G., Vossen, G.: Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann, San Francisco (2001)

Sustainable Ecosystems: Enabled by Supply and Demand Management

Chandrakant D. Patel, IEEE Fellow

Hewlett Packard Laboratories, Palo Alto, CA 94304, USA
[email protected]

Abstract. Continued population growth, coupled with increased per capita consumption of resources, poses a challenge to the quality of life of current and future generations. We cannot expect to meet the future needs of society simply by extending existing infrastructures. The necessary transformation can be enabled by a sustainable IT ecosystem made up of billions of service-oriented client devices and thousands of data centers. The IT ecosystem, with data centers at its core and pervasive measurement at the edges, will need to be seamlessly integrated into future communities to enable need-based provisioning of critical resources. Such a transformation requires a systemic approach based on supply and demand of resources. A supply-side perspective necessitates using local resources of available energy, alongside design and management that minimizes the energy required to extract, manufacture, mitigate waste, transport, operate and reclaim components. The demand-side perspective requires provisioning resources based on the needs of the user by using flexible building blocks, pervasive sensing, communications, knowledge discovery and policy-based control. This paper presents a systemic framework for supply-demand management in IT, in particular for building sustainable data centers, and suggests how the approach can be extended to manage resources at the scale of urban infrastructures.

Keywords: available energy, exergy, energy, data center, IT, sustainable, ecosystems, sustainability.

1 Introduction

1.1 Motivation

Environmental sustainability has gained great mindshare. Actions and behaviors are often classified as either "green" or "not green" using a variety of metrics. Many of today's "green" actions are based on products that are already built but classified as "environmentally friendly" based on greenhouse gas emission and energy consumption in the use phase. Such compliance-time thinking lacks a sustainability framework that could holistically address global challenges associated with resource consumption. These resource consumption challenges will stem from various drivers. The world population is expected to reach 9 billion by 2050 [1]. How do we deal with the increasing strain that economic growth is placing on our dwindling natural

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 12–28, 2011. © Springer-Verlag Berlin Heidelberg 2011


resources? Can we expect to meet the needs of society by solely relying on replicating and extending the existing physical infrastructure to cope with economic and population growth? Indeed, anecdotal evidence of the strain that society is placing on the supply side, the resources used for goods and services, is apparent: rising prices for critical materials, such as copper and steel; the dramatic reduction in output of the Pemex Cantarell oil field in Mexico, one of the largest in the world; and limitations in city-scale waste disposal. Furthermore, a rise in the price of fuel has led to an inflationary impact that could threaten the quality of life of billions. Thus, the depletion of limited natural resources and increases in the cost of basic goods necessitate new business models and infrastructures that are designed, built and operated using the least possible amount of appropriate materials and energy. The supply side must be considered together with the societal demand for resources. This paper presents a holistic framework for sustainable design and management. Unlike past work that has mostly focused on operational energy considerations of devices, this contribution weaves lifecycle implications into a broader supply-demand framework. The following are the salient contributions:

• Use of available energy (also called exergy) from the 2nd law of thermodynamics as a metric for quantifying sustainability.
• Formulation of a supply-demand framework based on available energy.
• Application of this framework to IT, in particular to data centers.
• Extension of the framework to other ecosystems such as cities.

1.2 Role of the IT Ecosystem

Consider the information technology (IT) ecosystem made up of billions of service-oriented client devices, thousands of data centers and digital print factories. As shown in Figure 1, the IT ecosystem and other human-managed ecosystems, such as transportation, waste management, power delivery and industrial systems, draw from a pool of available energy. In this context, IT has the opportunity to change existing business models and deliver a net positive impact with respect to the consumption of available energy. To do so, the sustainability of the IT ecosystem itself must be addressed holistically. Given a sustainable IT ecosystem, imagine the scale of impact when billions in growth economies like India utilize IT services to conduct transactions such as purchasing railway tickets, banking, availing healthcare, government services, etc. As the billions board the IT bus, and shun other business models, such as ones that require the use of physical transportation means like an auto-rickshaw to go to the train station to buy tickets, the net reduction in the consumption of available energy can be significant. Indeed, when one overlays a scenario where everything will be delivered as a service, a picture emerges of billions of end users utilizing trillions of applications through a cloud of networked data centers. However, to reach the desired price point where such services will be feasible, especially in emerging economies, where Internet access is desired at approximately US $1 per month, the total cost of ownership (TCO) of the physical infrastructure that supports the cloud will need to be revisited. There are about 81 million Internet connections in India [2]. There has been progress in reducing the cost of access devices [3], but the cost to avail services still needs to be addressed. In this regard, without addressing the cost of data centers, the foundation for services to the masses, scaling to billions of users is not possible.


C.D. Patel

Fig. 1. Consumption of Available Energy

With respect to data centers, prior work has shown that a significant fraction of the TCO comes from the recurring energy consumed in the operation of the data center, and from the burdened capital expenditures associated with the supporting physical infrastructure [4]. The burdened cost of power and cooling, inclusive of redundancy, is estimated to be 25% to 30% of the total cost of ownership in typical enterprise data centers [4]. These power and cooling infrastructure costs may match, or even exceed, the cost of the IT hardware within the data center. Thus, including the cost of IT hardware, over half of the TCO in a typical data center is associated with the design and management of the physical infrastructure. For Internet service providers, with thinner layers of software and licensing costs, the physical infrastructure could be responsible for as much as 75% of the TCO. Conventional approaches to building data centers, with multiple levels of redundancy and excessive material, an "always-on" mantra with no regard to service level agreements, and a lack of dynamic provisioning of resources, lead to excessive overprovisioning and cost. Therefore, cost reduction requires an end-to-end approach that delivers least-materials, least-energy data centers. Indeed, contrary to the oft-held view of sustainability as "paying more to be green", minimizing the overall lifecycle available energy consumption, and thereby building sustainable data centers, leads to the lowest cost data centers.

2 Available Energy or Exergy as a Metric

2.1 Exergy

IT and other ecosystems draw from a pool of available energy, as shown in Figure 1. Available energy, also called exergy, refers to energy that is available for performing work [5]. While energy refers to the quantity of energy, exergy quantifies the useful portion (or "quality") of energy. As an example, in a vehicle, the combustion of a


given mass of fuel such as diesel results in propulsion of the vehicle (useful work done), dissipation of heat energy, and a waste stream of exhaust gases at a given temperature. By the first law of thermodynamics, the quantity of energy is conserved in the combustion process: the sum of the energy in the products equals that in the fuel. However, by the 2nd law of thermodynamics, the usefulness of the energy was destroyed, since not much useful work can be harnessed from the waste streams, e.g., the exhaust gases. One can also state that the combustion of fuel resulted in an increase of entropy, or disorder, in the universe, going from a more ordered state in the fuel to a less ordered state in the waste streams. As all processes result in an increase in entropy, and a consequent destruction of exergy due to entropy generation, minimizing the destruction of exergy is an important sustainability consideration. From a holistic supply-demand point of view, one can say that we are drawing from a finite pool of available energy, and minimizing the destruction of available energy is key for future generations to enjoy the same quality of life as the current generation. With respect to making the most of available energy, it is also important to understand and avail opportunities in extracting available energy from waste streams. Indeed, it is instructive to examine the combustion example further to understand the exergy content of waste streams. Classical thermodynamics dictates the upper limit of the work, A, that could be recovered from a heat source, Q (in joules), at temperature Tj (in kelvins) emitting to a reservoir at ground-state temperature Ta as:

A = (1 − Ta / Tj) Q    (1)
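Equation (1) is easy to sanity-check numerically. The following minimal Python sketch (the function name is illustrative) evaluates it for the two waste streams discussed in this section:

```python
def carnot_availability(q_joules, t_source_k, t_ambient_k=298.0):
    """Upper bound on work recoverable from heat Q at source temperature Tj,
    rejecting to an ambient reservoir at Ta, per Eq. (1): A = (1 - Ta/Tj) * Q."""
    return (1.0 - t_ambient_k / t_source_k) * q_joules

# 1 J of gas-turbine exhaust heat at 773 K (500 C)
print(round(carnot_availability(1.0, 773.0), 3))  # 0.614
# 1 J of server exhaust air at 323 K (50 C)
print(round(carnot_availability(1.0, 323.0), 3))  # 0.077
```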

For example, with reference to Equation 1, relative to an ambient temperature of 298 K (25 °C), 1 Joule of heat energy at 773 K (500 °C), such as exhaust gases from a gas turbine, can give 0.614 Joules of available energy. Therefore, a waste stream at this high temperature has good availability (61%) that can be harvested. By the same token, the same Joule at 323 K (50 °C), such as exhaust air from a high-power server, can only give 0.077 Joules of work. While this determines the theoretical maximum Carnot work obtainable with a perfect reversible engine, the actual work is much less due to irreversible losses such as friction. Stated simply, the second law of thermodynamics places a limit on the amount of energy that can be converted from one form to another. Similarly, the laws of thermodynamics can be applied to other conversion means, e.g., electrochemical reactions in fuel cells, to estimate the portion of the reaction enthalpy that can be converted to electricity [6].

Traditional methods of design involve the completion of an energy balance based on the conservation theory of the first law of thermodynamics. Such a balance can provide the information necessary to reduce thermal losses or enhance heat recovery, but an energy analysis fails to account for degradation in the quality of energy due to the irreversibilities predicted by the second law of thermodynamics. Thus, an approach based on the second law of thermodynamics is necessary for analyzing available energy or exergy consumption across the lifecycle of a product, from "cradle to cradle". Furthermore, it can also be used to create the analytics necessary to run operations that minimize the destruction of exergy, and to create inference analytics that can enable need-based provisioning of resources. Lastly, exergy analysis is important for determining the value of a waste stream and tying it to an appropriate process that can make the most of it. For example, converting exhaust heat energy to electrical energy using a thermo-electric conversion process may be worthwhile in some cases, but not in others, once one takes into account the exergy required to build and operate the thermo-electric conversion means.

2.2 Exergy Chain in IT

Electrical energy is produced by converting energy from one form to another. A common chain starts with converting the chemical energy in a fuel to thermal energy through combustion, then to mechanical energy in a rotating physical device, and finally to electrical energy in a magnetically based dynamo. Alternatively, the available energy in water, such as the potential energy at a given height in a dam, can be converted to mechanical energy and then to electrical energy. Electrical energy is 100% available. However, as electrical energy is transmitted and distributed from the source to the point of use, losses along the way lead to destruction of availability. The source of power for most data centers (i.e., a thermal power station) operates at an efficiency in the neighborhood of 35% to 60% [7]. Transmission and distribution losses can range from 5% to 12%. System-level efficiency of the data center power delivery infrastructure (i.e., from building to chip) can range from 60% to 85% depending on component efficiency and load; around 80% is typical for a fully loaded state-of-the-art data center. Overall, out of every watt generated at the source, only about 0.3 W to 0.4 W is used for computation. If the generation cycle itself, as well as the overhead of the data center infrastructure (i.e., cooling), is taken into account, the coal-to-chip power delivery efficiency will be around 5% to 12%. In addition to the consumption of exergy in operation, the material within the data center has exergy embedded in it.
The embedded exergy stems from the exergy required to extract, manufacture, mitigate waste, and reclaim the material. Exergy is also embedded in IT as a result of the direct use of water (for cooling) and the indirect use of water (for production of parts, electricity, etc.). Water, too, can be represented using exergy. As an example, assuming nature desalinates water and there is sufficient fresh water available from the natural cycle, one can represent the exergy embedded in water as a result of distribution (exergy required to pump) and treatment (exergy required to treat waste water). On average, treatment and distribution of a million gallons of surface water requires 1.5 MWh of electrical energy. Similarly, treatment of a million gallons of waste water consumes 2.5 MWh of electrical energy [8].
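The coal-to-chip chain described above can be sketched by multiplying stage efficiencies. In the Python sketch below, the stage values are mid-range assumptions drawn from the ranges quoted in the text, and the 0.50 cooling-overhead factor is an added assumption, not a figure from the paper:

```python
# Illustrative coal-to-chip efficiency chain; all values are assumptions
# chosen from (or added to) the ranges quoted in the text.
stages = {
    "generation": 0.35,        # thermal power station: 35-60% [7]
    "transmission": 0.90,      # 5-12% T&D losses
    "power_delivery": 0.70,    # building to chip: 60-85%
    "cooling_overhead": 0.50,  # assumed share of delivered power reaching IT
}
efficiency = 1.0
for stage, e in stages.items():
    efficiency *= e
print(f"coal-to-chip efficiency = {efficiency:.1%}")  # 11.0%, within the 5-12% cited
```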

3 Supply Side and Demand Side Management

3.1 Architectural Framework

In order to build sustainable ecosystems, the following systemic framework articulates the management of the supply and demand sides of available energy based on the needs of the users.

• On the supply side:
  o minimizing the exergy required to extract, manufacture, mitigate waste, transport, operate and reclaim components;
  o design and management using local sources of available energy to minimize the destruction of exergy in transmission and distribution, e.g., dissipation in transmission; and taking advantage of exergy in the waste streams, e.g., exhaust heat from a turbine.
• On the demand side:
  o minimizing the consumption of exergy by provisioning resources based on the needs of the user, using flexible building blocks, pervasive sensing, communications, knowledge discovery and policy-based control.

Sustainable ecosystems, given the supply and demand side definitions above, are then built on delivering to the needs of the user. The needs of the user are derived from the service level agreement (SLA) and decomposed into lower-level metrics that can be applied in the field to enable integrated management of supply and demand. The balance of the paper steps through the framework by examining lifetime exergy consumption in IT, evolving a supply-demand framework for data centers, and closing by extending the framework to other ecosystems.

3.2 Quantifying Lifetime Exergy Consumption

As noted earlier, exergy or available energy, stemming from the second law of thermodynamics, fuses information about materials and energy use into a single meaningful measure. It estimates the maximum work in Joules that could theoretically have been extracted from a given amount of material or energy. By expressing a given system in terms of its lifetime exergy consumption, it becomes possible to remove dependencies on the type of material or the type of energy (heat, electricity, etc.) consumed. Therefore, given the lifecycle of a product, as shown in Figure 2, one can now create an abstract information plane that can be commonly applied across any arbitrary infrastructure. Lifecycle design then implies inputting the entire supply chain from "cradle to cradle" to account for the exergy consumed in extraction, manufacturing, waste mitigation, transportation, operation and reclamation. From a supply side perspective, designers can focus on minimizing lifetime energy consumption through de-materialization, material choices, transportation, process choices, etc., across the lifecycle. The design toolkit requires a repository of available energy consumption data for various materials and processes.

Fig. 2. Lifecycle of a product

With respect to IT, the following provides an overview of the salient "hotspots" discerned using an exergy-based lifetime analysis [9]:

• For service-oriented access devices such as laptops, given a typical residential usage pattern, the lifetime operational exergy consumption is 20-30% of the total exergy consumed, while the rest is embedded (exergy consumed in extraction, manufacturing, transportation, reclamation).
  o Of the 70-80% embedded lifetime exergy consumption, the display is a big component.
• For data centers, for a given server, the lifetime operational exergy consumption is about 60% to 80% of the total lifetime exergy consumption [10].
  o The large operational component stems from the high electricity consumption of the IT equipment and the data-center-level cooling infrastructure [11][12].

From a strategic perspective, for handhelds, laptops and other forms of access devices, reducing the embedded exergy is critical. And, in order to minimize embedded exergy, least-exergy process and material innovations are important. As an example, innovations in display technologies can reduce the embedded footprint of laptops and handhelds. On the other hand, for data centers, it is important to devise an architecture that focuses on minimizing the electricity (100% available energy) consumed in operation. The next section presents a supply-demand based architecture for peak exergetic efficiency in the synthesis and operation of a data center. A sustainable data center, built on lifetime exergy considerations, flexible and configurable resource micro-grids, pervasive sensing, communications and aggregation of sensed data, knowledge discovery and policy-based autonomous control, is proposed.

3.3 Synthesis of a Sustainable Data Center Using Lifetime Exergy Analysis

Business goals drive the synthesis of a data center. For example, assuming a million users subscribing to a service at US $1/month, the expected revenue would be US $12 million per year. Correspondingly, it may be desirable to limit the infrastructure total cost of ownership (TCO), excluding software licenses and personnel, to 1/5th of that amount, or roughly US $2.4 million per year. A simple understanding of the impact of power can be gained by estimating the cost implications in areas where low-cost utility power is not available and diesel generators are used as a primary source. For example, if the data center supporting the million users consumes 1 MW of power at 1 W per user, or 8.76 million kWh per year, the cost of just powering it with diesel at approximately $0.25 per kWh will be about $2.2 million per year. Thus, growth economies strained on the resource side and reliant on local power generation with diesel will be at a great disadvantage.

Having understood such constraints, the data center must meet the target TCO and uptime based on the service level agreements for a variety of workloads. Data center synthesis can be enabled by using a library of IT and facility templates to create a variety of design options. Given the variety of supply-demand design options, the key areas of analysis for sustainability and cost become:

1. Lifecycle analysis to evaluate each IT-Facility template and systematically dematerialize to drive towards the least lifetime embedded exergy design and lowest capital outlay, e.g., systematically reduce the number of IT, power and cooling units, remove excessive material in the physical design, etc.
   • The overall exergy analysis should also consider exergy in waste streams, and the locality of the data center to avail supply side resources (power and cooling).
2. Performance modeling toolkit to determine the performance of the ensemble and estimate the consumption of exergy during operation.
3. Reliability modeling toolkit to discover and design for various levels of uptime within the data center.
4. Performance modeling toolkit to determine the ability to meet the SLAs for a given IT-Facility template.
5. TCO modeling to estimate the deviation from the target TCO.

Combining all the key elements noted above enables a structured analysis of a set of applicable data center design templates for a given business need. The data center can be benchmarked in terms of performance per Joule of lifetime available energy or exergy destroyed. The lifetime exergy can be incorporated in a total cost of ownership model that includes software, personnel and licenses to determine the total cost of ownership of a rack [4], and used to price a cloud business model such as "infrastructure as a service".
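The revenue and diesel-cost arithmetic above can be reproduced in a few lines; all figures below come from the text:

```python
users = 1_000_000
revenue = users * 1 * 12         # US$/year at $1 per user per month
tco_target = revenue / 5         # infrastructure TCO budget: 1/5th of revenue
energy_kwh = 1_000 * 8_760       # 1 MW (1 W/user) running all year, in kWh
diesel_cost = energy_kwh * 0.25  # diesel generation at ~$0.25/kWh
print(revenue, tco_target, diesel_cost)  # 12000000 2400000.0 2190000.0
```

Note that the diesel energy bill alone ($2.19 M) nearly consumes the entire $2.4 M infrastructure TCO budget, which is the point the text makes about diesel-reliant growth economies.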
3.4 Demand Side Management of Sustainable Data Centers

On the demand side, it is instructive to trace the energy flow in a data center. Electrical energy, all of which is available to do useful work, is transferred to the IT equipment. Most of the available electrical energy drawn by the IT hardware is dissipated as heat energy, while useful work is availed through information processing. The amount of useful work is not proportional to the power consumed by the IT hardware: even in idle mode, IT hardware typically consumes more than 50% of its maximum power [32]. As noted in [32], it is important to devise energy-proportional machines. However, it is also important to increase the utilization of IT hardware and reduce the total amount of required hardware; [31] presents such a resource management architecture. A common approach to increasing utilization is executing applications in virtual machines and consolidating the virtual machines onto fewer, larger servers [33]. As shown in [15], workload consolidation has the potential to reduce the IT power demand significantly.
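The consolidation effect can be illustrated with a simple linear power model. In the sketch below, the idle and peak wattages are made-up round numbers consistent with the greater-than-50% idle draw noted above; the linear shape is itself an assumption:

```python
def server_power(utilization, p_idle=150.0, p_max=300.0):
    """Linear power model in Watts. Idle draw is 50% of peak, consistent with
    the >50% idle consumption noted in the text; the linear shape and the
    wattages are illustrative assumptions."""
    return p_idle + (p_max - p_idle) * utilization

# Four lightly loaded servers versus the same total load consolidated onto one
spread = 4 * server_power(0.20)    # 4 x 180 W = 720 W
consolidated = server_power(0.80)  # 270 W, with three servers powered off
print(spread, consolidated)        # 720.0 270.0
```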


Next, additional exergy is used to actively transfer the heat energy from the chips to the external ambient. Not all of the exergy delivered to the cooling equipment is used to effect the heat transfer: while a fraction of the electrical energy provided to a blower or pump is converted to flow work (the product of pressure in N/m² and volume flow in m³/s), and likewise a portion of the electrical energy applied to a compressor is converted to thermodynamic work (to reduce the temperature of the data center), the bulk of the exergy provided is destroyed due to irreversibility. Therefore, in order to build an exergy-efficient system, the mantra for demand side management in the data center becomes one of allocating IT (compute, networking and storage), power and cooling resources based on need, with the following salient considerations:

• Decompose SLAs into Service Level Objectives (SLOs):
  o based on the SLOs, allocate appropriate IT resources while meeting the performance and uptime requirements [28][29][30];
  o account for the spatial and temporal efficiencies and redundancies associated with thermo-fluids behavior in a given data center, based on heat loads and cooling boundary conditions [14].
• Consolidate workloads while taking into account the spatial and temporal efficiencies noted above, e.g., place critical workloads in "gold zones" of the data center, which have inherent redundancies due to the intersection of fluid flows from multiple air conditioning units, and turn off or scale back power to IT equipment not in use [13][14][31].
• Enable the data center cooling equipment to scale based on the heat load distribution in the data center [11].

Dynamic implementation of the key points described above can result in better utilization of resources, a reduction of active redundant components, and a reduction in electrical power consumption by half [13][15]. As an example, a data center designed for 1 MW of power at maximum IT load can run at up to 80% capacity with workload consolidation and dynamic control; the balance of 200 kW can be used for cooling and other support equipment. Indeed, besides availing a failover margin by operating at 80%, the coefficient of performance of the power and cooling ensemble is often optimal at about 80% loading, given the efficiency curves of UPSs and of mechanical equipment such as blowers, compressors, etc.

3.5 Coefficient of Performance of the Ensemble

Figure 3 shows a schematic of the energy transfer in a typical air-cooled data center through flow and thermodynamic processes. Heat is transferred from the heat sinks on a variety of chips (microprocessors, memory, etc.) to the cooling fluid in the system: driven by fans, air as a coolant enters the system, undergoes a temperature rise based on the mass flow, and is exhausted into the room. Fluid streams from different servers undergo mixing and other thermodynamic and flow processes in the exhaust area of the racks. For air-cooled servers and racks, the dominant irreversibilities that lead to destruction of exergy arise from the mixing of cold and hot air streams and the mechanical inefficiency of air-moving devices. These streams (or some fraction of them) flow back to the modular computer room air conditioning units


(CRACs) and transfer heat to the chilled water (or refrigerant) in the cooling coils. Heat transferred to the chilled water at the cooling coils is transported to the chillers through a hydronics network. The coolant in the hydronics network, water in this case, undergoes a pressure drop and heat transfer until it loses heat to the expanding refrigerant in the evaporator coils of the chiller. The heat extracted by the chiller is dissipated through the cooling tower. Work is added at each stage to change the flow and thermodynamic state of the fluid. While this example shows a chilled-water infrastructure, the problem definition and analysis can be extended to other forms of cooling infrastructure.

Fig. 3. Energy Flow in the IT stack – Supply and Demand Side

Development of a performance model at each stage in the heat flow path can enable efficient cooling equipment design, and provide a holistic view of operational exergy consumption from the chips to the cooling tower. The performance model should be agnostic and applicable to an ensemble of components for any environmental control infrastructure. [16] proposes such an operational metric to quantify the performance of the ensemble from chips to cooling tower. The metric, called the coefficient of performance of the ensemble, COPG, builds on the thermodynamic metric called coefficient of performance [16]. Maximizing the coefficient of performance of the ensemble leads to minimization of the exergy required to operate the cooling equipment. In Figure 3, the systems (such as processor, networking and storage blades) are modeled as "exergo-thermo-volumes" (ETV), an abstraction to represent the lifetime exergy consumption of the IT building blocks and their cooling performance [17][18]. The thermo-volumes portion represents the coolant volume flow and the resistance to the flow, characterized by volume flow (V̇) and pressure drop (ΔP) respectively, to affect heat energy removal from the ETVs. The product of pressure drop (ΔP in N/m²) and volume flow (V̇ in m³/s) determines the flow work required to move a given coolant (air here) through the given IT building block represented as an ETV. The minimum coolant volume flow (V̇) for a given temperature rise required through the ETV, shown by a dashed line in Fig. 3, can be determined from the energy equation (Eq. 2).

Q̇ = ṁ Cp (Tout − Tin), where ṁ = ρ V̇    (2)

where Q̇ is the heat dissipated in Watts, ṁ is the mass flow in kg/s, ρ is the density in kg/m³ of the coolant (air in this example), Cp is the specific heat capacity of air (J/kg-K), and Tin and Tout represent the inlet and outlet temperatures of the air. As shown in Equation 3, the electrical power (100% available energy) required by the blower (Wb) is the ratio of the calculated flow work to the blower wire-to-air efficiency, ζb. The blower characteristic curves show the efficiency (ζb) and are important for understanding the optimal capacity at the ensemble level.

Wb = (ΔPetv × V̇etv) / ζb    (3)
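Equations (2) and (3) combine into a small sizing sketch. In the Python below, the server heat load, allowed temperature rise, pressure drop and blower efficiency are hypothetical values chosen for illustration:

```python
RHO_AIR = 1.2    # kg/m^3, assumed air density near 25 C
CP_AIR = 1005.0  # J/(kg*K), specific heat capacity of air

def min_volume_flow(q_watts, delta_t_k):
    """Minimum coolant volume flow V for heat load Q and allowed air
    temperature rise dT, from Eq. (2): Q = rho * V * Cp * dT."""
    return q_watts / (RHO_AIR * CP_AIR * delta_t_k)

def blower_power(delta_p_pa, v_dot_m3s, zeta_b):
    """Blower electrical power from Eq. (3): flow work over the wire-to-air
    efficiency zeta_b."""
    return (delta_p_pa * v_dot_m3s) / zeta_b

# Hypothetical 300 W server, 15 K air rise, 100 Pa drop, 25% blower efficiency
v_dot = min_volume_flow(300.0, 15.0)    # ~0.0166 m^3/s of air
w_b = blower_power(100.0, v_dot, 0.25)  # ~6.6 W of electrical power
```

Note how the low wire-to-air efficiency multiplies the ideal flow work by four, which is exactly the irreversibility overhead the text describes.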

The total heat load of the datacenter is assumed to be a direct summation of the power delivered to the computational equipment via the UPS and PDUs. Extending the coefficient of performance (COP) to encompass the power required by cooling resources in the form of flow and thermodynamic work, the ratio of the total heat load to the power consumed by the cooling infrastructure is defined as:

COP = Total Heat Dissipation / (Flow Work + Thermodynamic Work) of Cooling System
    = Heat Extracted by Air Conditioners / Net Work Input    (4)

The ensemble COP is then represented as shown below. The reader is referred to [16] for details.

COPG = Qdatacenter / (Σk Wsystem,k + Σl Wblower,l + Σm Wpump,m + Σn Wcompressor,n + Σo Wcoolingtower,o)    (5)
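Equation (5) is a straightforward ratio once the work terms are tallied. The sketch below evaluates it for a hypothetical 1 MW facility; all work terms are invented round numbers, not measurements from [16]:

```python
def cop_ensemble(q_datacenter_w, w_system, w_blower, w_pump,
                 w_compressor, w_coolingtower):
    """COP_G per Eq. (5): data center heat load over the summed work drawn by
    system fans, CRAC blowers, pumps, chiller compressors and cooling towers.
    Each argument after the first is an iterable of per-unit Watts."""
    total_work = (sum(w_system) + sum(w_blower) + sum(w_pump)
                  + sum(w_compressor) + sum(w_coolingtower))
    return q_datacenter_w / total_work

# Hypothetical 1 MW heat load; the work terms below are illustrative only
cop_g = cop_ensemble(1_000_000,
                     w_system=[50_000], w_blower=[60_000], w_pump=[40_000],
                     w_compressor=[250_000], w_coolingtower=[30_000])
print(round(cop_g, 2))  # 2.33
```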

3.6 Supply Side of a Sustainable Data Center Design

The supply side motivation to design and manage the data center using local sources of available energy is to minimize the destruction of exergy in distribution. Besides reducing exergy loss in distribution, a local micro-grid [19] can take advantage of resources that might otherwise remain unutilized, and also presents an opportunity to use the exergy in waste streams. Figure 4 shows a power grid with a solar and methane based generator at a dairy farm. The methane is produced by anaerobic digestion of manure from dairy cows [20]. The use of biogas from manure is well known and has been practiced all over the world [7].

Fig. 4. Biogas and solar electric

The advantage of co-location [20] stems from the use of the heat energy exhausted by the server racks, one Joule of which has a maximum theoretical available energy of 0.077 J at 323 K (50 °C), to enhance methane production, as shown in Figure 4. The hot water from the data center is circulated through the "soup" in the digester. Furthermore, [20] also suggests the use of the available energy in the exhaust stream of the electric generator to drive an adsorption refrigeration cycle to cool the data center. Thus, multiple options for power (wind, sun, biogas and natural gas) sourced locally can power a data center. And high- and low-grade exergy in waste streams, such as exhaust gases, can be utilized to drive other systems. Indeed, cooling for the data center ought to follow the same principles: a cooling grid made up of local sources. A cooling grid can be made up of ground-coupled loops to dissipate heat into the ground, and can use outside air (Figure 3) when it is at an appropriate temperature and humidity to cool the data center.

3.7 Integrated Supply-Demand Management of Sustainable Data Centers

Figure 5 shows the architectural framework for integrated supply-demand management of a data center. The key components of the data center, namely IT (compute, networking and storage), power and cooling, have five key horizontal elements. The foundational design elements of the data center are lifecycle design using exergy as a measure, and flexible micro-grids of power, cooling and IT building blocks. The micro-grids give the integrated manager the ability to choose between multiple supply side sources of power, multiple supply side sources of cooling, and multiple types of IT hardware and software. The flexibility in power and cooling provides the ability to set the power levels of IT systems and to vary the cooling (speed of the blowers, etc.). The design flexibility in IT comes from an intelligent scheduling framework, multiple power states and virtualization [15].
On this design foundation, the management layers are sensing and aggregation, knowledge discovery and policy based control.

Fig. 5. Architectural framework for a Sustainable Data Center. The verticals IT, Power and Cooling span five horizontal layers: Policy Based Control; Knowledge Discovery & Visualization; Pervasive Sensing; Scalable, Configurable Resource Micro-grids; and Lifetime Based Design.

At runtime, the integrated IT-Facility manager maintains the run-time status of the IT and facility elements of the datacenter. A lower-level facility manager collects physical, environmental and process data from racks, the room, chillers, power distribution components, etc. Based on the higher-level requirements passed down by the integrated manager, the facility management system creates low-level SLAs for the operation of power and cooling devices, e.g., translating high-level energy efficiency goals into lower-level temperature and utilization levels for facility elements to guarantee SLAs. The integrated manager has a knowledge discovery and visualization module with data analytics for monitoring lifetime reliability, availability and downtimes for preventive maintenance. It has modules that provide insights that are otherwise not apparent at runtime, e.g., temporal data mining of facility historical data. As an example, data mining techniques have been explored for more efficient operation of an ensemble of chillers [23][34]. In [23], operational patterns (or motifs) are mined in historical data pertaining to an ensemble of water- and air-cooled chillers. These patterns are characterized in terms of their COPG, thus allowing comparison in terms of operational energy efficiency. At the control level in Figure 5, a variety of approaches can be taken. As an example, in one approach, the cooling controller maintains dynamic control of the facility infrastructure (including CRACs, UPSs, chillers, supply side power, etc.) at levels determined by the facility manager to optimize the COPG, e.g., providing the requisite air flow to the racks to maintain the temperature at the inlet of the racks between 25 °C and 30 °C [11][12][14][21].
While exercising dynamic cooling control through the facility manager, the controller also provides information to the IT manager to consolidate workloads and optimize the performance of the data center, e.g., the racks are ranked based on thermo-fluids efficiency at a given time, and the ranking is used in workload placement [15]. IT equipment not in use is scaled down by the integrated manager [15]. Furthermore, in order to reduce the redundancy in the data center, working in conjunction with the IT and facility managers, the integrated manager uses virtualization and power scaling as flexibility to mitigate failures, e.g., air conditioner failures [12][14][15]. Based on past work, the total energy consumed, in power and cooling, with these demand management techniques would be half of that of state-of-the-art designs. Coupling the demand side management with the supply side options from the local power grid and the local cooling grid opens up a completely new approach to integrated supply-demand side management [24] and can lead to a "net zero" data center.
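The rack-ranking placement idea can be sketched as follows. The rack names, efficiency scores and workload names below are invented for illustration and are not taken from [15]:

```python
# Sketch of thermo-fluids-aware placement: racks are ranked by a cooling
# efficiency score and workloads (listed most critical first) are assigned
# to the best-cooled "gold zone" racks first. All values are hypothetical.
racks = {"rack-A": 0.92, "rack-B": 0.67, "rack-C": 0.81}

def place(workloads, racks):
    """Assign workloads to racks in descending order of efficiency score."""
    ranked = sorted(racks, key=racks.get, reverse=True)
    return {w: ranked[i % len(ranked)] for i, w in enumerate(workloads)}

placement = place(["billing", "search", "batch"], racks)
print(placement)  # {'billing': 'rack-A', 'search': 'rack-C', 'batch': 'rack-B'}
```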

4 Applying the Supply-Demand Framework to Other Ecosystems

In the previous sections, the supply-demand framework was applied to devising least-lifetime-exergy data centers. Sustainable IT can now become IT for sustainability, enabling need-based provisioning of resources (power, water, waste, transportation, etc.) at the scale of cities, and can thus deliver a net positive impact by reducing the consumption and depletion of precious Joules of available energy. Akin to the sustainable data center, the foundation of the "Sustainable City" or "City 2.0" is comprehensive lifecycle design [25][26]. Unlike previous generations, where cities were built predominantly focusing on cost and the functionality desired by inhabitants, sustainable cities will require a comprehensive lifecycle view, where systems are designed not just for operation but for optimality across resource extraction, manufacturing and transport, operation, and end of life. The next distinction within sustainable cities will arise in the supply-side resource pool. The inhabitants of sustainable cities are expected to desire on-demand, just-in-time access to resources at affordable costs. Instead of following a centralized production model with large distribution and transmission networks, a more distributed model is proposed: augmentation of the existing centralized infrastructure with local resource micro-grids. As shown for the sustainable data center, there is an opportunity to exploit locally available resources to create local supply side grids made up of multiple local sources, e.g., power generation by photo-voltaic cells on roof tops, and utilization of the exergy available in the waste streams of cities, such as municipal waste and sewage, together with natural gas fired turbines with full utilization of the waste stream from the turbine.
Similarly, for other key verticals such as water, there is an opportunity to leverage past experience to build water micro-grids using local sources, e.g., harvesting rain water to charge local man-made reservoirs and underground aquifers. Indeed, past examples such as the Amber Fort in Jaipur, Rajasthan, India show such considerations in arid regions [27].

Fig. 6. Architectural Framework for a Sustainable City. The design and management layers (Policy Based Control; Knowledge Discovery & Visualization; Pervasive Sensing; Scalable, Configurable Resource Micro-grids; Lifetime Based Design) span city verticals such as Electricity, Water, Transport and Waste.


As shown in Figure 6, having constructed lifecycle-based physical infrastructures consisting of configurable resource micro-grids, the next key element is a pervasive sensing layer. Such a sensing infrastructure can generate data streams pertaining to the current supply and demand of resources emanating from disparate geographical regions, their operational characteristics, performance and sustainability metrics, and the availability of transmission paths between the different micro-grids. The great strides made in building high-density, small-lifecycle-footprint IT storage can enable archival storage of the aggregated data about the state of each micro-grid. Sophisticated data analysis and knowledge discovery methods can be applied to both streaming and archival data to infer trends and patterns, with the goal of transforming the operational state of the systems towards least-exergy operations. The data analysis can also enable the construction of models using advanced statistical and machine learning techniques for optimization, control and fault detection. Intelligent visualization techniques can provide high-level indicators of the 'health' of each system being monitored. The analysis can enable end-of-life replacement decisions, e.g., when to replace pumps in a water distribution system to make the most of the lifecycle. Lastly, while a challenging task, given the flexible and configurable resource pools, pervasive sensing, data aggregation and knowledge discovery mechanisms, an opportunity exists to devise a policy-based control system. As an example, given a sustainability policy, upstream and downstream pumps in a water micro-grid can operate to maintain a balance of demand and supply.
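As a toy illustration of such policy-based balancing (a sketch invented here, not a method from the text), a proportional controller can nudge a pump's supply rate toward measured demand:

```python
# Toy sketch of supply-demand balancing in a water micro-grid: a proportional
# controller moves pump supply toward demand at each control step.
def control_step(pump_rate, demand, gain=0.5):
    """Move the supply rate a fraction of the way toward the measured demand."""
    return pump_rate + gain * (demand - pump_rate)

rate = 100.0  # L/s, arbitrary starting supply
for demand in [120.0, 120.0, 120.0, 120.0]:
    rate = control_step(rate, demand)
print(rate)  # 118.75, converging toward the 120 L/s demand
```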

5 Summary and Conclusions

This paper presented a supply-demand framework to enable sustainable ecosystems and suggested that a sustainable IT ecosystem, built using such a framework, can enable IT to drive sustainability in other human-managed ecosystems. With respect to the IT ecosystem, an architecture for a sustainable data center composed of three key components—IT, power and cooling—and five key design and management elements that cut across these verticals was presented (Figure 5). The key elements—lifecycle design, scalable and configurable resource micro-grids, sensing, knowledge discovery and policy-based control—enable the supply-demand management of the key components. Next, as shown in Figure 6, an architecture for “sustainable cities” built on the same principles, integrating IT elements across key city verticals such as power, water, waste and transport, was presented. One hopes that cities are able to incorporate the key elements of the proposed architecture along one vertical at a time; when a sufficient number of verticals have been addressed, a unified city-scale architecture can be achieved. The instantiation of the sustainability framework and the architecture for data centers and cities will require a multi-disciplinary workforce. Given the need to develop this human capital, the specific call to action at this venue is to:

• Leverage the past, and return to “old school” core engineering, to build the foundational elements of the supply-demand architecture using lifecycle design and supply-side design principles.

Sustainable Ecosystems: Enabled by Supply and Demand Management

• Create a multi-disciplinary curriculum composed of various fields of engineering, e.g., a melding of computer science and mechanical engineering to scale the supply- and demand-side management of power.
  o The curriculum also requires social and economic tracks, as sustainability in its broadest definition is defined by the economic, social and environmental spheres—the triple bottom line—and requires us to operate at the intersection of these spheres.


C.D. Patel


Unclouded Vision

Jon Crowcroft¹, Anil Madhavapeddy¹, Malte Schwarzkopf¹, Theodore Hong¹, and Richard Mortier²

¹ Cambridge University Computer Laboratory, 15 JJ Thomson Avenue, Cambridge CB3 0FB, UK
  [email protected]
² Horizon Digital Economy Research, University of Nottingham, Triumph Road, Nottingham NG7 2TU, UK
  [email protected]

Abstract. Current opinion and debate surrounding the capabilities and use of the Cloud is particularly strident. By contrast, the academic community has long pursued completely decentralised approaches to service provision. In this paper we contrast these two extremes, and propose an architecture, Droplets, that enables a controlled trade-off between the costs and benefits of each. We also provide indications of implementation technologies and three simple sample applications that substantially benefit by exploiting these trade-offs.

1 Introduction

The commercial reality of the Internet and mobile access to it is muddy. Generalising, we have a set of cloud service providers (e.g. Amazon, Facebook, Flickr, Google and Microsoft, to name a representative few), and a set of devices that many – and soon most – people use to access these resources (so-called smartphones, e.g., Blackberry, iPhone, Maemo, Android devices). This combination of hosted services and smart access devices is what many people refer to as “The Cloud” and is what makes it so pervasive. But this situation is not entirely new. Once upon a time, looking as far back as the 1970s, we had “thin clients” such as ultra-thin glass ttys accessing timesharing systems. Subsequently, the notion of thin client has periodically resurfaced in various guises such as the X-Terminal, and Virtual Networked Computing (VNC) [14]. Although the world is not quite the same now as back in those thin client days, it does seem similar in economic terms. But why is it not the same? Why should it not be the same? The short answer is that the end user, whether in their home or on the top of the Clapham Omnibus,¹ has in their pocket a device with vastly more resource than a mainframe of the 1970s by any measure, whether processing speed, storage capacity or network access rate. With this much power at our fingertips, we should be able to do something smarter than simply using our devices as vastly over-specified dumb terminals.

¹ http://en.wikipedia.org/wiki/The_man_on_the_Clapham_omnibus

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 29–40, 2011. © Springer-Verlag Berlin Heidelberg 2011


Meanwhile, the academic reality is that many people have been working at the opposite extreme from this commercial reality, trying to build “ultra-distributed” systems, such as peer-to-peer file sharing, swarms,² ad-hoc mesh networks, and mobile decentralised social networks,³ in complete contrast to the centralisation trends of the commercial world. We choose to coin the name “The Mist” for these latter systems. The defining characteristic of the Mist is that data is dispersed among a multitude of responsible entities (typically, though not exclusively, ordinary users), rather than being under the control of a single monolithic provider. Haggle [17], Mirage [11] and Nimbus [15] are examples of architectures for, respectively, the networking, operating system and storage components of the Mist. The Cloud and the Mist are extreme points in a spectrum, each with its upsides and downsides. Following a discussion of users’ incentives (§2), we will expand on the capabilities of these two extremes (§3). We will then describe our proposed architecture (§4) and discuss its implications for three particular application domains (§5), before concluding (§6).

2 User Incentives

For the average user, accustomed to doing plain old storage and computation on their own personal computer or mobile (what we might term “The Puddle”), there are multiple competing incentives pushing in many directions: both towards and away from the Cloud, and towards and away from the Mist (see Figure 1).

[Figure 1 shows the Puddle (the user’s own device) pulled in two directions: toward the Cloud by sociality, ease of use, central management, scalability and virus protection; and toward the Mist by default privacy, freedom from lock-in, physical control, high bandwidth and hacking protection, along the axes of Sharing, Sync, Data Location, Speed and Security.]

Fig. 1. Incentives pushing users toward the centralised Cloud vs. the decentralised Mist

² http://bittorrent.com/
³ http://joindiaspora.com, http://peerson.net/


Consider some of the forms of utility a user wants from their personal data:

– Sharing. There is a tension between the desire to share some personal data easily with selected peers (or even publicly), and the need for control over more sensitive information. The social Cloud tends to share data, whereas the decentralised Mist defaults to privacy at the cost of making social sharing more difficult.
– Synchronization. The Cloud provides a centralised naming and storage service to which all other devices can point. As a downside, this service typically incurs an ongoing subscription charge while remaining vulnerable to the provider stopping the service. Mist devices work in a peer-to-peer fashion which avoids provider lock-in, but have to deal with synchronisation complexity.
– Data Location. The Cloud provides a convenient, logically centralised data storage point, but the specific location of any component is hard for the data owner to control.⁴ In contrast, the decentralised Mist permits physical control over where the devices are, but makes it hard to reliably ascertain how robustly stored and backed-up the data is.
– Speed. A user must access a centralised Cloud via the Internet, which limits access speeds and creates high costs for copying large amounts of data. In the Mist, devices are physically local and hence have higher bandwidth. However, Cloud providers can typically scale their service much better than individuals for those occasions when “flash traffic” drives a global audience to popular content.
– Security. A user of the Mist is responsible for keeping their devices updated and can be vulnerable to malware if they fall behind. However, the damage of an intrusion is limited to their own devices. In contrast, a Cloud service is usually protected by dedicated staff and systems, but presents a valuable hacking target in which any failure can have widespread consequences, exposing the personal data of millions of users.
These examples demonstrate the clear tension between what users want from services managing their personal data vs. how Cloud providers operate in order to keep the system economically viable. Ideally, the user would like to keep their personal data completely private while still hosting it on the Cloud. On the other hand, the cloud provider needs to recoup hosting costs by, e.g., selling advertising against users’ personal data. Even nominally altruistic Mist networks need incentives to keep them going: e.g., in BitTorrent it was recently shown that a large fraction of the published content is driven by profit-making companies rather than altruistic amateur filesharers [2]. Rather than viewing this as a zero-sum conflict between users and providers, we seek to leverage the smart capabilities of our devices to provide happy compromises that can satisfy the needs of all parties. By looking more closely at the true underlying interests of the different sides, we can often discover solutions that achieve seemingly incompatible goals [6].

⁴ http://articles.latimes.com/2010/jul/24/business/la-fi-google-la-20100724

3 The Cloud vs. the Mist

To motivate the Droplets architecture, we first examine the pros and cons of the Cloud and the Mist in more detail.

The Cloud’s Benefits: Centralising resources brings several significant benefits, specifically:

– economies of scale,
– reduction in operational complexity, and
– commercial gain.

Perhaps the most significant of these is the offloading of the configuration and management burden traditionally imposed by computer systems of all kinds. Additionally, cloud services are commonly implemented using virtualisation technology, which enables statistical multiplexing and greater efficiencies of scale while still retaining “Chinese walls” that protect users from one another. As cloud services have grown, they have constructed specialised technology dedicated to the task of large-scale data storage and retrieval, for example the new crop of “NoSQL” databases in recent years [10]. Most crucially, centralised cloud services have built up valuable databases of information that did not previously exist. Facebook’s “social graph” contains detailed information on the interactions of hundreds of millions of individuals every day, including private messages and media. These databases are not only commercially valuable in themselves, they can also reinforce a monopoly position, as the network effect of having sole access to this data can prevent other entrants from constructing similar databases.
Even if you object to a change, can you get your data back and move it to another provider, and ensure that they have really deleted it?

The Mist’s Benefits: Accessing the Cloud can be financially costly due to the need for constant high-bandwidth access. Using the Mist, we can reduce our access costs because data is stored (cached) locally and need only be uploaded to others selectively and intermittently. We keep control over privacy, choosing exactly what to share with whom and when. We also have better access to our data: we retain control over the interfaces used to access it; we are immune to service disruptions which might affect the network or cloud provider; and we cannot be locked out from our own data by a cloud provider.

The Mist’s Costs: Ensuring reliability and availability in a distributed decentralised system is extremely complex. In particular, a new vector for breach of personal data is introduced: we might leave our fancy device on top of the aforesaid Clapham Omnibus with our data in it! We have to manage the operation of the system ourselves, and need to be connected often enough for others to be able to contact us.

Droplets: A Happy Compromise? In between these two extremes should lie the makings of a design that has all the positives and none of the negatives. In fact, a hint of a way forward is contained in the comments above. If data is encrypted on both our personal computer/device and in the Cloud, then for privacy purposes it doesn’t really matter where it is physically stored. However, for performance reasons, we do care. Hence we’d like to carry information of immediate value close to us. We would also like it replicated in multiple places for reliability reasons. We also observe that the vast majority of user-generated content is of interest only within the small social circle of the content’s subject/creator/producer/owner, and thus note that interest/popularity in objects tends to be Zipf-distributed.

In the last paragraph, it might be unclear who “we” are: “we” refers to Joe Public, whether sitting at home or on the top of that bus. However, there is another important set of stakeholders: those who provide The Cloud and The Net. These stakeholders need to make money lest all of this fail. The service provider needs revenue to cover operational expenses and to make a profit, but is loath to charge the user directly. Even in the case of the network, ISPs (and 3G providers) are mostly heading toward flat data rates. As well as targeted advertisements and associated “click-through” revenue, service providers also want to carry out data mining to do market research of a more general kind. Fortunately, recent advances in cryptography and security hint at ways to continue to support the two-sided business models that abound in today’s Internet.
In the case of advertising, the underlying interest of the Cloud provider is actually the ability to sell targeted ads, not to know everything about its users. Privacy-preserving query techniques can permit ads to be delivered to users matching certain criteria without the provider actually knowing which users they were [8,9,16]. In the case of data mining on the locations or transactions of users, techniques such as differential privacy [5] and k-anonymity [18] can allow providers to make queries on aggregate data without being able to determine information about specific users. So we propose Droplets, halfway between the Cloud and the Mist. Droplets make use of the Mirage operating system [11], Nimbus storage [15] and Haggle networking [17]. They float between the personal device and the cloud, using technologies such as social networks, virtualisation and migration [1,3], and they provide the basic components of a Personal Container [12]. They condense within social networks, where privacy is assured by society, but in the great unwashed Internet, they stay opaque. The techniques referred to above allow the service providers to continue to provide the storage, computation, indexing, search and transmission services that they do today, with the same wide range of business models.
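As one concrete illustration of such aggregate queries, the Laplace mechanism of differential privacy [5] can be sketched in a few lines of Python. The records, query and ε value below are our own toy example, not any provider's actual API.

```python
# Illustrative sketch: a counting query answered with Laplace noise,
# so the provider learns an aggregate without confidently learning
# whether any single user is included (the sensitivity of a count is 1).
import random

def dp_count(records, predicate, epsilon=0.5):
    """Return the count of matching records perturbed with Laplace
    noise of scale 1/epsilon (smaller epsilon = more privacy)."""
    true_count = sum(1 for r in records if predicate(r))
    # A Laplace(0, 1/epsilon) sample is the difference of two exponentials.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

users = [{"city": "Cambridge"}, {"city": "Nottingham"}, {"city": "Cambridge"}]
noisy_answer = dp_count(users, lambda u: u["city"] == "Cambridge")
```

Repeated queries consume privacy budget in a real system; this sketch omits budget accounting entirely.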

4 Droplets

Droplets are units of network-connected computation and storage, designed to migrate around the Internet and personal devices. At a droplet’s core is the Mirage operating system, which compiles high-level language code into specialised targets such as Xen micro-kernels, UNIX binaries, or even Javascript applications. The same Mirage source code can thus run on a cloud computing platform, within a user’s web browser, on a smart-phone, or even as a plugin on a social network’s own servers. As we note in Table 1, there is no single “perfect” location where a Droplet should run all the time, and so this agility of placement is crucial to maximising satisfaction of the users’ needs while minimising their costs and risks.

Table 1. Comparison of different potential Droplets platforms

Platform        Google AppEngine   VM (e.g., on EC2)    Home Computer       Mobile Phone
Storage         moderate           moderate             high                low
Bandwidth       high               high                 limited             low
Accessibility   always on          always on            variable            variable
Computation     limited            flexible, plentiful  flexible, limited   limited
Cost            free               expensive            cheap               cheap
Reliability     high               high                 medium (failure)    low (loss)
Storage in such an environment presents a notable challenge, which we address via the Nimbus system, a distributed, encrypted and delay-tolerant personal data store. Working on the assumption that personal data access follows a Zipf power-law distribution, popular objects can be kept live on relatively expensive but low-latency platforms such as a Cloud virtual machine, while older objects can be archived inexpensively but safely on a storage device at home. Nimbus also provides local attestation in the form of “trust fountains,” which let nodes provide a cryptographic attestation witnessing another node’s presence or ownership of some data. Trust fountains are entirely peer-to-peer, and so proof is established socially (similarly to the use of lawyers or public notaries) rather than via a central authority. Haggle provides a delay-tolerant networking platform, in which all nodes are mobile and can relay messages via various routes. Even with the use of central “stable” nodes such as the Cloud, outages will still occur due to the scale and dynamics of the Cloud and the Net, as has happened several times to such high-profile and normally robust services as GMail. During such events, the user must not lose all access to their data, and so the Haggle delay-tolerant model is a good fit. It is also interesting to observe that many operations performed by users are quite latency-insensitive, e.g. backups can be performed incrementally, possibly overnight.
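The Zipf-driven placement policy just described can be illustrated with a small sketch. This is our own illustration, not Nimbus code, and the 25% hot fraction is an arbitrary assumption: rank objects by access frequency and keep only the head of the distribution live in the cloud.

```python
# Illustrative sketch of frequency-based tiering: the hot head of a
# Zipf-like access distribution stays on a low-latency cloud VM, the
# long tail is archived on a cheap home storage device.

def place_objects(access_counts, hot_fraction=0.25):
    """Partition object ids into 'cloud' and 'archive' tiers by access
    frequency; the top `hot_fraction` of objects stay live in the cloud."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    cutoff = max(1, int(len(ranked) * hot_fraction))
    return {"cloud": ranked[:cutoff], "archive": ranked[cutoff:]}

counts = {"photo1": 900, "photo2": 40, "email_2004": 2, "video": 310}
tiers = place_objects(counts)
```

A real store would re-evaluate placement as access counts drift, migrating objects between tiers rather than recomputing the partition from scratch.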

4.1 Deployment Model

Droplets are a compromise between the extremely-distributed Mist model and the more centralised Cloud. They store a user’s data and provide a network interface to this data rather than exposing it directly. The nature of this access depends on where the Droplet has condensed.

– Internet droplet. If the Droplet is running exposed to the wild Internet, then the network interfaces are kept low-bandwidth and encrypted by default. To prevent large-scale data leaks, the Droplet rejects operations that would download or erase a large body of data.
– Social network droplet. For hosting data, a droplet can condense directly within a social network, where it provides access to its database to the network, e.g., for data mining, in return for “free” hosting. Rather than allowing raw access, it can be configured to only permit aggregate queries to help populate the provider’s larger database, but still keep track of its own data.
– Mobile droplet. The Droplet provides high-bandwidth, unfettered access to data. It also regularly checks with any known peers to see whether a remote wipe instruction should cause it to permanently stop serving data.
– Archiver droplet. Usually runs on a low-power device, e.g., an ARM-based BeagleBoard, accepting streams of data changes but not itself serving data. Its resources are used to securely replicate long-term data, ensuring it remains live, and to alert the user in case of significant degradation.
– Web droplet. A Droplet in a web browser executes as a local Javascript application, where it can provide web bookmarklet services, e.g., trusted password storage. It uses cross-domain AJAX to update a more reliable node with pertinent data changes.

Droplets can thus adapt their external interfaces depending on where they are deployed, allowing negotiation of an acceptable compromise between hosting costs and the desire for privacy.
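This location-dependent adaptation can be expressed as a simple policy table. The sketch below is our own condensation of the deployment descriptions above; the operation names are hypothetical.

```python
# Our own simplification of the deployment model: which operations a
# droplet serves depends on where it has condensed. Operation names
# and the table itself are illustrative, not from any implementation.

POLICIES = {
    # Internet droplets refuse bulk download/erase to limit data leaks.
    "internet": {"bulk_download": False, "aggregate_query": True,  "raw_read": False},
    # Social-network droplets trade aggregate queries for free hosting.
    "social":   {"bulk_download": False, "aggregate_query": True,  "raw_read": False},
    # Mobile droplets give the owner unfettered access.
    "mobile":   {"bulk_download": True,  "aggregate_query": True,  "raw_read": True},
    # Archiver droplets accept change streams but serve no data at all.
    "archiver": {"bulk_download": False, "aggregate_query": False, "raw_read": False},
}

def allowed(deployment, operation):
    """True if a droplet condensed at `deployment` should serve `operation`;
    unknown deployments and operations default to deny."""
    return POLICIES.get(deployment, {}).get(operation, False)
```

Defaulting to deny for unknown deployments mirrors the paper's privacy-first framing: a droplet that cannot tell where it is condensed behaves like an exposed Internet droplet.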

4.2 Trust Fountains

To explain trust fountains by way of example, consider the following. As part of the instantiation of their Personal Container, Joe Public runs an instance of a Nimbus trust fountain. When creating a droplet from some data stored in his Personal Container, this trust fountain creates a cryptographic attestation proving Joe’s ownership of the data at that time, in the form of a time-dependent hash token. The droplet is then encrypted under this hash token using a fast, medium-strength cipher⁵ and pushed out to the cloud. By selectively publishing the token, Joe can grant access to the published droplet, e.g., allowing data mining access to a provider in exchange for free data storage and hosting. Alternatively,

⁵ Strong encryption is not required as the attestations are unique for each droplet publication and breaking one does not grant an attacker access to any other droplets.


the token might only be shared with a few friends via an ad hoc wireless network in a coffee shop, granting them access only to that specific data at that particular time.
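Token issuance can be sketched using only Python's standard library. The paper does not fix a construction; an HMAC over the data hash and a time stamp is our illustrative choice, and `TrustFountain` is a hypothetical name.

```python
# Illustrative sketch of trust-fountain attestation (construction is
# our assumption): a time-dependent hash token is an HMAC over the
# data's hash and the issuance time stamp, under the fountain's secret.
import hashlib
import hmac
import time

class TrustFountain:
    """Issues time-dependent hash tokens attesting ownership of data."""

    def __init__(self, secret):
        self._secret = secret
        self.issue_log = []  # (data_hash, timestamp, token) triples

    def attest(self, data, timestamp=None):
        """Return a token witnessing ownership of `data` at `timestamp`;
        the droplet would then be encrypted under this token before
        being pushed out to the cloud."""
        ts = int(time.time()) if timestamp is None else timestamp
        data_hash = hashlib.sha256(data).digest()
        token = hmac.new(self._secret,
                         data_hash + ts.to_bytes(8, "big"),
                         hashlib.sha256).digest()
        self.issue_log.append((data_hash, ts, token))
        return token
```

Selectively publishing a token grants access to exactly that droplet at that time; because issuance is deterministic given the fountain's secret, the fountain can later regenerate any token it logged.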

4.3 Backwards Provenance

A secondary purpose of the attestation is to enable “backwards provenance”, i.e., a way to prove ownership. Imagine that Joe publishes a picture of some event which he took using his smartphone while driving past it on that oft-considered bus. A large news agency picks up and uses that picture after Joe publishes it to his Twitter stream using a droplet. The attached attestations then enable the news agency to compensate both the owner and potentially the owner’s access provider, who takes a share in all profits made from Joe’s digital assets in exchange for serving them. Furthermore, Joe is given a tool to counter “hijacking” of his creation even if the access token becomes publicly known: using the cryptographic properties of the token, the issue log of his trust fountain, together with his provider’s confirmation of receipt of the attested droplet, forms sufficient evidence to prove ownership and take appropriate legal action. Note that Joe Public can also deny ownership if he chooses, as only his trust fountain holds the crucial information necessary to regenerate the hash token and thus prove the attestation’s origin.
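Under an illustrative construction of our own (tokens are HMACs of the data hash and a time stamp under the fountain's secret, and the fountain keeps an issue log of (data_hash, timestamp, token) triples; neither detail is specified in the paper), the ownership proof reduces to regenerating the disputed token from the log:

```python
# Sketch of the backwards-provenance check: only the fountain holding
# the secret can regenerate a disputed token from its issue log, which
# is exactly the evidence of origin described in the text.
import hashlib
import hmac

def proves_ownership(secret, issue_log, disputed_token):
    """True if `disputed_token` matches a token this fountain issued,
    recomputed from a logged (data_hash, timestamp) pair."""
    for data_hash, ts, _ in issue_log:
        regenerated = hmac.new(secret,
                               data_hash + ts.to_bytes(8, "big"),
                               hashlib.sha256).digest()
        if hmac.compare_digest(regenerated, disputed_token):
            return True
    return False
```

An attacker without the secret cannot produce a matching regeneration, and Joe can deny ownership simply by declining to run the check, as only his fountain holds the secret.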

4.4 Handling 15 Minutes of Fame

Of course, whenever a droplet becomes sufficiently popular to merit condensation into a cloud burst of marketing, then we have the means to support this transition, and we have the motivation and incentives to make sure the right parties are rewarded. In this last paragraph, “we” refers to all stakeholders: users, government and business. It seems clear that the always-on, everywhere-logged, ubiquitously-connected vision will continue to be built, while real people become increasingly concerned about their privacy [4]. Without such privacy features, it is unclear for how much longer the commercial exploitation of personal data will continue to be acceptable to the public; but without such exploitation, it is unclear how service providers can continue to provide the many “free” Internet services on which we have come to rely.

5 Droplications

The Droplet model requires us to rethink how we construct applications – rather than building centralised services, they must now be built according to a distributed, delay-tolerant model. In this section, we discuss some of the early services we are building.

5.1 Digital Yurts

In the youthful days of the Internet, there was a clear division between public data (web homepages, FTP sites, etc.) and private (e-mail, personal documents, etc.). It was common to archive personal e-mail, home directories and so on, and thus to keep a simple history of all our digital activities. The pace of change in recent years has been tremendous, not only in the variety of personal data, but in where that data is held. It has moved out of the confines of desktop computers to data-centres hosted by third parties such as Google, Yahoo and Facebook, who provide “free” hosting of data in return for mining information from millions of users to power advertising platforms. These sites are undeniably useful, and hundreds of millions of users voluntarily surrender private data in order to easily share information with their circle of friends. Hence, the variety of personal data available online is booming – from media (photographs, videos), to editorial (blogging, status updates), and streaming (location, activity). However, privacy is rapidly rising up the agenda as companies such as Facebook and Google collect vast amounts of data from hundreds of millions of users. Unfortunately, the only alternative that privacy-sensitive users currently have is to delete their online accounts, losing both access to and what little control they have over their online social networks. Often, deletion does not even completely remove their online presence. We have become digital nomads: we have to fetch data from many third-party hosted sites to recover a complete view of our online presence. Why is it so difficult to go back to managing our own information, using our own resources? Can we do so while keeping the “good bits” of existing shared systems, such as ease-of-use, serendipity and aggregation? Although the immediate desire to regain control of our privacy is a key driver, there are several other longer-term concerns about third parties controlling our data.
The incentives of hosting providers are not aligned with the individual: we care about preserving our history over our lifetime, whereas the provider will choose to discard information when it ceases to be useful for advertising. This is where the Droplet model is useful – rather than dumbly storing data, we can also negotiate access to that data with hosting providers via an Internet droplet, and arrive at a compromise between letting them data-mine it, versus the costs of hosting it. When the hosting provider loses interest in the older, historical tail of data, the user can deploy an archival droplet to catch the data before it disappears, and archive it for later retrieval.
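The hand-over from hosting provider to archival droplet can be made concrete with a small sketch; the function, data layout and day-based ages here are our own simplification, not part of any described implementation.

```python
# Hypothetical sketch of the archival-droplet behaviour: before a
# provider discards the historical tail, pull any item older than the
# provider's retention window into local storage.

def archive_tail(provider_items, local_archive, retention_days, today):
    """Copy items older than `retention_days` from the provider's view
    into `local_archive`; return the ids newly archived."""
    newly = []
    for item_id, meta in provider_items.items():
        age = today - meta["posted"]
        if age > retention_days and item_id not in local_archive:
            local_archive[item_id] = meta["payload"]
            newly.append(item_id)
    return newly
```

Run periodically (e.g. overnight, exploiting the latency-insensitivity noted in §4), this keeps the user's lifetime history intact even after the provider's advertising interest in it has lapsed.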

5.2 Dust Clouds

Dust Clouds [13] is a proposal for the provision of secure anonymous services using extremely lightweight virtual machines hosted in the cloud. As they are lightweight, they can be created and destroyed with very short lifetimes, yet still achieve useful work. However, several tensions exist between the requirements of users and cloud providers in such a system.

For example, cloud providers have a strong requirement for a variety of auditing functions. They need to know who consumed what resources in order to bill, to provision appropriately, to ensure that, e.g., upstream service level agreements with other providers are met, and so on. They would tend to prefer centralisation for reasons already mentioned (efficiency, economy of scale, etc.). By contrast, individual consumers use such a system precisely because it provides anonymity while they are doing things that they wish not to be attributed to them, e.g., to avoid arrest. Anonymity in a dust cloud is largely provided by having a rich mixnet of traffic and other resource consumption. Consumers would also prefer diversity, in both geography and provider, to ensure that they are not at the mercy of a single judicial/regulatory system.

Pure cloud approaches fail the users’ requirements by putting too much control in the hands of one (or a very small number of) cloud providers. Pure mist approaches fail the user by being unable to provide the richness of mixing needed for sufficient anonymity: many of the devices in the mist are either insufficiently powerful or insufficiently well-connected to support a large enough number of users’ processes. By taking a Droplets approach we obviate both these issues: the lightweight nature of VM provisioning means that it becomes largely infeasible for the cloud provider to track in detail what users are doing, particularly when critical parts of the overall distributed process/communication are hosted on non-cloud infrastructure. Local auditing for payment recovery based on resources used is still possible, but the detailed correlation and reconstruction of an individual process’s behaviour becomes effectively impossible. At the same time, the scalable and generally efficient nature of cloud-hosted resources can be leveraged to ensure that the end result is itself suitably scalable.

Evaporating Droplets

With Droplets, we also have a way of creating truly ephemeral data items in a partially trusted or untrusted environment, such as a social network, or the whole Internet. Since Droplets have the ability to do computation, they can refuse to serve data if access prerequisites are not met: for example, time-dependent hashes created from a key and a time stamp can be used to control access to data in a Droplet. Periodically, the user's "trust fountain" will issue new keys, notifying the Droplet that it should now accept the new key only. To "evaporate" data in a Droplet, the trust fountain simply ceases to provide keys for it, thus making users unable to access the Droplet, even if they still have the binary data or even the Droplet itself (assuming, of course, that brute-forcing the hash key is not a worthwhile option). Furthermore, their access is revoked even in a disconnected state, i.e. when the Droplet cannot be notified to accept only the new hash key: since the requester must provide the time stamp and key as authentication tokens in order for the Droplet to generate the correct hash, expired keys can no longer be used, as they have to be provided along with their genuine origin time stamp. Additionally, as a more secure approach, the Droplet could even periodically re-encrypt its contents in order to combat brute-forcing. This subject has been of some research interest recently. Another approach [7] relies on statistical metrics that require increasingly large amounts of data from a DHT to be available to an attacker in order to reconstruct the data, but
is vulnerable to certain Sybil attacks [19]. Droplets, however, can ensure that all access to data is completely revoked, even when facing a powerful adversary in a targeted attack. Furthermore, as a side effect of the hash-key based access control, the evaporating Droplet could serve different views, or stages of evaporation, to different requesters depending on the access key they use (or its age). Finally, the "evaporating Droplet" can be made highly accessible from a user perspective by utilizing a second Droplet: a Web Droplet (see §4.1) that integrates with a browser can automate the process of requesting access keys from trust fountains and unlocking the evaporating Droplet's contents.
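As an illustration only (the paper does not fix a concrete construction), the time-dependent hash scheme could be instantiated with an HMAC over the key and its origin time stamp; the class and function names below are our own, not part of the Droplets design.

```python
import hashlib
import hmac

def access_token(key: bytes, timestamp: int) -> bytes:
    """Derive a time-dependent token from a key and its origin timestamp."""
    return hmac.new(key, str(timestamp).encode(), hashlib.sha256).digest()

class Droplet:
    """Serves its data only to holders of a token derived from the current key."""

    def __init__(self, data: bytes, key: bytes, issued_at: int):
        self._data = data
        self._current = access_token(key, issued_at)

    def rotate(self, new_key: bytes, issued_at: int) -> None:
        """Called when the trust fountain issues a new key: old tokens die."""
        self._current = access_token(new_key, issued_at)

    def read(self, key: bytes, timestamp: int) -> bytes:
        # The requester must present the key together with its genuine
        # origin timestamp; an expired key no longer matches.
        if hmac.compare_digest(access_token(key, timestamp), self._current):
            return self._data
        raise PermissionError("stale or invalid key")
```

Once the fountain rotates to a new key, a requester holding only the old key is refused, even though it still holds the Droplet's binary data, which is exactly the "evaporation" behaviour described above.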

6 Conclusions and Future Work

In this paper, we have discussed the tension between the capabilities of and demands on the Cloud and the Mist. We concluded that both systems are at opposite ends of a spectrum of possibilities and that compromise between providers and users is essential. From this, we derived an architecture for an alternative system, Droplets, that enables control over the trade-oﬀs involved, resulting in systems acceptable to both hosting providers and users. Having realised two of the main components involved in Droplets, Haggle networking and the Mirage operating system, we are now completing realisation of the third, Nimbus storage, as well as building some early “droplications”.

References

1. Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I., Warfield, A.: Live migration of virtual machines. In: USENIX Symposium on Networked Systems Design & Implementation (NSDI), pp. 273–286. USENIX Association, Berkeley (2005)
2. Cuevas, R., Kryczka, M., Cuevas, A., Kaune, S., Guerrero, C., Rejaie, R.: Is content publishing in BitTorrent altruistic or profit-driven? (July 2010), http://arxiv.org/abs/1007.2327
3. Cully, B., Lefebvre, G., Meyer, D.T., Karollil, A., Feeley, M.J., Hutchinson, N.C., Warfield, A.: Remus: High availability via asynchronous virtual machine replication. In: USENIX Symposium on Networked Systems Design & Implementation (NSDI). USENIX Association, Berkeley (April 2008)
4. Doctorow, C.: The Things that Make Me Weak and Strange Get Engineered Away. Tor.com (August 2008), http://www.tor.com/stories/2008/08/weak-and-strange
5. Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006)
6. Fisher, R., Patton, B.M., Ury, W.L.: Getting to Yes: Negotiating Agreement Without Giving In. Houghton Mifflin (April 1992), http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0395631246
7. Geambasu, R., Kohno, T., Levy, A., Levy, H.M.: Vanish: Increasing data privacy with self-destructing data. In: Proceedings of the USENIX Security Symposium (August 2009)
8. Guha, S., Reznichenko, A., Tang, K., Haddadi, H., Francis, P.: Serving Ads from localhost for Performance, Privacy, and Profit. In: Proceedings of Hot Topics in Networking (HotNets), New York, NY (October 2009)
9. Haddadi, H., Hui, P., Brown, I.: MobiAd: Private and scalable mobile advertising. In: Proceedings of MobiArch (to appear, 2010)
10. Leavitt, N.: Will NoSQL databases live up to their promise? Computer 43(2), 12–14 (2010), http://dx.doi.org/10.1109/MC.2010.58
11. Madhavapeddy, A., Mortier, R., Sohan, R., Gazagnaire, T., Hand, S., Deegan, T., McAuley, D., Crowcroft, J.: Turning down the LAMP: software specialisation for the cloud. In: HotCloud 2010: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, p. 11. USENIX Association, Berkeley (2010)
12. Mortier, R., et al.: The Personal Container, or Your Life in Bits. In: Proceedings of Digital Futures (October 2010)
13. Mortier, R., Madhavapeddy, A., Hong, T., Murray, D., Schwarzkopf, M.: Using Dust Clouds to enhance anonymous communication. In: Proceedings of the Eighteenth International Workshop on Security Protocols (IWSP) (April 2010)
14. Richardson, T., Stafford-Fraser, Q., Wood, K.R., Hopper, A.: Virtual network computing. IEEE Internet Computing 2(1), 33–38 (1998)
15. Schwarzkopf, M., Hand, S.: Nimbus: Intelligent Personal Storage. Poster at the Microsoft Research Summer School 2010, Cambridge, UK (2010)
16. Shikfa, A., Önen, M., Molva, R.: Privacy in content-based opportunistic networks. In: AINA Workshops, pp. 832–837 (2009)
17. Su, J., Scott, J., Hui, P., Crowcroft, J., De Lara, E., Diot, C., Goel, A., Lim, M.H., Upton, E.: Haggle: Seamless networking for mobile applications. In: Krumm, J., Abowd, G.D., Seneviratne, A., Strang, T. (eds.) UbiComp 2007. LNCS, vol. 4717, pp. 391–408. Springer, Heidelberg (2007)
18. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 557–570 (2002)
19. Wolchok, S., Hofmann, O., Heninger, N., Felten, E., Halderman, J., Rossbach, C., Waters, B., Witchel, E.: Defeating Vanish with low-cost Sybil attacks against large DHTs. In: Proceedings of the 17th Network and Distributed System Security Symposium (NDSS), pp. 37–51 (2010)

Generating Fast Indulgent Algorithms

Dan Alistarh (EPFL, Switzerland), Seth Gilbert (National University of Singapore), Rachid Guerraoui (EPFL, Switzerland), and Corentin Travers (Université de Bordeaux 1, France)

Abstract. Synchronous distributed algorithms are easier to design and prove correct than algorithms that tolerate asynchrony. Yet, in the real world, networks experience asynchrony and other timing anomalies. In this paper, we address the question of how to efficiently transform an algorithm that relies on synchronization into an algorithm that tolerates asynchronous executions. We introduce a transformation technique from synchronous algorithms to indulgent algorithms [1], which induces only a constant overhead in terms of time complexity in well-behaved executions. Our technique is based on a new abstraction we call an asynchrony detector, which the participating processes implement collectively. The resulting transformation works for a large class of colorless tasks, including consensus and set agreement. Interestingly, we also show that our technique is relevant for colored tasks, by applying it to the renaming problem, to obtain the first indulgent renaming algorithm.

1 Introduction

The feasibility and complexity of distributed tasks have been thoroughly studied in both the synchronous and asynchronous models. To better capture the properties of real-world systems, Dwork, Lynch, and Stockmeyer [2] proposed the partially synchronous model, in which the distributed system may alternate between synchronous and asynchronous periods. This line of research inspired the introduction of indulgent algorithms [1], i.e. algorithms that guarantee correctness and efficiency when the system is synchronous, and maintain safety even when the system is asynchronous. Several indulgent algorithms have been designed for specific distributed problems, such as consensus (e.g., [3, 4]). However, designing and proving correctness of such algorithms is usually a difficult task, especially if the algorithm has to provide good performance guarantees.

Contribution. In this paper, we introduce a general transformation technique from synchronous algorithms to indulgent algorithms, which induces only a constant overhead in terms of time complexity. Our technique is based on a new primitive called an asynchrony detector, which identifies periods of asynchrony in a fault-prone asynchronous system. We showcase the resulting transformation to obtain indulgent algorithms for a large class of colorless agreement tasks, including consensus and set agreement. We also apply our transformation to the distinct class of colored tasks, to obtain the first indulgent renaming algorithm.

Detecting Asynchrony. Central to our technique is a new abstraction, called an asynchrony detector, which we design as a distributed service for detecting periods of
asynchrony. The service detects asynchrony both at a local level, by determining whether the view of a process is consistent with a synchronous execution, and at a global level, by determining whether the collective view of a set of processes could have been observed in a synchronous execution. We present an implementation of an asynchrony detector, based on the idea that each process maintains a log of the messages sent and received, which it exchanges with other processes. This creates a view of the system for every process, which we use to detect asynchronous executions.

The Transformation Technique. Based on this abstraction, we introduce a general technique allowing synchronous algorithms to tolerate asynchrony, while maintaining time efficiency in well-behaved executions. The main idea behind the transformation is the following: as long as the asynchrony detector signals a synchronous execution, processes run the synchronous algorithm. If the system is well behaved, then the synchronous algorithm yields an output, on which the process decides. Otherwise, if the detector notices asynchrony, we revert to an existing asynchronous backup algorithm with weaker termination and performance guarantees.

Transforming Agreement Algorithms. We first showcase the technique by transforming algorithms for a large class of agreement tasks, called colorless tasks, which includes consensus and set agreement. Intuitively, a colorless task allows processes to adopt each other's output values without violating the task specification, while ensuring that every value returned has been proposed by a process. We show that any synchronous algorithm solving a colorless task can be made indulgent at the cost of two rounds of communication. For example, if a synchronous algorithm solves synchronous consensus in t + 1 rounds, where t is the maximum number of crash failures (i.e. the algorithm is time-optimal), then the resulting indulgent algorithm will solve consensus in t + 3 rounds if the system is initially synchronous, or will revert to a safe backup, e.g. Paxos [4, 5] or ASAP [6], otherwise. The crux of the technique is the hand-off procedure: we ensure that, if a process decides using the synchronous algorithm, any other process either decides or adopts a state which is consistent with the decision. In this second case, we show that a process can recover a consistent state by examining the views of other processes. The validity property will ensure that the backup protocol generates a valid output configuration.

Transforming Renaming Algorithms. We also apply our technique to the renaming problem [7], and obtain the first indulgent renaming algorithm. Starting from the synchronous protocol of [8], our protocol renames into a tight namespace of N names and terminates in (log N + 3) rounds in synchronous executions. In asynchronous executions, the protocol renames into a namespace of size N + t.

Roadmap. In Section 2, we present the model, while Section 3 presents an overview of related work. We define asynchrony detectors in Section 4. Section 5 presents the transformation for colorless agreement tasks, while Section 6 applies it to the renaming problem. In Section 7 we discuss our results. Due to space limitations, the proofs of some basic results are omitted, and we present detailed sketches for some of the proofs.

2 Model

We consider an eventually synchronous system with N processes Π = {p1, p2, ..., pN}, in which t < N/2 processes may fail by crashing. Processes communicate via message-passing in rounds, which we model much as in [3, 9, 10]. In particular, time is divided into rounds, which are synchronized. However, the system is asynchronous, i.e. there is no guarantee that a message sent in a round is also delivered in the same round. We do assume that processes receive at least N − t messages in every round, and that a process always receives its own message in every round. Also, we assume that there exists a global stabilization time GST ≥ 0 after which the system becomes synchronous, i.e. every message is delivered in the same round in which it was sent. We denote such a system by ES(N, t). Although indulgent algorithms are designed to work in this asynchronous setting, they are optimized for the case in which the system is initially synchronous, i.e. when GST = 0. We denote the synchronous message-passing model with t < N failures by S(N, t). In case the system stabilizes at a later point in the execution, i.e. 0 < GST < ∞, the algorithms are still guaranteed to terminate, although they might be less efficient. If the system never stabilizes, i.e. GST = ∞, indulgent algorithms might not terminate, although they always maintain safety. In the following, we say that an execution is synchronous if every message sent by a correct process in the course of the execution is delivered in the same round in which it was sent. Equivalently, if process pi receives a message m from process pj in round r ≥ 2, then every process received all messages sent by process pj in all rounds r′ < r. The view of a process p at a round r is given by the messages that p received at round r and in all previous rounds. We say that the view of process p is synchronous at round r if there exists an r-round synchronous execution which is indistinguishable from p's view at round r.
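As a minimal sketch (the encoding is ours, not the paper's), the per-round delivery constraints of ES(N, t) can be expressed as a predicate over a single process's receive pattern:

```python
def admissible(N, t, deliveries):
    """Check one process's receive pattern against the ES(N, t) rules:
    at least N - t messages in every round, always including its own.

    `deliveries` maps round number -> set of sender ids; we adopt the
    (arbitrary) convention that the checking process itself has id 0.
    """
    return all(0 in senders and len(senders) >= N - t
               for senders in deliveries.values())
```

A pattern that drops the process's own message, or delivers fewer than N − t messages in some round, is not admissible in ES(N, t); note that admissibility says nothing about which round a message was sent in, which is exactly the asynchrony the detector of Section 4 must catch.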

3 Related Work

Starting with seminal work by Dwork, Lynch and Stockmeyer [2], a variety of different models have been introduced to express relaxations of the standard asynchronous model of computation. These include failure detectors [11], round-by-round fault detectors (RRFD) [12], and, more recently, indulgent algorithms [1]. In [3, 9], Guerraoui and Dutta address the complexity of indulgent consensus in the presence of an eventually perfect failure detector. They prove a tight lower bound of t + 2 rounds on the time complexity of the problem, even in synchronous runs, thus proving that there is an inherent price to tolerating asynchronous executions. Our approach is more general than that of this reference, since we transform a whole class of synchronous distributed algorithms, solving various tasks, into their indulgent counterparts. On the other hand, since our technique induces a delay of two rounds of communication over the synchronous algorithm, in the case of consensus, we miss the lower bound of t + 2 rounds by one round. Recent work studied the complexity of agreement problems, such as consensus [6] and k-set agreement [10], if the system becomes synchronous after an unknown stabilization time GST. In [6], the authors present a consensus algorithm that terminates in
f + 2 rounds after GST , where f is the number of failures in the system. In [10], the authors consider k-set agreement in the same setting, proving that t/k + 4 rounds after GST are enough for k-set agreement, and that at least t/k + 2 rounds are required. The algorithms from these references work with the same time complexity in the indulgent setting, where GST = 0. On the other hand, the transformation in the current paper does not immediately yield algorithms that would work in a window of synchrony. From the point of view of the technique, references [6, 10] also use the idea of “detecting asynchrony” as part of the algorithms, although this technique has been generalized in the current work to address a large family of distributed tasks. Reference [13] considered a setting in which failures stop after GST , in which case 3 rounds of communication are necessary and sufficient. Leader-based, Paxos-like algorithms, e.g. [4, 5], form another class of algorithms that tolerate asynchrony, and can also be seen as indulgent algorithms. A precise definition of colorless tasks is given in [14]. Note that, in this paper, we augment their definition to include the standard validity property (see Section 5).

4 Asynchrony Detectors

An asynchrony detector is a distributed service that detects periods of asynchrony in an asynchronous system that may be initially synchronous. The service returns a YES/NO indication at the end of every round, and has the property that processes which receive YES at some round share a synchronous execution prefix. Next, we make this definition precise.

Definition 1 (Asynchrony Detector). Let d be a positive integer. A d-delay asynchrony detector in ES(N, t) is a distributed service that, in every round r, returns either YES or NO at each process. The detector ensures the following properties.

– (Local detection) If process p receives YES at round r, then there exists an r-round synchronous execution in which p has the same view as its current view at round r.
– (Global detection) For all processes that receive YES in round r, there exists an (r − d)-round synchronous execution prefix S[1, 2, ..., r − d] that is indistinguishable from their views at the end of round r − d.
– (Non-triviality) The detector never returns NO during a synchronous execution.

The local detection property ensures that, if the detector returns YES, then there exists a synchronous execution consistent with the process's view. On the other hand, the global detection property ensures that, for processes that receive YES from the detector, the (r − d)-round execution prefix was "synchronous enough", i.e. there exists a synchronous execution consistent with what these processes perceived during the prefix. The non-triviality property ensures that there are no false positives.

4.1 Implementing an Asynchrony Detector

Next, we present an implementation of a 2-delay asynchrony detector in ES(N, t), which we call AD(2). The pseudocode is presented in Figure 1.

The main idea behind the detector, implemented in the process procedure, is that processes maintain a detailed view of the state of the system by aggregating all messages received in every round. For each round, each process maintains an Active set of processes, i.e. processes that sent at least one message in the round; all other processes are in the Failed set for that round (lines 2–4). Whenever a process receives a new message, it merges the contents of the Active and Failed sets of the sender with its own (lines 8–9). Asynchrony is detected by checking whether there exists any process that is in the Active set in some round r, while being in the Failed set in some previous round r′ < r (lines 10–12). In the next round, each process sends its updated view of the system together with a synch flag, which is set to false if asynchrony was detected.

 1  procedure detector()_i
 2      msg_i ← ⊥; synch_i ← true; Active_i ← [ ]; Failed_i ← [ ]
 3      for each round R_c do
 4          send(msg_i)
 5          msgSet_i ← receive()
 6          (synch_i, msg_i) ← process(msgSet_i, R_c)
 7          if synch_i = true then output YES
 8          else output NO

 1  procedure process(msgSet_i, R_c)_i
 2      if synch_i = true then
 3          Active_i[R_c] ← processes from which p_i receives a message in round R_c
 4          Failed_i[R_c] ← processes from which p_i did not receive a message in round R_c
 5      if there exists p_j ∈ msgSet_i with synch_j = false then synch_i ← false
 6      for every msg_j ∈ msgSet_i do
 7          for round r from 1 to R_c do
 8              Active_i[r] ← msg_j.Active_j[r] ∪ Active_i[r]
 9              Failed_i[r] ← msg_j.Failed_j[r] ∪ Failed_i[r]
10      for round r from 1 to R_c − 1 do
11          for round k from r + 1 to R_c do
12              if Active_i[k] ∩ Failed_i[r] ≠ ∅ then synch_i ← false
13      if synch_i = true then
14          msg_i ← (synch_i, (Active_i[r])_{r ∈ [1, R_c]}, (Failed_i[r])_{r ∈ [1, R_c]})
15      else msg_i ← (synch_i, ⊥, ⊥)
16      return (synch_i, msg_i)

Fig. 1. The AD(2) asynchrony detection protocol
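For concreteness, the core bookkeeping of Figure 1, recording the Active/Failed sets (lines 2–4), merging received views (lines 8–9), and the ordering check (lines 10–12), might look as follows in Python; the class and method names are ours, and message transport is abstracted away.

```python
class AsynchronyDetector:
    """Local state of one AD(2) instance: per-round Active/Failed sets
    plus the synch flag."""

    def __init__(self):
        self.synch = True
        self.active = {}   # round number -> set of process ids seen active
        self.failed = {}   # round number -> set of process ids seen failed

    def record_round(self, rc, senders, all_processes):
        """Lines 2-4: note who did (not) send to us in round rc."""
        if self.synch:
            self.active.setdefault(rc, set()).update(senders)
            self.failed.setdefault(rc, set()).update(all_processes - senders)

    def merge(self, other_active, other_failed, other_synch):
        """Lines 5-9: fold a received view into our own."""
        if not other_synch:
            self.synch = False
        for r, s in other_active.items():
            self.active.setdefault(r, set()).update(s)
        for r, s in other_failed.items():
            self.failed.setdefault(r, set()).update(s)

    def check(self, rc):
        """Lines 10-12: a process active in a round k but failed in some
        earlier round r betrays an out-of-order delivery; output YES iff
        no such pair exists and no merged view reported asynchrony."""
        for r in range(1, rc):
            for k in range(r + 1, rc + 1):
                if self.active.get(k, set()) & self.failed.get(r, set()):
                    self.synch = False
        return "YES" if self.synch else "NO"
```

A peer's view that marks some process as failed in round 1 while our own view has it active in round 2 makes the intersection of line 12 non-empty, so the detector flips to NO and stays there.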

4.2 Proof of Correctness

In this section, we prove that the protocol presented in Section 4.1 satisfies the definition of an asynchrony detector. First, to see that the local detection condition is satisfied, notice that the contents of the Active and Failed sets at each process p can be used to construct a synchronous execution which is coherent with process p's view. In the following, we focus on the global detection property. We show that, for a fixed round r > 0, given a set of processes P ⊆ Π that receive YES from AD(2) at the end of
round r + 2, there exists an r-round synchronous execution S[1, r] such that the views of processes in P at the end of round r are consistent with S[1, r]. We begin by proving that if two processes receive YES from the asynchrony detector in round r + 2, then they must have received each other's round r + 1 messages, either directly, or through a relay. Note that, because of the round structure, a process's round r + 1 message only contains information that it has acquired up to round r. In the following, we will use a superscript notation to denote the round at which the local variables are seen. For example, Active_q^{r+2}[r + 1] denotes the set Active[r + 1] at process q, as seen from the end of round r + 2.

Lemma 1. Let p and q be two processes that receive YES from AD(2) at the end of round r + 2. Then p ∈ Active_q^{r+2}[r + 1] and q ∈ Active_p^{r+2}[r + 1].

Proof. We prove that p ∈ Active_q^{r+2}[r + 1]; the proof of the second statement is symmetric. Assume, for the sake of contradiction, that p ∉ Active_q^{r+2}[r + 1]. Then, by lines 8–9 of the process() procedure, none of the processes that send a message to q in round r + 2 received a message from p in round r + 1. However, this set of processes contains at least N − t > t elements, and therefore, in round r + 2, process p receives a message from at least one process that did not receive a message from p in round r + 1. Therefore p ∈ Active_p^{r+2}[r + 2] ∩ Failed_p^{r+2}[r + 1] (recall that p receives its own message in every round). Following the process() procedure for p, we obtain that synch_p = false in round r + 2, which means that process p receives NO from AD(2) in round r + 2, a contradiction.

Lemma 2. Let p and q be two processes in P. Then, for all rounds k < l ≤ r, Active_p^r[l] ∩ Failed_q^r[k] = ∅ and Active_q^r[l] ∩ Failed_p^r[k] = ∅, where the Active and Failed sets are seen from the end of round r.

Proof. We prove that, given r ≥ l > k, Active_p^r[l] ∩ Failed_q^r[k] = ∅; the other case is symmetric. Assume, for the sake of contradiction, that there exist rounds k < l ≤ r and a process s such that s ∈ Active_p^r[l] ∩ Failed_q^r[k]. Lemma 1 ensures that p and q communicate in round r + 1; therefore it follows that s ∈ Failed_p^{r+2}[k]. This means that s ∈ Active_p^{r+2}[l] ∩ Failed_p^{r+2}[k], for k < l, therefore p cannot receive YES in round r + 2, a contradiction.

The next lemma provides a sufficient condition for a set of processes to share a synchronous execution up to the end of some round R. The proof follows from the observation that the required synchronous execution E′ can be constructed by exactly following the contents of the Active and Failed sets of processes at every round in the execution.

Lemma 3. Let E be an R-round execution in ES(N, t), and P be a set of processes in Π such that, at the end of round R, the following two properties are satisfied:
1. For any p and q in P, and any round r ∈ {1, 2, ..., R − 1}, Active_p^R[r + 1] ∩ Failed_q^R[r] = ∅.
2. |⋂_{p∈P} Active_p^R[R]| ≥ N − t.
Then there exists a synchronous execution E′ which is indistinguishable from the views of processes in P at the end of round R.

Finally, we prove that if a set of processes P receive YES from AD(2) at the end of some round R + 2, then there exists a synchronous execution consistent with their views at the end of round R, for any R > 0, i.e. that AD(2) is indeed a 2-delay asynchrony detector. The proof follows from the previous results.

Lemma 4. Let R > 0 be a round and P be a set of processes that receive YES from AD(2) at the end of round R + 2. Then there exists a synchronous execution consistent with their views at the end of round R.

5 Generating Indulgent Algorithms for Colorless Tasks

5.1 Task Definition

In the following, a task is a tuple (I, O, Δ), where I is the set of vectors of input values, O is a set of vectors of output values, and Δ is a total relation from I to O. A solution to a task, given an input vector I, yields an output vector O ∈ O such that O ∈ Δ(I). Intuitively, a colorless task is a terminating task in which any process can adopt any input or output value of any other process, without violating the task specification, and in which any (decided) output value is a (proposed) input value. We also assume that the output values have to verify a predicate P, such as agreement or k-agreement. For example, in the case of consensus, the predicate P states that all output values should be equal. Let val(V) be the set of values in a vector V. We precisely define this family of tasks as follows. A colorless task satisfies the following properties: (1) Termination: every correct process eventually outputs; (2) Validity: for every O ∈ Δ(I), val(O) ⊆ val(I); (3) The Colorless property: if O ∈ Δ(I), then for every I′ with val(I′) ⊆ val(I): I′ ∈ I and Δ(I′) ⊆ Δ(I). Also, for every O′ with val(O′) ⊆ val(O): O′ ∈ O and O′ ∈ Δ(I). Finally, we assume that the outputs satisfy a generic property (4) Output Predicate: every O ∈ O satisfies a given predicate P. Consensus and k-set agreement are canonical examples of colorless tasks.

5.2 Transformation Description

We present an emulation technique that generates an indulgent protocol in ES(N, t) out of any protocol in S(N, t) solving a given colorless task T, at the cost of two communication rounds. If the system is not synchronous, the generated protocol will run a given backup protocol Backup which ensures safety even in asynchronous executions. For example, if a protocol solves synchronous consensus in t + 1 rounds (i.e. it is optimal), then the resulting protocol will solve consensus in t + 3 rounds if the system is initially synchronous. Otherwise, the protocol reverts to a safe backup, e.g. Paxos [5] or ASAP [6]. We fix a protocol A solving a colorless task in the synchronous model S(N, t). The running time of the synchronous protocol is known to be R rounds. In the first phase of the transformation, each process p runs the AD(2) asynchrony detector in parallel with the protocol A, as long as the detector returns a YES indication at every round. Note that the protocol's messages are included in the detector's messages (or vice versa), preventing the possibility that the protocol encounters asynchronous message
deliveries without the detector noticing. If the detector returns NO during this phase, the process stops running the synchronous protocol, and continues running only AD(2). If the process receives YES at the end of round R + 2, then it returns the decision value that A produced at the end of round R.¹ On the other hand, if the process receives NO from AD(2) in round R + 2, i.e. asynchrony was detected, then the process will run the second phase of the transformation. More precisely, in phase two, the process will run a backup agreement protocol that tolerates periods of asynchrony (for example, the K4 protocol [10], if the task is k-set agreement). The main question is how to initialize the backup protocol, given that some of the processes may have already decided in phase one, without breaking the properties of the task. We solve this problem as follows. Let Supp (the support set) be the set of processes from which p receives messages in round R + 2 and which received YES from AD(2) in round R + 1. There are two cases. (1) If the set Supp is empty, then the process starts running the backup protocol using its initial proposal value. (2) If the set Supp is non-empty, then the process obtains a new proposal value as follows. It picks one process from Supp and adopts its state at the end of round R − 1. Then, in round R, it simulates receiving the messages in ⋃_{j∈Supp} msgSet_j^{R+1}[R], where we maintain the notation used in Section 4. We will show that in this case, the simulated protocol A will necessarily return a decision value at the end of simulated round R. The process p then runs the backup protocol, using as initial value the decision value resulting from the simulation of the first R rounds.

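The phase-two initialization rule, cases (1) and (2) above, can be sketched in a few lines of Python; the function name and the `simulate` callback (standing in for adopting a state from the support set and replaying round R of A) are our own illustrative conventions.

```python
def phase_two_proposal(initial_value, round_r2_senders, yes_at_r1, simulate):
    """Pick the value fed to the backup protocol when AD(2) says NO at R + 2.

    round_r2_senders: processes we heard from in round R + 2;
    yes_at_r1:        processes known to have received YES at round R + 1;
    simulate:         callable that adopts a state from the support set and
                      replays round R of the synchronous algorithm A.
    """
    supp = sorted(q for q in round_r2_senders if q in yes_at_r1)  # support set
    if not supp:
        return initial_value      # case (1): fall back to the own proposal
    return simulate(supp)         # case (2): decision recovered by simulation
```

The point of the hand-off is visible in the two branches: a non-empty support set forces the process to propose a value consistent with any decision already taken in phase one, so the backup's validity property preserves the output predicate.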
5.3 Proof of Correctness

We now prove that the resulting protocol verifies the task specification. The proofs of termination, validity, and the colorless property follow from the properties of the A and Backup protocols; therefore we will concentrate on proving that the resulting protocol also satisfies the output predicate P.

Theorem 1 (Output Predicate). The indulgent transformation protocol satisfies the output predicate P associated to the task T.

Proof. Assume, for the sake of contradiction, that there exists an execution in which the output of the transformation breaks the output predicate P. If all process decisions are made at the end of round R + 2, then, by the global detection property of AD(2), there exists a synchronous execution of A in which the same outputs are decided, breaking the predicate P, a contradiction. If all decisions occur after round R + 2, first notice that, by the validity and colorless properties, the inputs processes propose to the Backup protocol are always valid inputs for the task. It follows that, since all decisions are output by Backup, there exists an execution of the Backup protocol in which the predicate P is broken, again a contradiction. Therefore, at least one process outputs at the end of round R + 2, and some processes decide at some later round. We prove the following claim.

¹ Since AD(2) returns YES at process p at the end of round R + 2, it follows that it must have returned YES at p at the end of round R as well. The local detection property of the asynchrony detector implies that the protocol A has to return a decision value, since it executes a synchronous execution.

Claim. If a process decides at the end of round R + 2, then (i) all correct processes will have a non-empty support set Supp and (ii) there exists an R-round synchronous execution consistent with the views that all correct processes adopt at the end of round R + 2. Proof (Sketch). First, let d be a process that decides at the end of round R + 2. Then, in round R + 2, process d received a message from at least N − t processes that got YES from AD(2) at the end of round R + 1. Since N ≥ 2t + 1, it follows that every process that has not crashed by the end of round R + 2 will have received at least one message from a process that has received YES from AD(2) in round R + 1; therefore, all non-crashed processes that get NO from AD(2) in round R + 2 will execute case 2, which ensures the first claim. Let Q = {q1 , . . . , q } be the non-crashed processes at the end of round R + 2. By the above claim, we know that these processes either decide or simulate an execution. We prove that all views simulated in this round are consistent with a synchronous execution up to the end of round R, in the sense of Lemma 3. To prove that the intersection of their simulated views in round R contains at least (N − t) messages, notice that the processes from which process d receives messages in round R + 2 are necessarily in this intersection, since otherwise process d would receive NO in round R + 2. To prove the first condition of Lemma 3, note that process d’s view of round R, i.e. [R], contains all messages simulated as received in round R by the the set msgSet R+2 d processes that receive NO in round R + 2. Since N − t > t, every process that receives NO in round R + 2 from the detector also receives a message supporting d’s decision in round R + 2; process d receives the same message and does not notice any asynchrony. 
Therefore, we can apply Lemma 3 to obtain that there exists a synchronous execution of the protocol A in which the processes in Q obtain the same decision values as the values obtained through the simulation or decision at the end of round R + 2. Returning to the proof of the output predicate, recall that there exists at least one process d which outputs at the end of round R + 2. From the above Claim, it follows that all non-crashed processes simulate synchronous views of the first R rounds. Therefore all non-crashed processes will receive an output from the synchronous protocol A. Moreover, these synchronous views of the processes are consistent with a synchronous execution; therefore the set of outputs received by non-crashed processes verifies the predicate P. Hence all the inputs that the processes propose to the Backup protocol verify the predicate P. Since Backup respects validity, it follows that the outputs of Backup will also verify the predicate P.

6 A Protocol for Strong Indulgent Renaming

6.1 Protocol Description

In this section, we present an emulation technique that transforms any synchronous renaming protocol into an indulgent renaming protocol. For simplicity, we will assume that the synchronous renaming protocol is the one by Herlihy et al. [8], which is time-optimal, terminating in ⌈log N⌉ + 1 synchronous rounds. The resulting indulgent protocol will rename into a namespace of N names using ⌈log N⌉ + 3 rounds of communication if the system is initially synchronous, and will eventually rename into a namespace of N + t names if the system is


D. Alistarh et al.

asynchronous, by safely reverting to a backup constituted by the asynchronous renaming algorithm by Attiya et al. [7]. Again, the protocol is structured into two phases.

First Phase. During the first ⌈log N⌉ + 1 rounds, processes run the AD(2) asynchrony detector in parallel with the synchronous renaming algorithm. Note that the protocol's messages are included in the detector's messages. If the detector returns NO at one of these rounds, then the process stops running the synchronous algorithm, and continues only with the detector. If, at the end of round ⌈log N⌉ + 1, the process receives YES from AD(2), then it also receives a name name_i as the decision value of the synchronous protocol.

Second Phase. At the end of round ⌈log N⌉ + 1, the processes start the asynchronous renaming algorithm of [7]. More precisely, each process builds a vector V with a single entry, which contains the tuple ⟨v_i, name_i, J_i, b_i, r_i⟩, where v_i is the process's initial value. The entry name_i is the proposed name, which is either the name returned by the synchronous renaming algorithm, if the process received YES from the detector, or ⊥, otherwise. The entry J_i counts the number of times the process proposed a name: it is 1 if the process has received YES from the detector, and 0 otherwise; b_i is the decision bit, which is initially 0. Finally, r_i is the round number² when the entry was last updated, which is in this case ⌈log N⌉ + 1. The processes broadcast their vectors V for the next two rounds, while continuing to run the asynchrony detector in parallel. The contents of the vector V are updated at every round, as follows: if a vector V containing new entries is received, the process adds all the new entries to its vector; if there are conflicting entries corresponding to the same process, the tie is broken using the round number r_i. If, at the end of round ⌈log N⌉ + 3, the process receives YES from the detector, then it decides on name_i.
Otherwise, it continues running the AttiyaRenaming algorithm until decision is possible.

6.2 Proof of Correctness

The first step in the proof of correctness of the transformation provides some properties of the asynchronous renaming algorithm of [7]. More precisely, the first Lemma states that the asynchronous renaming algorithm remains correct even though processes propose names initially, that is, at the beginning of round ⌈log N⌉ + 2. The proof follows from an examination of the protocol and proofs from [7].

Lemma 5. The asynchronous renaming protocol of [7] ensures termination, name uniqueness, and a name space bound of N + t, even if processes propose names at the beginning of the first round.

The previous Lemma ensures that the transformation guarantees termination, i.e. that every correct process eventually returns a name. The non-triviality property of the asynchrony detector ensures that the resulting algorithm will terminate in ⌈log N⌉ + 3 rounds in any synchronous run. In the following, we will concentrate on the uniqueness of the names and on the bounds on the resulting namespace. We start by proving that the protocol does not generate duplicate names.

² This entry in the vector is implied in the original version of the algorithm [7].
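The vector-update rule of the second phase can be sketched as follows. This is a minimal sketch: representing V as a dictionary keyed by the process's initial value, and an entry as a (name, J, b, r) tuple, are our own assumptions for illustration, not the representation used in [7].

```python
def merge_views(V_local, V_received):
    """Merge a received renaming vector into the local one.

    Entries are keyed by the proposing process's initial value v and
    hold (name, J, b, r); a conflict between two entries for the same
    process is broken using the round number r (the fresher entry wins).
    """
    merged = dict(V_local)
    for v, entry in V_received.items():
        # adopt a new entry, or a fresher entry for a known process
        if v not in merged or entry[3] > merged[v][3]:
            merged[v] = entry
    return merged
```

A process would apply this merge to every vector received during the two broadcast rounds, so that by round ⌈log N⌉ + 3 its vector reflects the freshest entry seen for each participant.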


Lemma 6 (Uniqueness). Given any two names n_i, n_j returned by distinct processes in an execution, we have that n_i ≠ n_j.

Proof (Sketch). Assume for the sake of contradiction that there exists a run in which two processes p_i, p_j decide on the same name n_0. First, we consider the case in which both decisions occurred at round ⌈log N⌉ + 3, the first round at which a process can decide using our emulation. Notice that, if a decision is made, the processes necessarily decide on the decision value of the simulated synchronous protocol³. By the global detection property of AD(2) it then follows that there exists a synchronous execution of the synchronous renaming protocol in which two distinct processes return the same value, contradicting the correctness of the protocol. Similarly, we can show that if both decisions occur after round ⌈log N⌉ + 3, we can reduce the correctness of the transformation to the correctness of the asynchronous protocol. Therefore, the remaining case is that in which one of the decisions occurs at round ⌈log N⌉ + 3, and the other decision occurs at a later round, i.e. it is a decision made by the asynchronous renaming protocol. In this case, let p_i be the process that decides on n_0 at the end of round ⌈log N⌉ + 3. This implies that process p_i received YES at the end of round ⌈log N⌉ + 3 from AD(2). Therefore, since p_i sees a synchronous view, there exists a set S of at least N − t processes that received p_i's message reserving name n_0 in round ⌈log N⌉ + 2. It then follows that each non-crashed process receives a message from a process in the set S in round ⌈log N⌉ + 3. By the structure of the protocol, we obtain that each process has the entry ⟨v_i, n_0, 1, 0, ⌈log N⌉ + 1⟩ in its V vector at the end of round ⌈log N⌉ + 3. It follows from the structure of the asynchronous protocol that no process other than p_i will ever decide on the name n_0 at any later round, which concludes the proof of the Lemma.
Finally, we prove that the transformation ensures the following guarantees on the size of the namespace.

Lemma 7 (Namespace Size). The transformation ensures the following properties: (1) In synchronous executions, the resulting algorithm will rename in a namespace of at most N names. (2) In any execution, the resulting algorithm will rename in a namespace of at most N + t names.

Proof (Sketch). For the proof of the first property, notice that, in a synchronous execution, any output combination for the transformation is an output combination for the synchronous renaming protocol. For the second property, let ℓ ≥ 0 be the number of names decided on at the end of round ⌈log N⌉ + 3 in a run of the protocol. These names are clearly between 1 and N. Lemma 6 guarantees that none of these names is decided on in the rest of the execution. On the other hand, Lemma 5 and the namespace bound of N + t for the asynchronous protocol ensure that the asynchronous protocol decides exclusively on names between 1 and N + t, which concludes the proof of the claim.

7 Conclusions and Future Work

In this paper, we have introduced a general transformation technique from synchronous algorithms to indulgent algorithms, and applied it to obtain indulgent solutions for a

³ A simple analysis of the asynchronous renaming protocol shows that a process cannot decide after two rounds of communication, unless it had already proposed a value at the beginning of the first round.


large class of distributed tasks, including consensus, set agreement and renaming. Our results suggest that, even though it is generally hard to design asynchronous algorithms in fault-prone systems, one can obtain efficient algorithms that tolerate asynchronous executions starting from synchronous algorithms. In terms of future work, we first envision generalizing our technique to generate algorithms that also work in a window of synchrony, and investigating its limitations in terms of time and communication complexity. Another interesting research direction would be to analyze whether similar techniques exist in the case of Byzantine failures: in particular, whether, starting from a synchronous fault-tolerant algorithm, one can obtain a Byzantine fault-tolerant algorithm tolerating asynchronous executions.

Acknowledgements. The authors would like to thank Prof. Hagit Attiya and Nikola Knežević for their help on previous drafts of this paper, and the anonymous reviewers for their useful feedback.

References

1. Guerraoui, R.: Indulgent algorithms. In: PODC 2000, pp. 289–297. ACM, New York (July 2000)
2. Dwork, C., Lynch, N.A., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. ACM 35, 288–323 (1988)
3. Dutta, P., Guerraoui, R.: The inherent price of indulgence. In: PODC 2002: Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, pp. 88–97 (2002)
4. Lamport, L.: Fast paxos. Distributed Computing 19(2), 79–103 (2006)
5. Lamport, L.: Generalized consensus and paxos. Microsoft Research Technical Report MSR-TR-2005-33 (March 2005)
6. Alistarh, D., Gilbert, S., Guerraoui, R., Travers, C.: How to solve consensus in the smallest window of synchrony. In: Taubenfeld, G. (ed.) DISC 2008. LNCS, vol. 5218, pp. 32–46. Springer, Heidelberg (2008)
7. Attiya, H., Bar-Noy, A., Dolev, D., Peleg, D., Reischuk, R.: Renaming in an asynchronous environment. Journal of the ACM 37(3), 524–548 (1990)
8. Chaudhuri, S., Herlihy, M., Tuttle, M.R.: Wait-free implementations in message-passing systems. Theor. Comput. Sci. 220(1), 211–245 (1999)
9. Dutta, P., Guerraoui, R.: The inherent price of indulgence. Distributed Computing 18(1), 85–98 (2005)
10. Alistarh, D., Gilbert, S., Guerraoui, R., Travers, C.: Of choices, failures and asynchrony: The many faces of set agreement. In: Dong, Y., Du, D.-Z., Ibarra, O. (eds.) ISAAC 2009. LNCS, vol. 5878. Springer, Heidelberg (2009)
11. Chandra, T.D., Toueg, S.: Unreliable failure detectors for asynchronous systems (preliminary version). In: ACM Symposium on Principles of Distributed Computing, pp. 325–340 (August 1991)
12. Gafni, E.: Round-by-round fault detectors (extended abstract): Unifying synchrony and asynchrony. In: Proceedings of the 17th Symposium on Principles of Distributed Computing (1998)
13. Dutta, P., Guerraoui, R., Keidar, I.: The overhead of consensus failure recovery. Distributed Computing 19(5-6), 373–386 (2007)
14. Delporte-Gallet, C., Fauconnier, H., Guerraoui, R., Tielmann, A.: The disagreement power of an adversary. In: Keidar, I. (ed.) DISC 2009. LNCS, vol. 5805, pp. 8–21. Springer, Heidelberg (2009)

An Efficient Decentralized Algorithm for the Distributed Trigger Counting Problem

Venkatesan T. Chakaravarthy¹, Anamitra R. Choudhury¹, Vijay K. Garg², and Yogish Sabharwal¹

¹ IBM Research - India, New Delhi
{vechakra,anamchou,ysabharwal}@in.ibm.com
² University of Texas at Austin
[email protected]

Abstract. Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-specified input. The problem has applications in monitoring, global snapshots, synchronizers and other distributed settings. The main result of the paper is a decentralized and randomized algorithm with expected message complexity O(n log n log w). Moreover, every processor in this algorithm receives no more than O(log n log w) messages with high probability. All the earlier algorithms for this problem have maximum message load of Ω(n log w).

1 Introduction

In this paper, we study the distributed trigger counting (DTC) problem. Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-specified input. We note w may be much larger than n. The sequence of processors receiving the w triggers is not known a priori to the system. Moreover, the number of triggers received by each processor is also not known. We are interested in designing distributed algorithms for the DTC problem that are communication efficient and are also decentralized.

The DTC problem arises in applications such as distributed monitoring and global snapshots. Monitoring is an important issue in networked systems such as sensor networks and data networks. Sensor networks are typically employed to monitor physical or environmental conditions such as traffic volume, wildlife behavior, troop movements and atmospheric conditions, among others. For example, in traffic management, one may be interested in raising an alarm when the number of vehicles on a highway exceeds a certain threshold. Similarly, one may wish to monitor a wildlife region for the sightings of a particular species, and raise an alert when the number crosses a threshold.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 53–64, 2011. © Springer-Verlag Berlin Heidelberg 2011

In the case of data networks,


example applications are monitoring the volume of traffic or the number of remote logins. See, for example, [7] for a discussion of applications of distributed monitoring. In the context of global snapshots (for example, checkpointing), a distributed system must record all the in-transit messages in order to declare the snapshot to be valid. Garg et al. [4] showed that the problem of determining whether all the in-transit messages have been received can be reduced to the DTC problem (they call this the distributed message counting problem). In the context of synchronizers [1], a distributed system is required to generate the next pulse when all the messages generated in the current pulse have been delivered. Any message in the current pulse can be viewed as a trigger of the DTC problem.

Our goal is to design a distributed algorithm for the DTC problem that is communication efficient and decentralized. We use the following two natural parameters that measure these two important aspects.

– The message complexity, i.e., the number of messages exchanged between the processors.
– The MaxRcvLoad, i.e., the maximum number of messages received by any processor in the system.

Garg et al. [4] studied the DTC problem for a general distributed system. They presented two algorithms: a centralized algorithm and a tree-based algorithm. The centralized algorithm has message complexity O(n log w). However, the MaxRcvLoad of this algorithm can be as high as Ω(n log w). The tree-based algorithm has message complexity O(n log n log w). This algorithm is more decentralized in a heuristic sense, but its MaxRcvLoad can be as high as O(n log n log w) in the worst case. They also proved a lower bound on the message complexity: they showed that any deterministic algorithm for the DTC problem must have message complexity Ω(n log(w/n)). So, the message complexity of the centralized algorithm is asymptotically optimal. However, this algorithm has MaxRcvLoad as high as its message complexity.
In this paper, we consider a general distributed system where any processor can communicate with any other processor and all the processors are capable of performing basic computations. We assume an asynchronous model of computation and messages. We assume that the messages are guaranteed to be delivered but there is no fixed upper bound on the message arrival time. Also, messages are not corrupted or spuriously introduced. This setting is common in data networks. We also assume that there are no faults in the processors and that the processors do not fail.

Our main result is a decentralized randomized algorithm called LayeredRand that is efficient in terms of both the message complexity and MaxRcvLoad. Its message complexity is O(n log n log w). Moreover, with high probability, its MaxRcvLoad is O(log n log w). The message complexity of our algorithm is the same as that of the tree-based algorithm of Garg et al. [4]. However, the MaxRcvLoad of our algorithm is significantly better than both their tree-based and centralized algorithms. It is important to minimize MaxRcvLoad for many applications. For example, in sensor networks where the message processing may

Algorithm        Message Complexity   MaxRcvLoad
Tree-based [4]   O(n log n log w)     O(n log n log w)
Centralized [4]  O(n log w)           O(n log w)
LayeredRand      O(n log n log w)     O(log n log w)

Fig. 1. Summary of DTC Algorithms

consume limited power available at the node, a high MaxRcvLoad may reduce the lifetime of a node. Another important aspect of our algorithm is its simplicity. In particular, our algorithm is much simpler than both the algorithms of Garg et al. A comparison of our algorithm with the earlier results is summarized in Fig. 1. Designing an algorithm with message complexity O(n log w) and MaxRcvLoad O(log w) remains a challenging open problem. Our main result is formally stated next. For 1 ≤ i ≤ w, the external source delivers the ith trigger to some processor x_i. We call the sequence x_1, x_2, . . . , x_w a trigger pattern.

Theorem 1. Fix any trigger pattern. The message complexity of the LayeredRand algorithm is O(n log n log w). Furthermore, there exist constants c and d ≥ 1 such that

Pr[MaxRcvLoad ≥ c log n log w] ≤ 1/n^d.

The above bounds hold for any trigger pattern, even if fixed by an adversary.

Related work. Most prior work (e.g. [3,7,6]) primarily considers the DTC problem in a centralized setting where one of the processors acts as a master and coordinates the system, and the other processors act as slaves. The slaves can communicate only with the master (they cannot communicate among themselves). Such a scenario applies where a communication network linking the slaves does not exist or the slaves have only limited computational power. Prior work addresses various issues arising in such a setup, such as message complexity. It also considers variations and generalizations of the DTC problem. One such variation is approximate threshold computation, where the system need not raise an alert on seeing exactly w triggers; it suffices if the alert is raised upon seeing at most (1 + ε)w triggers, where ε is some user-specified tolerance parameter. Prior work also considers aggregate functions more general than counting. Here, each input trigger i is associated with a value α_i. The goal is to raise an alert when some aggregate of these values crosses the threshold (an example aggregate function is sum). Note that the Echo or Wave algorithms [2,9,10] and the framework of repeated global computation [5] are not easily applicable for the DTC problem because the triggers arrive at processors asynchronously at unknown times. Computing the sum of all the trigger counts just once is not enough, and repeated computation results in an excessive number of messages.

2 A Deterministic Algorithm

For the DTC problem, Garg et al. [4] presented an algorithm with message complexity O(n log w). In this section, we describe a simple alternative deterministic algorithm having the same message complexity. The aim of presenting this algorithm is to highlight the difficulties in designing an algorithm that simultaneously achieves good message complexity and MaxRcvLoad bounds.

A naive algorithm for the DTC problem works as follows. One of the processors acts as a master and every processor sends a message to the master upon receiving each trigger. The master keeps count of the total number of triggers received. When the count reaches w, the user is informed and the protocol ends. The disadvantage of this algorithm is that its message complexity is O(w). A natural idea is to avoid sending a message to the master for every trigger received. Instead, a processor will send one message for every B triggers received. Clearly, setting B to a high value will reduce the number of messages. However, care should be taken to ensure that the system does not enter the dead state. For instance, suppose we set B = w/2. Then, the adversary can send w/4 triggers to each of four selected processors. Notice that none of these processors would send a message to the master. Thus, even though all the w triggers have been delivered by the adversary, the system will not detect the termination. We say that the system is in the dead state.

Our deterministic algorithm with message complexity O(n log w) is described next. A predetermined processor serves as the master. The algorithm works in multiple rounds. We start by setting two parameters: ŵ = w and B = ŵ/(2n). Each processor sends a message to the master for every B triggers received. The master keeps count of the triggers reported by other processors and the triggers received by itself. When the count reaches ŵ/2, it declares end-of-round and sends a message to all the processors to this effect. In return, each processor sends the number of unreported triggers to the master (namely, the triggers not reported to the master). This way, the master can compute w′, the total number of triggers received in the current round. It recomputes ŵ = ŵ − w′; the new ŵ is the number of triggers yet to be received. The master recomputes B = ŵ/(2n) and sends this number to every processor. The next round starts. When ŵ < 2n, we set B = 1.

We now argue that the system never enters a dead state. Consider the state of the system in the middle of any round. Each processor has fewer than ŵ/(2n) unreported triggers. Thus, the total number of unreported triggers is less than ŵ/2. The master's count of reported triggers is less than ŵ/2. Thus, the total number of triggers delivered so far is less than ŵ. So, some more triggers are yet to be delivered. It follows that the system is never in a dead state and the system will correctly terminate upon receiving all the w triggers.

Notice that in each round, ŵ decreases by at least a factor of 2. So, the algorithm terminates after log w rounds. Consider any single round. A message is sent to the master for every B triggers received and the round gets completed when the master's count reaches ŵ/2. Thus, the number of messages sent to the master is ŵ/(2B) = n. At the end of each round, O(n) messages are exchanged between the master and the other processors. Thus, the number of

[Figure 2 shows n = 15 processors arranged in four layers: Layer 0 (the root), Layer 1, Layer 2, and the leaf Layer 3. A directed edge from a processor u to a processor v indicates that u may send a coin to v in the layer above; this is drawn for the top three layers.]

Fig. 2. Illustration for LayeredRand

messages per round is O(n). The total number of messages exchanged during all the rounds is O(n log w). The above algorithm is efficient in terms of message complexity. However, the master may receive up to O(n log w) messages and so, the MaxRcvLoad of the algorithm is O(n log w). In the next section, we present an efficient randomized algorithm which simultaneously achieves provably good message complexity and MaxRcvLoad bounds.

3 LayeredRand Algorithm

In this section, we present a randomized algorithm called LayeredRand. Its message complexity is O(n log n log w) and, with high probability, its MaxRcvLoad is O(log n log w). For ease of exposition, we first describe our algorithm under the assumption that the triggers are delivered one at a time; meaning, all the processing required for handling a trigger is completed before the next trigger arrives. This assumption allows us to better explain the core ideas of the algorithm. We will discuss how to handle the concurrency issues in Sect. 5.

For the sake of simplicity, we assume that n = 2^L − 1, for some integer L. The n processors are arranged in L layers numbered 0 through L − 1. For 0 ≤ ℓ < L, layer ℓ consists of 2^ℓ processors. Layer 0 consists of a single processor, which we refer to as the root. Layer L − 1 is called the leaf layer. The layering is illustrated in Fig. 2, for n = 15. Only processors occupying adjacent layers communicate with each other.

The algorithm proceeds in multiple rounds. At the beginning of each round, the system needs to know how many triggers are yet to be received. This can be computed by keeping track of the total number of triggers received in all the previous rounds and subtracting this quantity from w. Let the term initial value of a round mean the number of triggers yet to be received at the beginning of the round. We use a variable ŵ to store the initial value of the current round. In the first round, we set ŵ = w, since all the w triggers are yet to be received.
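The layer structure can be sketched as follows. Assigning processor ids level by level (heap-style, root = 0) is our own convention for illustration; the paper fixes only the layer sizes.

```python
def layers(n):
    """Split processor ids 0..n-1 into layers for n = 2**L - 1.

    Layer l holds the 2**l ids 2**l - 1 .. 2**(l+1) - 2, so layer 0
    is the root and layer L - 1 is the leaf layer."""
    L = (n + 1).bit_length() - 1        # L such that n = 2**L - 1
    assert n == 2**L - 1, "LayeredRand assumes n = 2**L - 1"
    return [list(range(2**l - 1, 2**(l + 1) - 1)) for l in range(L)]
```

For n = 15 this yields four layers of sizes 1, 2, 4 and 8, matching Fig. 2.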


We next describe the procedure followed in a single round. Let ŵ denote the initial value of this round. For each 1 ≤ ℓ < L, we compute a threshold τ(ℓ) for the layer ℓ:

τ(ℓ) = ⌈ ŵ / (4 · 2^ℓ · log(n + 1)) ⌉.

Each processor x maintains a counter C(x), which is used to keep track of some of the triggers received by x and by other processors occupying the layers below that of x. The exact semantics of C(x) will become clear shortly. The counter is reset to zero at the beginning of the round. Consider any non-root processor x occupying a level ℓ. Whenever x receives a trigger, it increments C(x) by one. If C(x) reaches the threshold τ(ℓ), x chooses a processor y occupying level ℓ − 1 uniformly at random and sends a message to y. We refer to such a message as a coin. Upon receiving the coin, the processor y updates C(y) by adding τ(ℓ) to C(y). Intuitively, receipt of a coin by y means that y has evidence that some processors below the layer ℓ − 1 have received τ(ℓ) triggers. After the update, if C(y) ≥ τ(ℓ − 1), y picks a processor z occupying level ℓ − 2 uniformly at random and sends a coin to z. Then, processor y updates C(y) = C(y) − τ(ℓ − 1). Processor z handles the coin similarly. See Fig. 2. A directed edge from a processor u to a processor v means that u may send a coin to v. Thus, a processor may send a coin to any processor in the layer above. This is illustrated for the top three layers in the figure.

We now formally describe the behavior of a non-root processor x occupying a level ℓ. Whenever x receives a trigger from the external source or a coin from level ℓ + 1, it behaves as follows:

– If a trigger is received, increment C(x) by one.
– If a coin is received from level ℓ + 1, update C(x) = C(x) + τ(ℓ + 1).
– If C(x) ≥ τ(ℓ):
  • Among the 2^{ℓ−1} processors occupying level ℓ − 1, pick a processor y uniformly at random and send a coin to y.
  • Update C(x) = C(x) − τ(ℓ).

The behavior of the root is similar to that of the other processors, except that it does not send coins. The root processor r also maintains a counter C(r). Whenever it receives a trigger from the external source, it increments C(r) by one. If it receives a coin from level 1, it updates C(r) = C(r) + τ(1). An important observation is that at any point of time, any trigger received by the system in the current round is accounted for in the counter C(x) of exactly one processor x. This means that the sum of C(x) over all the processors gives the exact count of the triggers received in the system so far in this round. This observation will be useful in proving the correctness of the algorithm.

The crucial activity of the root is to initiate an end-of-round procedure. When C(r) reaches ŵ/2 (i.e., when C(r) ≥ ŵ/2), the root declares end-of-round. Now, the root needs to get a count of the total number of triggers received by all the processors in this round. Let this count be w′. The processors are arranged in a pre-determined binary tree formation such that each processor x


has exactly one parent from the layer above and exactly two children from the layer below. The end-of-round notification can be broadcast to all the processors in a recursive top-down manner. Similarly, the sum of C(x) over all the processors can be reduced at the root in a recursive bottom-up manner. Thus, the root obtains the value w′, i.e., the total number of triggers received in the system in this round. The root then updates the initial value for the next round by computing ŵ = ŵ − w′, and broadcasts this to all the processors, again in a recursive fashion. All the processors then update their τ(ℓ) values for the new round. This marks the start of the next round. Notice that in the end-of-round process, each processor receives at most a constant number of messages. At the end of any round, if the newly computed ŵ is zero, we know that all the w triggers have been received. So, the root can raise an alert to the user and the algorithm is terminated.

It is easy to derive a bound on the number of rounds taken by the algorithm. Observe that in successive rounds the initial value drops by a factor of two (meaning, the ŵ of round i + 1 is at most half the ŵ of round i). Thus, the algorithm takes at most log w rounds.
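A single round of the coin-passing scheme can be sketched as a sequential simulation under the one-trigger-at-a-time assumption. The heap-style id scheme, the use of base-2 logarithms, and the ceiling in τ are illustrative assumptions of this sketch, not choices fixed by the paper.

```python
import math
import random

def layered_rand_round(n, w_hat, pattern, rng=random.Random(1)):
    """Deliver triggers one at a time for a single LayeredRand round;
    return the number of triggers consumed before the root declares
    end-of-round. Layer l holds processor ids 2**l - 1 .. 2**(l+1) - 2."""
    L = (n + 1).bit_length() - 1          # number of layers, n = 2**L - 1
    log_n = math.log2(n + 1)
    tau = [None] + [math.ceil(w_hat / (4 * 2**l * log_n))
                    for l in range(1, L)]
    C = [0] * n                           # per-processor counters

    def credit(x, amount):
        """Add `amount` to C(x); forward a coin one layer up if the
        threshold is reached. Returns True on root end-of-round."""
        C[x] += amount
        l = (x + 1).bit_length() - 1      # layer of processor x
        if l == 0:
            return C[0] >= w_hat / 2      # root declares end-of-round
        if C[x] >= tau[l]:
            C[x] -= tau[l]
            # random processor in the layer above receives the coin
            y = rng.randrange(2**(l - 1) - 1, 2**l - 1)
            return credit(y, tau[l])
        return False

    for i, x in enumerate(pattern):
        if credit(x, 1):
            # here the tree reduction would compute w' = sum of all
            # counters and the next round's w_hat = w_hat - w'
            return i + 1
    return len(pattern)  # unreachable when pattern delivers all w_hat
```

Note that the invariant from the text holds in this sketch: every delivered trigger is accounted for in exactly one counter, since `credit` only moves value, never creates or destroys it.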

4 Analysis of the LayeredRand Algorithm

Here, we prove the correctness of the algorithm and then prove the message bounds.

4.1 Correctness of the Algorithm

We now show that the system will correctly raise an alert to the user when all the w triggers are received. The main part of the proof involves showing that after starting a new round, the root always enters the end-of-round procedure, i.e., the system does not get stalled in the middle of the round when all the triggers have been delivered. We denote the set of all processors by P. Consider any round and let ŵ be the initial value of the round. Let x be any non-root processor and let ℓ be the layer in which x is found. Notice that at any point of time, we have C(x) ≤ τ(ℓ) − 1. Thus, we can derive a bound on the sum of C(x):

Σ_{x ∈ P − {r}} C(x) ≤ Σ_{ℓ=1}^{L−1} 2^ℓ (τ(ℓ) − 1) ≤ (L − 1)ŵ / (4 · log(n + 1)) ≤ ŵ/4.

Now suppose that all the outstanding ŵ triggers have been delivered to the system in this round. We already saw that at any point of time, Σ_{x ∈ P} C(x) gives the number of triggers received by the system so far in the current round.¹ Thus, Σ_{x ∈ P} C(x) = ŵ. It follows that the counter at the root C(r) satisfies C(r) ≥ 3ŵ/4 ≥ ŵ/2. But this means that the root would initiate the end-of-round procedure. We conclude that the system will not enter a dead state.

¹ We note that C(r) is an integer, and hence this holds even when ŵ = 1.
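The bound Σ_{x≠r} C(x) ≤ ŵ/4 can be checked numerically. The base-2 logarithm and the ceiling in τ are assumptions carried over from our earlier sketch of the algorithm.

```python
import math

def counter_slack(n, w_hat):
    """Worst-case sum of the non-root counters: sum over layers of
    2**l * (tau(l) - 1), which the proof bounds by w_hat / 4."""
    L = (n + 1).bit_length() - 1        # n = 2**L - 1 layers
    log_n = math.log2(n + 1)
    tau = lambda l: math.ceil(w_hat / (4 * 2**l * log_n))
    return sum(2**l * (tau(l) - 1) for l in range(1, L))
```

Each layer contributes at most ŵ/(4 log(n+1)), and there are L − 1 = log(n+1) − 1 non-root layers, so the total stays strictly below ŵ/4.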


The above argument shows that the system always makes progress by moving into the next round. As we observed earlier, the initial value ŵ drops by a factor of at least two in each round. So, eventually, ŵ must become zero and the system will raise an alert to the user.

4.2 Bound on the Message Complexity

Lemma 1. The message complexity of the algorithm is O(n log n log w).

Proof: As argued before, the algorithm takes only O(log w) rounds to terminate. Consider any round and let ŵ be the initial value of the round. Consider any layer 1 ≤ ℓ < L. Every coin sent from layer ℓ to layer ℓ − 1 means that at least τ(ℓ) triggers have been received by the system in this round. Thus, the number of coins sent from layer ℓ to the layer ℓ − 1 can be at most ŵ/τ(ℓ). Summing up over all the layers, we get a bound on the total number of coins (messages) sent in this round:

Number of coins sent ≤ Σ_{ℓ=1}^{L−1} ŵ/τ(ℓ) ≤ Σ_{ℓ=1}^{L−1} 4 · 2^ℓ log n ≤ 4 · (n − 1) log n.

The end-of-round procedure involves only O(n) messages in any particular round. Summing up over all log w rounds, we see that the message complexity of the algorithm is O(n log n log w).

4.3 Bound on the MaxRcvLoad

In this section, we show that with high probability, the MaxRcvLoad is bounded by O(log n log w). We use the following Chernoff bound (see [8]) for this purpose.

Theorem 2 (see [8], Theorem 4.4). Let X be the sum of a finite number of independent 0–1 random variables, and let μ = E[X]. Then, for any r ≥ 6, Pr[X ≥ rμ] ≤ 2^{−rμ}. Moreover, for any μ′ ≥ μ, the inequality remains true if we replace μ by μ′ on both sides.

Lemma 2. Pr[MaxRcvLoad ≥ c log n log w] ≤ n^{−47}, for some constant c.

Proof: Let us first consider the number of coins received by any processor. Processors in the leaf layer do not receive any coins, so it suffices to consider the processors occupying the other layers. Consider any layer 0 ≤ ℓ ≤ L − 2 and let x be any processor found in layer ℓ. Let Mx be the random variable denoting the number of coins received by x. As discussed before, the algorithm takes at most log w rounds. In any given round, the number of coins received by layer ℓ is at most ŵ/τ(ℓ+1) ≤ 4 · 2^{ℓ+1} · log n. Thus, the total number of coins received by layer ℓ over all rounds is at most 4 · 2^{ℓ+1} · log n · log w. Each of these coins is sent uniformly and independently at random to one of the 2^ℓ processors occupying layer ℓ. Thus, the expected number of coins received by x is

  E[Mx] ≤ (4 · 2^{ℓ+1} · log n · log w) / 2^ℓ = 8 · log n · log w

An Eﬃcient Decentralized Algorithm for the DTC Problem

61

The random variable Mx is a sum of independent 0–1 random variables. Applying the Chernoff bound given by Theorem 2 (taking r = 6), we see that

  Pr[Mx ≥ 48 log n log w] ≤ 2^{−48 log n log w} < n^{−48}.

Applying the union bound, we see that Pr[there exists a processor x having Mx ≥ 48 log n log w] < n^{−47}. During the end-of-round process, a processor receives at most a constant number of messages in any round. So, the total number of such messages received by any processor is O(log w).
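The constant 48 leaves a lot of slack over the expectation of 8 log n log w, which a quick simulation illustrates. The coin-throwing model and the parameters below are ours, chosen to mirror the worst case used in the proof (a layer receiving its maximum possible number of coins, spread uniformly over its processors):

```python
import math
import random

def max_layer_load(n, w, level, seed=0):
    """Throw the at most 4 * 2^(level+1) * log2(n) * log2(w) coins that
    layer `level` can receive (over all rounds) at its 2^level processors
    uniformly at random, and return the maximum load of any processor."""
    rng = random.Random(seed)
    procs = 2 ** level
    coins = int(4 * (2 ** (level + 1)) * math.log2(n) * math.log2(w))
    load = [0] * procs
    for _ in range(coins):
        load[rng.randrange(procs)] += 1
    return max(load)

n, w, level = 1023, 2 ** 20, 5
observed = max_layer_load(n, w, level)
threshold = 48 * math.log2(n) * math.log2(w)
# The expected per-processor load is only 8 log n log w, far below the bound.
assert observed < threshold
```

With these (arbitrary) parameters, each of the 32 processors expects about 8 log n log w ≈ 1600 coins, while the high-probability bound of the lemma allows roughly 9600, so the assertion holds with large margin.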

5 Handling Concurrency

In this section, we discuss how to handle concurrency issues. All triggers and coin messages received by a processor can be placed into a queue and processed one at a time. Thus, there is no concurrency issue related to triggers and coins received within a round. However, concurrency issues do need to be handled at the end of a round. Towards this goal, we slightly modify the LayeredRand algorithm. The core functioning of the algorithm remains the same as before; we mainly modify the end-of-round procedure by adding some additional features (such as counters and queues). The rest of this section explains these features and the end-of-round procedure in detail. We also prove correctness of the algorithm in the presence of concurrency.

5.1 Processing Triggers and Coins

Each processor x maintains two FIFO queues: a default queue and a priority queue. All triggers and coin messages received by a processor are placed in the default queue. The priority queue contains only the messages related to the end-of-round procedure, which are handled on a priority basis. In the main event handling loop, a processor repeatedly checks for messages in the queues. It first examines the priority queue and handles the first message in that queue, if any. If there is no message there, it examines the default queue and handles the first message in that queue (if any).

Every processor x also maintains a counter D(x) that counts the triggers directly received and processed by x since the beginning of the algorithm. Triggers received by x that are still in the default queue (not yet processed) are not accounted for in D(x). The counter D(x) is incremented every time the processor processes a trigger from the default queue. This counter is never reset. It is maintained in addition to the counter C(x) (which gets reset at the beginning of each round). Every processor x maintains another variable, RoundNum, that indicates the current round number for this processor. Whenever x sends a coin to some other processor, it includes its RoundNum in the message. The processing of triggers and coins is done as before (as in Sect. 3).
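The event-handling discipline just described can be sketched as follows. This is an illustrative, single-processor skeleton (the class and the message formats are our own; in particular, we assume a coin message carries the trigger weight it represents), not the paper's full algorithm:

```python
from collections import deque

class Processor:
    """Sketch of the per-processor message loop: a priority queue for
    end-of-round messages and a default queue for triggers and coins."""

    def __init__(self):
        self.default_q = deque()    # triggers and coin messages
        self.priority_q = deque()   # end-of-round messages (RoundReset, ...)
        self.C = 0                  # per-round counter, reset each round
        self.D = 0                  # triggers processed since the start; never reset
        self.round_num = 0          # RoundNum, attached to every coin sent
        self.suspended = False      # set while the end-of-round procedure runs

    def step(self):
        """Handle one message; the priority queue is always served first."""
        if self.priority_q:
            self.handle_end_of_round(self.priority_q.popleft())
        elif self.default_q and not self.suspended:
            kind, *payload = self.default_q.popleft()
            if kind == "trigger":
                self.D += 1         # counted only when processed, not when queued
                self.C += 1
            elif kind == "coin":
                round_num, weight = payload
                if round_num == self.round_num:
                    self.C += weight   # coins from older rounds are discarded

    def handle_end_of_round(self, msg):
        pass  # RoundReset / Reduce / Inform / InformAck handling (Sect. 5.2)
```

Note how a trigger sitting in `default_q` does not affect D(x) until `step` actually processes it, and how a coin stamped with an old RoundNum is silently dropped, matching the rules above.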

5.2 End-of-Round Procedure

Here, we describe the end-of-round procedure in detail, highlighting the modifications. The procedure consists of four phases. The processors are arranged in the form of a binary tree as before. In the first phase, the root processor broadcasts a RoundReset message down the tree to all nodes, requesting them to send their D(x) counts. In the second phase, these counts are reduced at the root using Reduce messages; the root computes the sum of D(x) over all the processors. Note that, unlike the algorithm described in Sect. 3, here the root computes the sum of the D(x) counters, rather than the sum of the C(x) counters. We shall see that this is useful in proving correctness. Using the sum of the D(x) counters, the root computes the initial value ŵ for the next round. In the third phase, the root broadcasts this value ŵ to all nodes using Inform messages. In the fourth phase, each processor sends an acknowledgement InformAck back to the root and enters the next round. We now describe the four phases in detail.

First Phase: In this phase, the root processor initiates the broadcast of a RoundReset message by sending it down to its children. A processor x, on receiving a RoundReset message, does the following:
– It suspends processing of the default queue until the end-of-round processing is completed. Thus, all new triggers are queued up without being processed. This ensures that the D(x) value is not modified while the end-of-round procedure is in progress.
– If x is not a leaf processor, it forwards the RoundReset message to its children; if it is a leaf processor, it initiates the second phase as described below.

Second Phase: In this phase, the D(x) values are sum-reduced at the root from all the processors. The second phase starts when a leaf processor receives a RoundReset message, in response to which it initiates a Reduce message containing its D(x) value and passes it to its parent.
When a non-leaf processor has received Reduce messages from all its children, it adds the values in these messages to its own D(x) and sends a Reduce message to its parent with this sum. Thus, the root collects the sum of D(x) over all the processors. This sum w′ is the total number of triggers received by the system so far. Subtracting w′ from w, the root computes the initial value ŵ for the next round. If ŵ = 0, the root raises an alert and terminates the algorithm. Otherwise, the root initiates the third phase.

Third Phase: In this phase, the root processor broadcasts the new ŵ value by sending an Inform message to its children. A processor x, on receiving the Inform message, performs the following:
– It computes the threshold τ(ℓ) value for the new round, where ℓ is the layer number of x.
– If x is a non-leaf processor, it forwards the Inform message to its children; if x is a leaf processor, it initiates the fourth phase as described below.


Fourth Phase: In this phase, the processors send an acknowledgement up to the root and enter the new round. The fourth phase starts when a leaf processor x receives an Inform message. After performing the processing for the Inform message, it performs the following actions:
– It increments RoundNum. This signifies that the processor has entered the next round. After this point, the processor does not process any coins from the previous rounds; whenever it receives a coin generated in a previous round, it simply discards the coin.
– C(x) is reset to zero.
– It sends an InformAck to its parent.
– The processor x resumes processing of the default queue. This way, x will start processing the outstanding triggers (if any).

When a non-leaf node receives InformAck messages from all its children, it performs the same processing as above. When the root processor has received InformAck messages from all its children, the system enters the new round. We note that it is possible to implement the end-of-round procedure using three phases. However, the fourth phase (of sending acknowledgements) ensures that at any point of time, the processors can only be in two different (consecutive) rounds. Moreover, when the root receives the InformAck messages from all its children, all the processors in the system are in the same round. Thus, end-of-round processing for different rounds cannot be in progress simultaneously.
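A compact, centralized simulation of the four phases on a static tree may help to fix the information flow. The code below is our sketch (sequential recursion in place of actual message passing), and it compresses the Inform broadcast and the InformAck collection into one traversal:

```python
class Node:
    """Processor in the binary reduction tree."""
    def __init__(self, nid, children=()):
        self.nid = nid
        self.children = list(children)
        self.D = 0            # triggers processed since the start (never reset)
        self.C = 0            # per-round counter
        self.round_num = 0
        self.suspended = False

def end_of_round(root, w):
    """Simulate one end-of-round pass: RoundReset broadcast, sum-reduction
    of the D(x) counters, Inform broadcast of the new w_hat, and the round
    transition.  Returns the new w_hat; 0 means the alert is raised."""

    def phase1_reset(node):                 # Phase 1: suspend trigger processing
        node.suspended = True
        for c in node.children:
            phase1_reset(c)

    def phase2_reduce(node):                # Phase 2: sum-reduce D(x) to the root
        return node.D + sum(phase2_reduce(c) for c in node.children)

    def phase34_inform(node):               # Phases 3+4: enter the next round
        node.round_num += 1
        node.C = 0
        node.suspended = False
        for c in node.children:
            phase34_inform(c)

    phase1_reset(root)
    w_prime = phase2_reduce(root)           # total triggers processed so far
    w_hat = w - w_prime                     # initial value for the next round
    if w_hat > 0:
        phase34_inform(root)
    return w_hat                            # w_hat == 0: raise the alert
```

Because phase 1 suspends every default queue before phase 2 reads the D(x) counters, the reduced sum is a consistent snapshot, which is exactly the property the correctness argument of Sect. 5.3 relies on.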

5.3 Correctness of Algorithm

We now show that the system correctly raises an alert to the user when all the w triggers are delivered. The main part of the proof involves showing that after starting a new round, the root always enters the end-of-round procedure. Furthermore, we also show that the system does not incorrectly raise an alert to the user before w triggers are delivered.

We say that a trigger is unprocessed if the trigger has been delivered to a processor and is waiting in its default queue. A processor is said to be in round k if its RoundNum equals k. A trigger is said to be processed in round k if the processor that received this trigger is in round k when it processes the trigger. Consider the point in time t when the system has entered a new round k. Let ŵ be the initial value of the round. Recall that in the second phase, the root computes w′ = Σ_{x∈P} D(x) and sets ŵ = w − w′, where P is the set of all processors. Notice that in the first phase, all processors suspend processing triggers from the default queue. The trigger processing is resumed only in the fourth phase, after the RoundNum is incremented. Therefore, no more triggers are processed in round k − 1. It follows that w′ is the total number of triggers that have been processed in the (previous) rounds k′ ≤ k − 1. Thus, any trigger processed in round k will be accounted for in the counter C(x) of some processor x. This observation leads to the following argument.

We now show that the root initiates the end-of-round procedure upon receiving at most ŵ triggers. Suppose all the ŵ triggers have been delivered and


processed in this round. Furthermore, assume that all the coins generated and sent in the above process have also been received and processed. Clearly, such a state will be reached at some point in time, since we assume a reliable communication network. At this point of time, we have Σ_{x∈P} C(x) = ŵ. At any point of time after t, we have Σ_{x∈P−{r}} C(x) ≤ ŵ/4, where P is the set of all processors and r is the root processor. The claim is proved using the same arguments as in Sect. 4.1 and the fact that the processors discard coins generated in previous rounds. From the above relations, we get that C(r) ≥ 3ŵ/4 ≥ ŵ/2. The root initiates the end-of-round procedure whenever C(r) crosses ŵ/2. Thus, the root will eventually start the end-of-round procedure; hence the system never gets stalled in the middle of a round. Clearly, the system raises an alert on receiving w triggers. We now argue that the system does not raise an alert before receiving w triggers. This follows from the fact that ŵ for a new round is calculated on the basis of the D(x) counters. The analyses of the message complexity and the MaxRcvLoad are unaffected.

6 Conclusions

We have presented a randomized algorithm for the DTC problem which reduces the MaxRcvLoad of any node from O(n log w) to O(log n log w) with high probability. The ultimate goal of this line of work would be to design a deterministic algorithm with MaxRcvLoad O(log w).

References

1. Awerbuch, B.: Complexity of network synchronization. J. ACM 32(4), 804–823 (1985)
2. Chang, E.: Echo algorithms: Depth parallel operations on general graphs. IEEE Trans. Software Eng. 8(4), 391–401 (1982)
3. Cormode, G., Muthukrishnan, S., Yi, K.: Algorithms for distributed functional monitoring. In: SODA (2008)
4. Garg, R., Garg, V.K., Sabharwal, Y.: Scalable algorithms for global snapshots in distributed systems. In: 20th Int. Conf. on Supercomputing, ICS (2006)
5. Garg, V., Ghosh, J.: Repeated computation of global functions in a distributed environment. IEEE Trans. Parallel Distrib. Syst. 5(8), 823–834 (1994)
6. Huang, L., Garofalakis, M., Joseph, A., Taft, N.: Communication-efficient tracking of distributed cumulative triggers. In: ICDCS (2007)
7. Keralapura, R., Cormode, G., Ramamirtham, J.: Communication-efficient distributed monitoring of thresholded counts. In: SIGMOD Conference (2006)
8. Mitzenmacher, M., Upfal, E.: Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge Univ. Press, Cambridge (2005)
9. Segall, A.: Distributed network protocols. IEEE Transactions on Information Theory 29(1), 23–34 (1983)
10. Tel, G.: Distributed infimum approximation. In: Lupanov, O.B., Bukharajev, R.G., Budach, L. (eds.) FCT 1987. LNCS, vol. 278, pp. 440–447. Springer, Heidelberg (1987)

Deterministic Dominating Set Construction in Networks with Bounded Degree

Roy Friedman and Alex Kogan

Department of Computer Science, Technion, Israel
{roy,sakogan}@cs.technion.ac.il

Abstract. This paper considers the problem of calculating dominating sets in networks with bounded degree. In these networks, the maximal degree of any node is bounded by Δ, which is usually significantly smaller than n, the total number of nodes in the system. Such networks arise in various settings of wireless and peer-to-peer communication. A trivial approach of choosing all nodes into the dominating set yields an algorithm with the approximation ratio of Δ + 1. We show that any deterministic algorithm with a non-trivial approximation ratio requires Ω(log* n) rounds, meaning effectively that no o(Δ)-approximation deterministic algorithm with a running time independent of the size of the system may ever exist. On the positive side, we show two deterministic algorithms that achieve log Δ- and 2 log Δ-approximation in O(Δ^3 + log* n) and O(Δ^2 log Δ + log* n) time, respectively. These algorithms rely on coloring rather than node IDs to break symmetry.

1 Introduction

The dominating set problem is a fundamental problem in graph theory. Given a graph G, a dominating set of the graph is a set of nodes such that every node in G is either in the set or has a direct neighbor in the set. This problem, along with its variations, such as the connected dominating set or the k-dominating set, plays a significant role in many distributed applications, especially in those running over networks that lack any predefined infrastructure. Examples include mobile ad-hoc networks (MANETs), wireless sensor networks (WSNs), peer-to-peer networks, etc. The main application of dominating sets in such networks is to provide a virtual infrastructure, or overlay, in order to achieve scalability and efficiency. Such overlays are mainly used to improve routing schemes, where only nodes in the set are responsible for routing messages in the network (e.g., [29, 30]). Other applications of dominating sets include efficient power management [11, 30] and clustering [3, 14].

In many cases, the network graph is such that each node has a limited number of direct neighbors. Such a limitation may result from several reasons. First,

This work is partially supported by the Israeli Science Foundation grant 1247/09 and by the Technion Hasso Plattner Center.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 65–76, 2011.
© Springer-Verlag Berlin Heidelberg 2011

66

R. Friedman and A. Kogan

it can represent a hardware limitation, such as a bounded number of communication ports in a device [8]. Second, it can be an outcome of an inherent communication protocol limitation, as in the case of BlueTooth networks composed of units, called piconets, that include at most eight devices [10]. Finally, performance considerations, such as space complexity and network scalability, may limit the number of nodes with which each node may communicate directly. This is a common case for structured peer-to-peer networks, where each node selects a constant number of neighbors when it joins the network [17, 25].

The problem of finding a dominating set that has a minimal number of nodes is known to be NP-complete [12], and, in fact, it is also hard to approximate [9]. Although the approximation ratio of existing solutions for the dominating set problem, O(log Δ), was found to be the best possible (to within a lower-order additive factor, unless NP has an n^{O(log log n)}-time deterministic algorithm [9]), the gap between lower and upper bounds on the running time of distributed deterministic solutions remains wide. Kuhn et al. [19] showed that any distributed approximation algorithm for the dominating set problem with a polylogarithmic approximation ratio requires at least Ω(√(log n / log log n)) communication rounds. Along with that, the existing distributed deterministic algorithms incur a linear (in the number of nodes) running time [7, 23, 29]. This worst-case upper bound remains valid even when the graphs of interest are restricted to the bounded degree case, like the ones described above.

The deterministic approximation algorithms [7, 23, 29] are based on the centralized algorithm of Guha and Khuller [13], which in turn is based on a greedy heuristic for the related set-cover problem [5]. Following the heuristic, these algorithms start with an empty dominating set and proceed as follows. Each node calculates its span, the number of uncovered neighbors, including the node itself. (A node is uncovered if it is not in the dominating set and does not have any neighbor in the set.) Then it exchanges the span with all nodes within a distance of 2 hops and decides whether to select itself into the dominating set based on its span and the spans of the nodes within distance 2. These iterations are repeated by a node v as long as v or at least one of its neighbors is uncovered. The decision whether to join the dominating set in the above iterative process is taken based on the lexicographic order of the pair ⟨span, ID⟩ [7, 23, 29]. The use of IDs to break ties leads to long dependency chains, where a node cannot join the set because of another node having a higher ID. This, in turn, leads to a time complexity that is linear in the number of nodes. To see that, consider a ring where nodes have IDs starting from 1 and increasing clockwise. At the first iteration, only the node with the highest ID = n will join the set. At the second iteration, only the node with ID = n − 3 will join the set, since it has 3 uncovered neighbors (including itself), while nodes n − 2 and n − 1 have only 2 and 1, respectively. At the third iteration, the node with ID = n − 6 will join, and so on. Thus, such an approach requires roughly n/3 phases.

In this paper, we employ coloring to reduce the length of such dependency chains. Our approach is two-phased: we first run a coloring algorithm that assigns each node a color, which is different from the color of any other node within

Deterministic Dominating Set Construction

67

distance 2. Then, we run the same iterative process described above, while using colors instead of IDs to break ties between nodes with equal span, shortening the length of the maximal chain. This approach results in a distributed deterministic algorithm with an approximation ratio of log Δ (or, more precisely, log Δ + O(1)) and a running time of O(Δ^3 + log* n). Notice, though, that the coloring required by our algorithm can be precomputed for other purposes, e.g., time-slot scheduling for wireless channel access [15, 28]. When the coloring is given, the running time of the algorithm becomes O(Δ^3), independent of the size of the system. We also describe a modification to our algorithm that reduces its running time to O(Δ^2 log Δ + log* n) (O(Δ^2 log Δ) in case the coloring is already given), while the approximation ratio is increased by a constant factor.

An essential question that arises in the context of bounded degree networks is whether it is possible to construct a local approximation algorithm, i.e., an algorithm with a running time that depends solely on the degree bound. As already stated above, in the general case, Kuhn et al. [19] provide a negative answer and state that at least Ω(√(log n / log log n)) communication rounds are needed. Along with that, in several other related communication models, such as the unit disc graph, local approximation algorithms are known to exist [6]. In this paper, we show that any deterministic algorithm with a non-trivial approximation ratio requires at least Ω(log* n) rounds, thus answering the question stated above negatively. In light of this lower bound, our modified algorithm leaves an additive gap of O(Δ^2 log Δ).
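The ring example from Sect. 1 is easy to reproduce in simulation. The sketch below is our own illustrative code (not from the paper): it runs the span/ID greedy on a ring and confirms the roughly n/3 iterations caused by ID-based tie-breaking:

```python
def greedy_ds_ring(n):
    """Span/ID greedy on a ring of n nodes; IDs 1..n increase clockwise.
    Returns (number of iterations, size of the resulting dominating set)."""
    in_set = [False] * n

    def covered(v):  # v is covered if it or a direct neighbor is in the set
        return in_set[v] or in_set[(v - 1) % n] or in_set[(v + 1) % n]

    rounds = 0
    while not all(covered(v) for v in range(n)):
        rounds += 1
        span = [sum(not covered(u) for u in ((v - 1) % n, v, (v + 1) % n))
                for v in range(n)]
        # A node joins iff its (span, ID) pair beats every node within 2 hops.
        joiners = [v for v in range(n)
                   if span[v] > 0 and all(
                       (span[v], v + 1) > (span[(v + d) % n], (v + d) % n + 1)
                       for d in (-2, -1, 1, 2))]
        for v in joiners:
            in_set[v] = True
    return rounds, sum(in_set)

# Exactly one node joins per iteration, so the running time is linear in n.
assert greedy_ds_ring(12) == (4, 4)
assert greedy_ds_ring(30) == (10, 10)
```

Replacing the ID in the comparison key by a 2-distance color is precisely the modification our algorithm makes; with a color pattern that repeats every three nodes, all ties break simultaneously and the chain of deferrals disappears.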

2 Related Work

Due to its importance, the dominating set problem has been considered in various networking models. For general graphs, the best distributed deterministic O(log Δ)-approximation algorithms have linear running time [7, 23, 29]. In fact, these algorithms perform no better than a trivial approach in which each node collects a global view of the network by exchanging messages with its neighbors and then locally calculates a dominating set approximation by running, e.g., the centralized algorithm of Guha and Khuller [13]. The only lower bound known for general graphs is due to Kuhn et al. [19], which states that at least Ω(√(log n / log log n)) communication rounds are needed to find a constant or polylogarithmic approximation.¹ Their proof relies on a construction of a special family of graphs in which the maximal node degree depends on the size of the graph. Thus, this construction cannot be realized in the bounded degree model.

Another body of work considers unit-disk graphs (UDG), which are claimed to model the communication in wireless ad-hoc networks. Although the dominating set problem remains NP-hard in this model, approximation algorithms with a constant ratio are known (e.g., [6, 20]). Recently, Lenzen and Wattenhofer [22] showed that any f-approximation algorithm for the dominating set problem in the UDG model runs in g(n) time, where f(n)g(n) ∈ Ω(log* n). In contrast, we consider a different model of graphs with bounded degree nodes, in which Δ is not a constant number, but rather an independent parameter of the problem. This enables us to obtain a more refined lower bound. Specifically, we show that while obtaining an O(Δ)-approximation of the optimal dominating set in our model is possible even without any communication, any o(Δ)-approximation algorithm requires Ω(log* n) time. Although our proof employs a similar (ring) graph, which can be realized also in the UDG model, the formalism we use allows us to obtain our lower bound in a shorter and more straightforward way.

¹ This work assumes unbounded local computations.

Table 1. Comparison of results on distributed deterministic O(log Δ)-approximation of optimal dominating sets

Model          | Lower bound                    | Running time of algorithms
General        | Ω(√(log n / log log n)) [19]   | O(n) [7, 23, 29]
Bounded degree | Ω(log* n) (this paper)         | O(log* n + Δ^3), O(log* n + Δ^2 log Δ) (this paper)

The dominating set problem in bounded degree networks was considered by Chlebik and Chlebikova [4], who derive explicit lower bounds on the approximation ratios of centralized solutions. While we are not aware of any previous work on distributed approximation of dominating sets in bounded degree networks, several related problems have been considered in this setting. Very recently, Åstrand, Suomela, et al. provided distributed deterministic approximation algorithms for a series of such problems, e.g., vertex cover [1, 2] and set cover [2]. Panconesi and Rizzi considered maximal matchings and various colorings [26]. It is also worth mentioning several randomized approaches that have been proposed for the general graph model and which can also be applied in the setting of networks with bounded degree. For instance, Jia et al. [16] propose an algorithm with O(log n log Δ) running time, while Kuhn et al. [21] achieve an even better O(log² Δ) running time. These solutions, however, provide only probabilistic guarantees on the running time and/or approximation ratio (for example, the former achieves an approximation ratio of O(log Δ) in expectation and O(log n) with high probability), while our approach deterministically achieves an approximation ratio of log Δ. The results of previous work, along with the contributions of this paper, are summarized in Table 1.

3 Model and Preliminaries

We model the network as a graph G = (V, E). The number of nodes is n and the degree of any node in the graph is limited by a global parameter Δ. We assume that both n and Δ are known to any node in the system. Also, we assume that each node has a unique identiﬁer of size O(log n). In fact, both assumptions are required only by the coloring procedure we use as a subroutine [18]. Our lower bound does not require the latter assumption and, in particular, holds for anonymous networks as well.

Deterministic Dominating Set Construction

69

Fig. 1. A (partial) 2-ring graph R(n, 2)

Fig. 2. A subgraph G′ of R(n, 2): f(n) nodes between v_i and v_j, padded by k · o(log* n) nodes on each side

Our model of computation is a synchronous, message-passing system (denoted as LOCAL in [27]) with reliable processes and reliable links. In particular, time is divided into rounds and in every round, a node may send one message of an arbitrary size to each of its direct neighbors in G, receive all messages sent to it by its direct neighbors in the same round, and perform some local computation. Consequently, for any given pair of nodes v and u at a distance of k edges in G, a message sent by v in round i may reach u not before round i + k − 1. All nodes start the computation at the same round. The time complexity of the algorithms presented below is the number of rounds from the start until the last node ceases to send messages.

Let N_k(v, G) denote the k-neighborhood of a node v in a graph G, that is, N_k(v, G) is the set of all nodes (not including v itself) which are at most k hops from v in G. In the following definitions, all node indices are taken modulo n.

Definition 1. A ring graph R(n) = (V_n, E_n) is a circle graph consisting of n nodes, where V_n = {v_1, v_2, ..., v_n} and E_n = {(v_i, v_{i+1}) | 1 ≤ i ≤ n}. A k-ring graph R(n, k) = (V_n, E_n^k) is an extension of the ring graph, where V_n = {v_1, v_2, ..., v_n} and E_n^k = {(v_i, u) | u ∈ N_k(v_i, R(n)) ∧ 1 ≤ i ≤ n}.

Notice that in R(n, k) each node v has exactly 2k edges, one to each of its neighbors in N_k(v, R(n)) (see Fig. 1). Given R(n, k) and two nodes v_i, v_j ∈ V_n, i ≤ j, let Sub(R(n, k), v_i, v_j) be the subgraph (V, E) where V = {v_k ∈ V_n | i ≤ k ≤ j}. Thus, assuming a clockwise ordering of nodes on the ring, Sub(R(n, k), v_i, v_j) contains the sequence of nodes between v_i and v_j in the clockwise direction. The nodes v_i and v_j are referred to as the boundary nodes of the sequence.

Definition 2. Suppose A is an algorithm operating on R(n, k) and assigning each node v_i ∈ V_n a value c(v_i) ∈ {0, 1}. Let r(v_i) = min_j {j ≤ i | ∀k, j ≤ k ≤ i : c(v_k) = c(v_i)}. Similarly, let l(v_i) = max_j {j ≥ i | ∀k, i ≤ k ≤ j : c(v_k) = c(v_i)}. Then Seq(v_i) = Sub(R(n, k), v_{r(v_i)}, v_{l(v_i)}) is the longest sequence of nodes


containing v_i and in which all nodes have the value c(v_i). We call v_{l(v_i)} the leftmost node in Seq(v_i), v_{l(v_i)−1} the second leftmost node, and so on.

4 Proof of Bounds

4.1 Lower Bound

The minimal dominating set of any bounded degree graph has a size of at least n/(Δ + 1). Thus, the simple approach of choosing all nodes of the graph into the dominating set gives a trivial (Δ + 1)-approximation of the optimal set. An essential question is whether a non-trivial approximation can be calculated deterministically in bounded degree graphs in an effective way, i.e., independent of the system size. The following theorem gives a negative answer to this question.

Theorem 1. Any distributed deterministic o(Δ)-approximation algorithm for the dominating set problem in a bounded degree graph requires Ω(log* n) time.

Proof. Assume, by way of contradiction, that there exists a deterministic algorithm A that finds an o(Δ)-approximation in o(log* n) time. Given a ring of n nodes, R(n), the following algorithm colors it with 3 colors, for any given k.
– Construct the k-ring graph R(n, k) and run A on it. For each node v_i ∈ V_n, denote the value c(v_i) as 1 if A selects v_i into the dominating set, and as 0 otherwise.
– Every node v_i ∈ V_n chooses its color according to whether or not v_i and some of its neighbors are chosen into the dominating set by A. Specifically, consider the sequence Seq(v_i) as defined in Def. 2.
  • If v_i is not in the set, the nodes in the sequence are colored with colors 2 and 1 interchangeably. That is, the leftmost node in the sequence chooses color 2, the second leftmost node chooses color 1, the third leftmost node chooses color 2, and so on.
  • If v_i is in the set, the nodes in the sequence are colored with colors 0 and 1 interchangeably. That is, the leftmost node in the sequence chooses color 0, the second leftmost node chooses color 1, the third leftmost node chooses color 0, and so on.
The produced coloring uses 3 colors and admits a straightforward distributed implementation.
Notice that the coloring is legal (i.e., no two adjacent nodes share the same color) inside sequences of nodes chosen and not chosen to the dominating set by A. Thus, the legality of the produced coloring should be veriﬁed in cases where the sequences end. Consider two neighboring nodes (in R(n)) v and u, where v is a left neighbor of u (i.e., v appears immediately after u in the ring when considering nodes in the clockwise direction). If v is in the set and u is not, then the color of u, being the leftmost in the sequence of nodes not in the set, is 2, while the color of v is 0 or 1. Similarly, if u is in the set and v is not, then the color of u, being the leftmost in the sequence of nodes in the set, is 0, while the color of v is 2 or 1. Thus, the produced coloring is legal.
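The two coloring rules are easy to express in code. The sketch below (a centralized illustration of our own, not the paper's distributed implementation) colors a ring from the 0/1 output of A and checks legality:

```python
def three_coloring(c):
    """Color a ring from the 0/1 output of A (c[i] == 1 iff node i is in
    the dominating set), following the two rules above.  The leftmost node
    of a maximal run is the one whose clockwise successor carries the other
    value; runs of 0s are colored 2,1,2,... and runs of 1s are colored
    0,1,0,... starting from the leftmost node."""
    n = len(c)
    color = [None] * n
    for i in range(n):
        if c[(i + 1) % n] != c[i]:          # i is the leftmost node of its run
            first = 2 if c[i] == 0 else 0
            j, k = i, 0
            while color[j] is None and c[j] == c[i]:
                color[j] = first if k % 2 == 0 else 1
                j, k = (j - 1) % n, k + 1   # walk toward the right end of the run
    return color

# Example: a dominating set on a 12-ring (every third node selected).
c = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
col = three_coloring(c)
n = len(c)
assert set(col) <= {0, 1, 2}
assert all(col[i] != col[(i + 1) % n] for i in range(n))  # coloring is legal
```

The final assertion is exactly the legality property argued above: inside a run, adjacent colors alternate, and at a boundary the leftmost node's color (2 or 0) cannot clash with any color of the neighboring run.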


The running time of the algorithm is g(n) ∈ o(log* n) rounds spent running A plus an additional number of rounds to decide on colors. The length of the longest sequence of nodes not in the dominating set cannot exceed 2k, since otherwise there would be a node that is not covered by any node in the selected dominating set. Thus, the implementation of the first rule for the coloring decision requires a constant number of rounds. In the following, we show that there exists k such that the length of the longest sequence chosen to the dominating set by A is o(log* n). Thus, for this k, nodes decide on their colors in o(log* n) time, and the running time of the algorithm to color a ring with 3 colors sums up to o(log* n), contradicting the famous lower bound of Linial [24].

We are left with the claim that for some k, the length of the longest sequence of nodes chosen to the dominating set by A is o(log* n). Suppose, by way of contradiction, that for any k there exists a function f(n) ∈ Ω(log* n) such that A produces a sequence of length f(n). Let v_i and v_j be the boundary nodes of such a sequence, i ≤ j, and construct a subgraph G′ = Sub(R(n, k), v_{i−k·g(n)}, v_{j+k·g(n)}). Notice that this subgraph contains the same f(n) nodes chosen by A into the dominating set plus 2k·g(n) additional nodes (see Fig. 2). Also note that a minimum dominating set in G′, Opt(G′), contains (f(n) + 2k·g(n)) / (2k + 1) nodes. When A is run on G′, the nodes in the original sequence of length f(n) cannot distinguish between the two graphs, i.e., R(n, k) and G′. This is because in our model, a node can collect information in o(log* n) rounds only from nodes at a distance of at most o(log* n) edges from it. Thus, being completely deterministic, A must select the same f(n) nodes (plus some additional nodes to ensure that all nodes in G′ are covered). Consequently, |A(G′)| ≥ f(n), where |A(G′)| denotes the size of the dominating set calculated by A for the graph G′.
On the other hand, A has an o(Δ)-approximation ratio; thus for any graph G, |A(G)| ≤ o(Δ) · |OPT(G)| + c, where c is some non-negative constant. For simplicity, we assume c = 0; the proof does not change much for c > 0. In the graph R(n, k) (and G′), Δ = 2k, thus there exist Δ′ and k s.t. 2 · o(Δ′) = 2 · o(2k) < 2k + 1. In addition, since f(n) ∈ Ω(log∗ n) and g(n) ∈ o(log∗ n), there exists n′ > k s.t. 2k · g(n′) < f(n′). Thus, for Δ′, k and n′, we get:

o(Δ′) · |OPT(G′)| = o(Δ′) · (1/(2k+1)) · (f(n′) + 2k · g(n′)) < o(Δ′) · (2/(2k+1)) · f(n′) < f(n′) ≤ |A(G′)|,

contradicting the fact that A has an o(Δ)-approximation ratio.

It follows immediately from the previous theorem that no local deterministic algorithm achieving an optimal O(log Δ)-approximation can exist.

Corollary 1. Any distributed deterministic O(log Δ)-approximation algorithm for the dominating set problem in a bounded degree graph requires Ω(log∗ n) time.

R. Friedman and A. Kogan

4.2 Upper Bound

First, we describe an algorithm that achieves a log Δ-approximation in O(Δ³ + log∗ n) time. Next, we show a modified version that runs in O(Δ² log Δ + log∗ n) time and achieves a 2 log Δ-approximation. We will use the following notion:

Definition 3. A k-distance coloring is an assignment of colors to nodes such that any two nodes within k hops of each other have distinct colors.

Our algorithm consists of two parts. The first part is a 2-distance coloring routine, implemented by means of a coloring algorithm provided by Kuhn [18]. Kuhn's distributed deterministic algorithm produces a 1-distance coloring for any input graph G using Δ + 1 colors in O(Δ + log∗ n) time. For our purpose, we run this algorithm on the graph G², created from G by (virtually) connecting each node with each of its neighbors at distance 2. This means that any message sent on such a virtual link is routed by an intermediate node to its target, increasing the running time of the algorithm by a constant factor.

The second part of the algorithm is the approximation routine, which is a simple application of the greedy heuristic described in Sect. 1, where colors obtained in the first phase are used to break ties instead of IDs. That is, nodes exchange their span and color with all neighbors at distance 2 and decide to join the set if their ⟨span, color⟩ pair is lexicographically higher than any of the received pairs.

The pseudo-code for the algorithm is given in Algorithm 1. It denotes the set of immediate neighbors of a node i by N1(i) and the set of neighbors of i at distance 2 by N2(i). Additionally, each node i uses the following local variables:

– color: array with the colors assigned to each node j ∈ N2(i) by the 2-distance coloring routine. Initially, all values are set to ⊥.
– state: array that holds the state of each node j ∈ N1(i). The state can be uncovered, covered or marked. Initially, all values are set to uncovered. The nodes chosen into the dominating set are those that finish the algorithm with their state set to marked.
– span: array with values for each node j ∈ N2(i); span[j] holds the number of nodes in N1(j) ∪ {j} that are not covered by any node already selected into the dominating set, as reported by j. Initially, all values are set to ⊥.
– done: boolean array that specifies for each node j ∈ N1(i) whether j has finished the algorithm. Initially, all values are set to false.

Theorem 2. The algorithm in Algorithm 1 computes a dominating set with an approximation ratio of log Δ in O(Δ³ + log∗ n) time.

Proof. We start by proving the bound on the running time of the algorithm. The 2-distance coloring routine requires O(Δ² + log∗ n) time. This is because the maximal degree of nodes in the graph G² is bounded by Δ(Δ − 1) and each round of the coloring algorithm of Kuhn [18] in G² can be simulated by at most 2 rounds in the given graph G.


Algorithm 1. Code for node i

 1: color[i] := calc-2-dist-coloring()        // use the coloring algorithm of [18]
 2: distribute-and-collect(color, 2)
 3: while state[j] = uncovered for any j ∈ N1(i) ∪ {i} do
 4:   span[i] := |{state[j] = uncovered | j ∈ N1(i) ∪ {i}}|
 5:   distribute-and-collect(span, 2)
 6:   if ⟨span[i], color[i]⟩ > max{⟨span[j], color[j]⟩ | j ∈ N2(i) ∧ span[j] ≠ ⊥} then
 7:     state[i] := marked
 8:   distribute-and-collect(state, 1)
 9:   if state[j] = marked for any j ∈ N1(i) then
10:     state[i] := covered
11:   distribute-and-collect(state, 1)
12: done
13: broadcast done to all neighbors

distribute-and-collect(array_i, radius):
14: foreach q in [1, 2, ..., radius] do
15:   broadcast array_i to all neighbors
16:   receive array_j from all j ∈ N1(i) s.t. done[j] = false
17:   foreach node l at distance q from i do
18:     if ∃j ∈ N1(i) s.t. done[j] = false ∧ node l at distance q − 1 from j then
19:       array_i[l] = array_j[l]
20:   done
21: done

when done is received from j:
22: done[j] = true
23: span[j] = ⊥

The maximal value that the span can be assigned is Δ + 1, while the number of colors produced by the coloring procedure is O(Δ²). Thus, the maximal number of distinct values of all ⟨span, color⟩ pairs is O(Δ³). In each iteration of the greedy heuristic (the while-do loop in Lines 3–12 of Algorithm 1), all nodes having a maximal value of the ⟨span, color⟩ pair join the set. Thus, after at most O(Δ³) iterations, all nodes are covered, while each iteration can be implemented in O(1) synchronous communication rounds. Summing over both phases produces the required bound on the running time. Note that if coloring is not required, the running time is independent of n.

For the approximation ratio proof, observe that the span of a node is influenced only by its neighbors at distance of at most 2 hops. Also, notice that the dominating set problem is easily reduced to the set-cover problem (by creating a set for each node along with all its neighbors [13]). Thus, the algorithm chooses essentially exactly the same nodes as the well-known centralized greedy heuristic for the set-cover problem [5], which picks sets based on the number of uncovered elements they contain. Thus, the approximation ratio of the algorithm follows directly from the analysis of that heuristic (for details, see [5]).
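As an illustration, the synchronous rounds of Algorithm 1 can be simulated centrally. The following Python sketch is our own (the helper names are ours, not the paper's); it mimics the rule that a node joins the set when its ⟨span, color⟩ pair is the lexicographic maximum within its 2-hop neighborhood, assuming a valid 2-distance coloring is given.

```python
def two_hop(adj, v):
    """Nodes within distance <= 2 of v, excluding v itself."""
    near = set(adj[v])
    for u in adj[v]:
        near.update(adj[u])
    near.discard(v)
    return near

def greedy_dominating_set(adj, color):
    """adj: {node: set(neighbors)}; color: a valid 2-distance coloring.
    Round-by-round simulation of the distributed greedy heuristic."""
    state = {v: "uncovered" for v in adj}
    while any(s == "uncovered" for s in state.values()):
        # span = number of uncovered nodes in the closed neighborhood
        span = {v: sum(1 for u in set(adj[v]) | {v}
                       if state[u] == "uncovered") for v in adj}
        for v in adj:
            if state[v] == "marked" or span[v] == 0:
                continue
            pairs = [(span[u], color[u]) for u in two_hop(adj, v)]
            if not pairs or (span[v], color[v]) > max(pairs):
                state[v] = "marked"
        for v in adj:
            if state[v] == "uncovered" and any(
                    state[u] == "marked" for u in set(adj[v]) | {v}):
                state[v] = "covered"
    return {v for v in adj if state[v] == "marked"}
```

On a 6-cycle with the identity coloring (which is 2-distance valid there), this sketch returns a 2-node dominating set; progress is guaranteed because the node with the globally maximal pair always marks itself.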


To reduce the running time of the algorithm (at the price of increasing the approximation ratio by a factor of 2), we modify the algorithm to work with an adjusted span for each node u. The adjusted span is the smallest power of 2 that is at least as large as the number of u's uncovered neighbors (including u itself). Thus, during the second phase of the algorithm, u exchanges its adjusted span and color with all nodes at distance 2 and decides to join the dominating set if its ⟨adjusted span, color⟩ pair is lexicographically higher than that of any node at distance 2. Note that one might adjust the span to a power of any other constant c > 1, slightly improving the approximation ratio but not the asymptotic running time.

Theorem 3. The modified algorithm computes a dominating set with an approximation ratio of 2 log Δ in O(Δ² log Δ + log∗ n) time.

Proof. The adjusted span can take at most log Δ distinct values, while the number of colors produced by the coloring procedure is O(Δ²). Thus, similarly to the proof of Theorem 2, we can infer that the running time is O(Δ² log Δ + log∗ n).

The factor 2 in the approximation ratio appears due to the span adjustment. To prove this claim, consider the centralized greedy heuristic for the set-cover problem [5] with the adjusted span modification. That is, the number of uncovered elements in a set S is replaced (adjusted) by the smallest power of 2 which is at least as large as this number, and at each step, the heuristic chooses a set that covers the largest adjusted number of uncovered elements. Following the observation in the proof of Theorem 2, establishing the approximation ratio of the centralized set-cover heuristic with the adjusted span modification also establishes the approximation ratio of the modified dominating set algorithm.
When the (modified or unmodified) greedy heuristic chooses a set S, suppose that it charges each element of S the price 1/i, where i is the number of uncovered elements in S. As a result, the total price paid by the heuristic is exactly the number of sets it chooses, while each element is charged only once. Consider a set S∗ = {e_k, e_{k−1}, ..., e_1} in the optimal set-cover solution S_opt, and assume without loss of generality that the greedy heuristic covers the elements of S∗ in the given order: e_k, e_{k−1}, ..., e_1. Consider the step at which the heuristic chooses a set that covers e_i. At the beginning of that step, at least i elements of S∗ are uncovered. Thus, if the heuristic were to choose the set S∗ at that step, it would pay a price of 1/i per element. Using the adjusted span modification, the heuristic might pay at that step at most twice the price per element covered, i.e., it pays for e_i at most 2/i. Consequently, the total price paid by the heuristic to cover all elements of S∗ is at most Σ_{1≤i≤k} 2/i = 2H_k, where H_k = Σ_{1≤i≤k} 1/i = log k + O(1) is the k-th harmonic number. Thus, since every element is in some set of S_opt, we get that in order to cover all elements, the modified greedy heuristic pays at most Σ_{S∈S_opt} 2H_m = 2H_m · Σ_{S∈S_opt} 1 = 2H_m · |S_opt|, where m is the size of the biggest set in S_opt. In the instance of the set-cover problem produced from a graph with bounded degree Δ, m = Δ + 1, which establishes the required approximation ratio.
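The adjusted-span modification can be checked in isolation on the centralized heuristic. Below is a minimal sketch (our own naming, not code from the paper) of the greedy set-cover heuristic with the power-of-2 rounding:

```python
def adjusted(x):
    """Smallest power of 2 that is >= x; 0 for x <= 0 (the 'adjusted span')."""
    return 0 if x <= 0 else 1 << (x - 1).bit_length()

def greedy_set_cover_adjusted(universe, sets):
    """Greedy set cover, but candidate sets are ranked by the *adjusted*
    number of uncovered elements, mirroring the modification in Theorem 3."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(sets, key=lambda s: adjusted(len(s & uncovered)))
        if not (best & uncovered):   # nothing left coverable: invalid input
            break
        chosen.append(best)
        uncovered -= best
    return chosen
```

Since rounding changes the uncovered count by at most a factor of 2, each step pays at most twice the exact-greedy price per element, which is precisely where the 2H_m bound in the proof comes from.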

5 Conclusions

In this paper, we examined distributed deterministic solutions for the dominating set problem, one of the most important problems in graph theory, in the scope of graphs with bounded node degree. Such graphs are useful for modeling networks in many realistic settings, such as various types of wireless and peer-to-peer networks. For these graphs, we showed that no purely local (i.e., independent of the number of nodes) deterministic algorithm that calculates a non-trivial approximation can exist. This lower bound is complemented by two approximation algorithms. The first algorithm finds a log Δ-approximation in O(Δ³ + log∗ n) time, while the second achieves a 2 log Δ-approximation in O(Δ² log Δ + log∗ n) time. These results compare favorably to previous deterministic algorithms with running time O(n). With regard to the lower bound, they leave an additive gap of O(Δ² log Δ) for further improvement. In the full version of this paper, we show a simple extension of our bounds to weighted bounded degree graphs.

Acknowledgments. We would like to thank Fabian Kuhn and Jukka Suomela for fruitful discussions on the subject, and the anonymous reviewers whose valuable comments helped to improve the presentation of this paper.

References

1. Åstrand, M., Floréen, P., Polishchuk, V., Rybicki, J., Suomela, J., Uitto, J.: A local 2-approximation algorithm for the vertex cover problem. In: Keidar, I. (ed.) DISC 2009. LNCS, vol. 5805, pp. 191–205. Springer, Heidelberg (2009)
2. Åstrand, M., Suomela, J.: Fast distributed approximation algorithms for vertex cover and set cover in anonymous networks. In: Proc. 22nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 294–302 (2010)
3. Chen, Y.P., Liestman, A.L.: Approximating minimum size weakly-connected dominating sets for clustering mobile ad hoc networks. In: Proc. ACM Int. Symp. on Mobile Ad Hoc Networking and Computing (MobiHoc), pp. 165–172 (2002)
4. Chlebik, M., Chlebikova, J.: Approximation hardness of dominating set problems in bounded degree graphs. Inf. Comput. 206(11) (2008)
5. Chvatal, V.: A greedy heuristic for the set-covering problem. Mathematics of Operations Research 4(3), 233–235 (1979)
6. Czyzowicz, J., Dobrev, S., Fevens, T., Gonzalez-Aguilar, H., Kranakis, E., Opatrny, J., Urrutia, J.: Local algorithms for dominating and connected dominating sets of unit disk graphs with location aware nodes. In: Laber, E.S., Bornstein, C., Nogueira, L.T., Faria, L. (eds.) LATIN 2008. LNCS, vol. 4957, pp. 158–169. Springer, Heidelberg (2008)
7. Das, B., Bharghavan, V.: Routing in ad-hoc networks using minimum connected dominating sets. In: Proc. IEEE Int. Conf. on Communications (ICC), pp. 376–380 (1997)
8. Dong, Q., Bejerano, Y.: Building robust nomadic wireless mesh networks using directional antennas. In: Proc. IEEE INFOCOM, pp. 1624–1632 (2008)


9. Feige, U.: A threshold of ln n for approximating set cover. Journal of the ACM 45, 314–318 (1998)
10. Ferro, E., Potorti, F.: Bluetooth and Wi-Fi wireless protocols: a survey and a comparison. IEEE Wireless Communications 12(1), 12–26 (2005)
11. Friedman, R., Kogan, A.: Efficient power utilization in multi-radio wireless ad hoc networks. In: Abdelzaher, T., Raynal, M., Santoro, N. (eds.) OPODIS 2009. LNCS, vol. 5923, pp. 159–173. Springer, Heidelberg (2009)
12. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co. Ltd., New York (1979)
13. Guha, S., Khuller, S.: Approximation algorithms for connected dominating sets. Algorithmica 20, 374–387 (1998)
14. Han, B., Jia, W.: Clustering wireless ad hoc networks with weakly connected dominating set. Journal of Parallel and Distributed Computing 67(6), 727–737 (2007)
15. Herman, T., Tixeuil, S.: A distributed TDMA slot assignment algorithm for wireless sensor networks. In: Nikoletseas, S.E., Rolim, J.D.P. (eds.) ALGOSENSORS 2004. LNCS, vol. 3121, pp. 45–58. Springer, Heidelberg (2004)
16. Jia, L., Rajaraman, R., Suel, T.: An efficient distributed algorithm for constructing small dominating sets. In: Proc. ACM Symp. on Principles of Distr. Comp. (PODC), pp. 33–42 (2001)
17. Kaashoek, M.F., Karger, D.R.: Koorde: A simple degree-optimal distributed hash table. In: Kaashoek, M.F., Stoica, I. (eds.) IPTPS 2003. LNCS, vol. 2735, pp. 98–107. Springer, Heidelberg (2003)
18. Kuhn, F.: Weak graph colorings: distributed algorithms and applications. In: Proc. Symp. on Parallelism in Algorithms and Architectures (SPAA), pp. 138–144 (2009)
19. Kuhn, F., Moscibroda, T., Wattenhofer, R.: What cannot be computed locally! In: Proc. ACM Symp. on Principles of Distr. Comp. (PODC), pp. 300–309 (2004)
20. Kuhn, F., Moscibroda, T., Wattenhofer, R.: On the locality of bounded growth. In: Proc. ACM Symp. on Principles of Distr. Comp. (PODC), pp. 60–68 (2005)
21. Kuhn, F., Moscibroda, T., Wattenhofer, R.: The price of being near-sighted. In: Proc. ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 980–989 (2006)
22. Lenzen, C., Wattenhofer, R.: Leveraging Linial's locality limit. In: Taubenfeld, G. (ed.) DISC 2008. LNCS, vol. 5218, pp. 394–407. Springer, Heidelberg (2008)
23. Liang, B., Haas, Z.J.: Virtual backbone generation and maintenance in ad hoc network mobility management. In: Proc. IEEE INFOCOM, pp. 1293–1302 (2000)
24. Linial, N.: Locality in distributed graph algorithms. SIAM Journal on Computing 21(1), 193–201 (1992)
25. Malkhi, D., Naor, M., Ratajczak, D.: Viceroy: a scalable and dynamic emulation of the butterfly. In: Proc. ACM Symp. on Principles of Distr. Comp. (PODC), pp. 183–192 (2002)
26. Panconesi, A., Rizzi, R.: Some simple distributed algorithms for sparse networks. Distributed Computing 14(2), 97–100 (2001)
27. Peleg, D.: Distributed Computing: A Locality-Sensitive Approach. SIAM, Philadelphia (2000)
28. Rhee, I., Warrier, A., Min, J., Xu, L.: DRAND: distributed randomized TDMA scheduling for wireless ad-hoc networks. In: Proc. 7th ACM Int. Symp. on Mobile Ad Hoc Networking and Computing (MobiHoc), pp. 190–201 (2006)
29. Sivakumar, R., Das, B., Bharghavan, V.: Spine routing in ad hoc networks. Cluster Computing 1(2), 237–248 (1998)
30. Wu, J., Dai, F., Gao, M., Stojmenovic, I.: On calculating power-aware connected dominating sets for efficient routing in ad hoc wireless networks. Journal of Communications and Networks, 59–70 (2002)

PathFinder: Efficient Lookups and Efficient Search in Peer-to-Peer Networks

Dirk Bradler¹, Lachezar Krumov¹, Max Mühlhäuser¹, and Jussi Kangasharju²

¹ TU Darmstadt, Germany
{bradler,krumov,max}@cs.tu-darmstadt.de
² University of Helsinki, Finland
[email protected]

Abstract. Peer-to-Peer networks are divided into two main classes: unstructured and structured. Overlays from the first class are better suited for exhaustive search, whereas those from the second class offer very efficient key-value lookups. In this paper we present a novel overlay, PathFinder, which combines the advantages of both classes within one single overlay for the first time. Our evaluation shows that PathFinder is comparable to or even better than existing peer-to-peer overlays in terms of lookup and complex-query performance, and scales to millions of nodes.

1 Introduction

Peer-to-peer overlay networks can be classified into unstructured and structured networks, depending on how they construct the overlay. In an unstructured network the peers are free to choose their overlay neighbors and what they offer to the network.¹ In order to discover whether a certain piece of information is available, a peer must somehow search through the overlay. There are several implementations of such search algorithms: the original Napster used a central index server, Kazaa relied on a hybrid network with supernodes, and the original Gnutella used decentralized flooding of queries [4]. The BubbleStorm network [5] is a fully decentralized network based on random graphs and is able to provide efficient exhaustive search. Structured networks, on the other hand, have strict rules about how the overlay is formed and where content should be placed within the network. Structured networks are also often called distributed hash tables (DHTs), and the research world has seen several examples of DHTs [3,7]. DHTs are very efficient for simple key-value lookups. Because objects are addressed by their unique names, searching in a DHT is hard to make more efficient [6]. However, wildcard searches and complex queries either impose extensive complexity and costs in terms of additional messages or are not supported at all. Given the attractive properties of both of these different network structures, it is natural to ask: Is it possible to combine these two properties in

¹ In this paper we focus on networks where peers store and share content, e.g., files, database items, etc.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 77–82, 2011. © Springer-Verlag Berlin Heidelberg 2011


one single network? Our answer to this question is PathFinder, a peer-to-peer overlay which combines an unstructured and a structured network in a single overlay. PathFinder is based on a random graph, which gives it a short average path length and a large number of alternative paths, yielding a fault-tolerant, highly robust and reliable overlay topology. Our main contribution is the efficient combination of exhaustive searching and key-value lookups in a single overlay.

The rest of this paper is organized as follows. In Section 2 we present an overview of PathFinder. Section 3 compares it to existing P2P overlays, and we conclude in Section 4. Due to space limitations, the reader is referred to [1] for technical aspects such as node join and leave, handling crashed nodes, and network size adaptation. An extensive evaluation of PathFinder under churn and attacks is also presented in [1].

2 PathFinder Design

In this section we present the system model and preliminaries of PathFinder. We also describe how the basic key-value lookup and exhaustive search work. For further basic operations, like node join/leave and handling crashed nodes, see [1].

2.1 Challenges

We designed PathFinder to be fully compliant with the concept of BubbleStorm [5], namely an overlay structure based on random graphs. We augment the basic random graph with a deterministic lookup mechanism (see Section 2.4) to add efficient lookups to the exhaustive search provided by BubbleStorm. The challenge, and one of the key contributions of this paper, is developing a deterministic mechanism for exploiting these short paths to implement DHT-like lookups.

2.2 System Model and Preliminaries

All processes in PathFinder benefit from the properties of its underlying random graph and the routing scheme built on top of it.

PathFinder construction principle. The basic idea of PathFinder is to build a robust network of virtual nodes on top of the physical peers. Routing among peers is carried out in the virtual network; the actual data transfer still takes place directly among the physical peers. PathFinder builds a random graph of virtual nodes and then distributes them among the actual peers. At least one virtual node is assigned to each peer.

From the routing point of view, the data in the network is stored on the virtual nodes. When a peer B is looking for a particular piece of information, it has to find a path from one of its virtual nodes to the virtual node containing the requested data. Then B directly contacts the underlying peer A which is responsible for the targeted virtual node, and retrieves the requested data directly from A. This process is described in detail in Section 2.4.


It is known that the degree sequence of a random graph is Poisson distributed. We need two pseudorandom number generators (PRNGs) which, initialized with the same ID, always produce the same deterministic sequence of numbers. Given a number c, the first generator returns Poisson-distributed numbers with mean value c. The second PRNG, given a node ID, produces a deterministic sequence of numbers which we use as the IDs of the neighbors of the given node.

The construction principle of PathFinder is as follows. First we fix a number c (see [1] on how to choose c according to the number of peers and how to adapt it once the network becomes too small/large). Then, for each virtual node we determine the number of neighbors with the first number generator. The actual node IDs to which the current virtual node should be connected are chosen with the second number generator, which is seeded with the ID of the virtual node. The process can be summarized in the following steps:

1. The underlying peer determines how many virtual nodes it should handle. See [1] for details.
2. For every virtual node handled by the peer:
   (a) The peer uses the Poisson number generator to determine the number of neighbors of the current virtual node.
   (b) The peer then draws as many pseudorandom numbers as the number drawn in the previous step.
   (c) The peer selects the virtual nodes with IDs matching those numbers as neighbors of its current virtual node.

This construction mechanism allows the peers to build a random graph out of their virtual nodes. It is of crucial importance that a peer only needs a PRNG to perform this operation; there is no need for network communication. Similarly, any peer can determine the neighbors of any virtual node by simply seeding the pseudorandom generator with the corresponding ID. Now we have both a random graph topology suited for exhaustive search and a mechanism for each node to compute the neighbor list of any other node, i.e., DHT-like behavior within PathFinder.

Routing table example of PathFinder. Figure 1 shows a small sample of PathFinder with the routing table of the peer with ID 11. The random graph has 5 virtual nodes (1 through 5) and there are 4 peers (with IDs 11 through 14). Peer 11 handles two virtual nodes (4 and 5) and all the other peers handle 1 virtual node each. The arrows between the virtual nodes show the directed neighbor links. Each peer keeps track of its own outgoing links as well as incoming links from other virtual nodes. A peer learns the incoming links when the other peers attempt to connect to it. Keeping track of the incoming links is, strictly speaking, not necessary, but makes key lookups much more efficient (see Section 2.4). The routing table of the peer marked as 11 therefore consists of all outgoing links from its virtual nodes 4 and 5 and the incoming link from virtual node 3.
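The locally computable neighbor lists can be sketched in a few lines of Python. This is our own illustration: the seed labels, the Poisson sampler, and the ID-space parameter are our assumptions, not PathFinder's actual implementation.

```python
import math
import random

def poisson(rng, lam):
    """Poisson sample with mean lam > 0 (Knuth's multiplication method)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def virtual_neighbors(node_id, c, id_space):
    """Deterministically compute the neighbor IDs of a virtual node.
    Any peer can evaluate this locally; no network communication needed."""
    deg = poisson(random.Random(f"deg-{node_id}"), c)   # Poisson out-degree
    nbr = random.Random(f"nbr-{node_id}")               # neighbor-ID stream
    return [nbr.randrange(id_space) for _ in range(deg)]
```

Every peer evaluating `virtual_neighbors(v, c, id_space)` obtains the same list for the same v, which is exactly what makes remote routing-table entries computable without exchanging messages.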


Fig. 1. A small example of PathFinder

Fig. 2. Key lookup with local expanding ring search from source and target

Fig. 3. Distribution of complete path length, 5000 key lookups with c = 20

2.3 Storing Objects

An object is stored on the virtual node (i.e., on the peer responsible for the virtual node) which matches the object's identifier. If the hash space is larger than the number of virtual nodes, then we map the object to the virtual node whose identifier matches the prefix of the object hash.

2.4 Key Lookup

Key lookup is the process by which a peer contacts another peer possessing a given piece of data. Using the structure of the network, the requesting peer traverses only one single, usually short, path from itself to the target peer. Key lookup is the main function of a DHT. In order to perform quick lookups, the average number of hops between peers, as well as its variance, needs to be kept small. We now show how PathFinder achieves efficient lookups and thus behaves as any other DHT.

Suppose that peer A wants to retrieve an object O. Peer A determines that the virtual node w is responsible for object O by using the hash function described above. Now A has to route in the virtual network from one of its virtual nodes to w and directly retrieve O from the peer responsible for w. Denote by V the set of virtual nodes managed by peer A. For each virtual node in V, A calculates the neighbors of those nodes. (Note that this calculation is already done, since these neighbors are the entries in peer A's routing table.) A checks if any of those neighbors is the virtual node w. If yes, A contacts the underlying peer to retrieve O. If none of peer A's virtual node neighbors is responsible for O, A calculates the neighbors of all of its neighbors, i.e., its second neighbors. Because the neighbors of each virtual node are pre-known (see Section 2.2), this is a simple local computation. Again, peer A checks if any of the newly calculated neighbors is responsible for O. If yes, peer A sends its request to the virtual node whose neighbor is responsible for O. If still no match is found, peer A expands its search by calculating the neighbors of the nodes from the previous step and checks again. The process continues until a match is found. A may have to calculate several neighbors, but a match is guaranteed.
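The expanding-ring computation described above amounts to a purely local breadth-first expansion over the deterministically computable neighbor lists. The following is a simplified sketch with our own names; the real routine also reuses the routing table and handles node failures.

```python
def lookup_path(start_ids, target, neighbors, max_depth=10):
    """Expand rings around the requesting peer's virtual nodes until the
    target virtual node appears, then return one virtual-node path to it.
    `neighbors(v)` returns the locally computable neighbor list of v."""
    parent = {v: None for v in start_ids}
    frontier = list(start_ids)
    for _ in range(max_depth):
        if target in parent:
            break
        nxt = []
        for v in frontier:          # grow the ring by one hop
            for u in neighbors(v):
                if u not in parent:
                    parent[u] = v
                    nxt.append(u)
        frontier = nxt
    if target not in parent:
        return None                 # not found within max_depth rings
    path, v = [], target
    while v is not None:            # walk parents back to the start
        path.append(v)
        v = parent[v]
    return path[::-1]
```

Peer A would then send its request along the returned virtual-node path and retrieve O directly from the peer hosting w.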


Because peer A is able to compute w's neighboring virtual nodes, A can expand the search rings locally from both the source and the target sides, which is called forward and backward chaining. In every step, the search depth of the source and target search rings is increased by one. In that way the number of rings around the source is divided between the source itself and the target, leading to an exponential decrease in the number of IDs that have to be computed. We generated various PathFinder networks from 10³ up to 10⁸ nodes with average degree 20. In all of them we performed 5000 arbitrary key lookups. It turned out that expanding rings of depth 3 or 4 (i.e., path lengths between 6 and 8) are sufficient for a successful key lookup, as shown in Figure 3.

2.5 Searching with Complex Queries

PathFinder supports searching with complex queries with tunable success rate almost identical to BubbleStorm [5]. In fact, since both PathFinder and BubbleStorm are based on random graphs, we implemented the search mechanism of BubbleStorm directly into PathFinder. In BubbleStorm both data and queries are sent to some number of nodes, where the exact number of messages depends on how we set the probability of ﬁnding the match. We use exactly the same algorithm in PathFinder for searching and the reader is referred to [5] for details.
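As a back-of-the-envelope illustration of this style of tuning (our own simplification, not the actual analysis of [5]): if both data and queries are replicated to random "bubbles" of s nodes out of n, two bubbles intersect with probability roughly 1 − exp(−s²/n), so the required bubble size grows only as the square root of the network size.

```python
import math

def bubble_size(n, p_success):
    """Bubble size s such that two random s-node bubbles among n nodes
    intersect with probability about p_success, using the approximation
    P(intersect) ~ 1 - exp(-s*s/n)."""
    return math.ceil(math.sqrt(n * math.log(1.0 / (1.0 - p_success))))
```

For n = 10⁶ and a 99% success target this gives bubbles of roughly 2,100 nodes, i.e., O(√n) replication per object and per query.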

3 Comparison and Analysis

Most DHT overlays provide the same functionality, since they all support the common interface for key-based routing. The main differences between various DHT implementations are the average lookup path length, resilience to failures, and load balancing. In this section we compare PathFinder to established DHTs.

The lookup path length of Chord is well studied: L_avg = log(N)/2. The maximum path length of Chord is log(N)/log(1+d). The average path length of PathFinder is log(N)/log(c), where c is the average number of neighbors. The path length of the Pastry model can be estimated by log_{2^b}(N) [3], where b is a tunable parameter. The Symphony overlay is based on a small-world graph, which leads to key lookups in O(log²(N)/k) hops [2]. The variable k refers only to long-distance links; the actual number of neighbors is in fact much higher [2]. The diameter of CAN is (1/2)·d·N^{1/d}, with a degree of 2d per node, for a fixed d. For large d the distribution of path lengths becomes Gaussian, as in Chord.

We use simulations to evaluate the practical effects of the individual factors. Figure 4 shows the results for a 20,000-node network: we perform 5,000 lookups among random pairs of nodes and measure the number of hops each DHT takes to find the object. Figure 5 displays the analytic hop counts as a function of network size. Note that the PathFinder results come from actual simulation, not analytical calculations. PathFinder also inherits the exhaustive search mechanism of BubbleStorm. Hence, as an unstructured overlay it performs identically to BubbleStorm, and the reader is referred to [5] for a thorough comparison to other unstructured systems.
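The analytic estimates above are easy to tabulate. The following sketch (our own, using the formulas as quoted in this section) compares the average hop counts:

```python
import math

def chord_avg(n):
    return math.log2(n) / 2            # log(N)/2

def pastry_avg(n, b=4):
    return math.log2(n) / b            # log_{2^b}(N)

def pathfinder_avg(n, c=20):
    return math.log(n) / math.log(c)   # log(N)/log(c)

def symphony_avg(n, k=1):
    return (math.log2(n) ** 2) / k     # O(log^2(N)/k), k long-distance links

def can_diameter(n, d=4):
    return 0.5 * d * n ** (1.0 / d)    # (1/2) * d * N^(1/d)
```

For N = 10⁶ this gives roughly 10 hops for Chord, 5 for Pastry (b = 4), and 4.6 for PathFinder (c = 20), matching the ordering visible in Fig. 5.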


Fig. 4. Average number of hops for 5,000 key lookups in different DHTs (Pastry, PathFinder (c=20), SkipNet, Chord, Symphony)

Fig. 5. Average number of hops for different DHTs (Chord, Pastry, PathFinder (c=20), PathFinder (c=50), De Bruijn) measured analytically. Numbers for PathFinder are simulated.

4 Conclusions

In this paper we have presented PathFinder, an overlay which combines efficient exhaustive search and efficient key-value lookups in the same overlay. Combining these two mechanisms in a single overlay is very desirable, since it allows an efficient and overhead-free implementation of natural usage patterns. PathFinder is the first overlay to combine exhaustive search and key-value lookups in an efficient manner. Our results show that PathFinder has performance comparable to or better than existing overlays. It scales easily to millions of nodes, and in large networks its key lookup performance is better than that of existing DHTs. Because PathFinder is based on a random graph, we can directly benefit from existing search mechanisms (BubbleStorm) for enabling efficient exhaustive search.

References

1. Bradler, D., Krumov, L., Kangasharju, J., Weihe, K., Mühlhäuser, M.: PathFinder: Efficient lookups and efficient search in peer-to-peer networks. Tech. Rep. TUD-CS-2010872, TU Darmstadt (October 2010)
2. Manku, G., Bawa, M., Raghavan, P.: Symphony: Distributed hashing in a small world. In: Proc. 4th USENIX Symposium on Internet Technologies and Systems (2003)
3. Rowstron, A., Druschel, P.: Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In: Liu, H. (ed.) Middleware 2001. LNCS, vol. 2218, p. 329. Springer, Heidelberg (2001)
4. Steinmetz, R., Wehrle, K. (eds.): Peer-to-Peer Systems and Applications. LNCS, vol. 3485. Springer, Heidelberg (2005)
5. Terpstra, W., Kangasharju, J., Leng, C., Buchmann, A.: BubbleStorm: Resilient, probabilistic, and exhaustive peer-to-peer search. In: Proc. SIGCOMM, pp. 49–60 (2007)
6. Yang, Y., Dunlap, R., Rexroad, M., Cooper, B.: Performance of full text search in structured and unstructured peer-to-peer systems. In: Proc. IEEE INFOCOM (2006)
7. Zhao, B., Kubiatowicz, J., Joseph, A.: Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Comp. 74 (2001)

Single-Version STMs Can Be Multi-version Permissive (Extended Abstract)

Hagit Attiya¹,² and Eshcar Hillel¹

¹ Department of Computer Science, Technion
² École Polytechnique Fédérale de Lausanne (EPFL)

Abstract. We present PermiSTM, a single-version STM that satisfies a practical notion of permissiveness, usually associated with keeping many versions: it never aborts read-only transactions, and it aborts other transactions only due to a conflicting transaction (which writes to a common item), thereby avoiding spurious aborts. It avoids unnecessary contention on the memory, being strictly disjoint-access parallel.

1 Introduction

Transactional memory is a leading paradigm for programming concurrent applications for multicores. It is seriously considered as part of software solutions (abbreviated STMs) and as a basis for novel hardware designs, which exploit the parallelism offered by contemporary multicores and multiprocessors. A transaction encapsulates a sequence of operations on a set of data items: it is guaranteed that if a transaction commits, then all its operations appear to be executed atomically. A transaction may abort, in which case none of its operations are executed. The data items written by the transaction are its write set, the data items read by the transaction are its read set, and together they are the transaction's data set.

When an executing transaction may violate consistency, the STM can forcibly abort it. Many existing STMs, however, sometimes spuriously abort a transaction, even when the transaction could in fact commit without compromising data consistency [9]. Frequent spurious aborts can waste system resources and significantly impair performance; in particular, they reduce the chances that long transactions, which often only read the data, will complete.

Avoiding spurious aborts has been an important goal for STM design, and several conditions have been proposed to evaluate how well it is achieved [8, 9, 12, 16, 20]. A permissive STM [9] never aborts a transaction unless necessary to ensure consistency. A stronger condition, called strong progressiveness [12], further ensures that even when there are conflicts, at least one of the transactions involved in the conflict is not aborted. Alternatively, multi-version (MV-)permissiveness [20] focuses on read-only transactions (whose write set is empty), and ensures they never abort; update transactions, with a non-empty write set, may abort when in conflict with other transactions writing to the same items.
As its name suggests, MV-permissiveness was meant to be provided by a multi-version STM, maintaining multiple versions of each data item.

This research is supported in part by the Israel Science Foundation (grant number 953/06).

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 83–94, 2011. © Springer-Verlag Berlin Heidelberg 2011


It has been suggested [20] that refraining from aborting read-only transactions mandates the overhead associated with maintaining multiple versions: additional storage, a complex implementation of a precedence graph (to track versions), as well as an intricate garbage collection mechanism, to remove old versions. Indeed, MV-permissiveness is satisfied by current multi-version STMs, both practical [20, 21] and more theoretical [16, 19], keeping many versions per item. It can be achieved by other multi-version STMs [22, 3], if enough versions of the items are maintained.

This paper shows it is possible to achieve MV-permissiveness while keeping only a single version of each data item. We present PermiSTM, a single-version STM that is both MV-permissive and strongly progressive, indicating that multiple versions are not the only design choice when seeking to reduce spurious aborts. By maintaining a single version, PermiSTM avoids the high space complexity associated with multi-version STMs, which is often unacceptable in practice. This also eliminates the need for intricate mechanisms for maintaining and garbage collecting old versions.

PermiSTM is lock-based, like many contemporary STMs, e.g., [6, 5, 7, 23]. For each data item, it maintains a single version, as well as a lock, and a read counter, counting the number of pending transactions that have read the item. Read-only transactions never abort (without having to declare them as such in advance); update transactions abort only if some data item in their read set is written by another transaction, i.e., at least one of the conflicting transactions commits. Although it is blocking, PermiSTM is deadlock-free, i.e., some transaction can always make progress. The design choices of PermiSTM offer several benefits, most notably:

– Simple lock-based design makes it easier to argue about correctness.
– Read counters avoid the overhead of incremental validation, thereby improving performance, as demonstrated in [6, 17], especially in read-dominated workloads. Read-only transactions do not require validation at all, while update transactions validate their read sets only once.
– Read counters circumvent the need for a central mechanism, like a global version clock. Thus, PermiSTM is strictly disjoint-access parallel [10], namely, processes executing transactions with disjoint data sets do not access the same base objects.

It has been proved [20, Theorem 2] that a weakly disjoint-access parallel STM [2, 14] cannot be MV-permissive. PermiSTM, satisfying the even stronger property of strict disjoint-access parallelism, shows that this impossibility result depends on a strong progress condition: a transaction delays only due to a pending operation (by another transaction). In PermiSTM, a transaction may delay due to another transaction reading from its write set, even if no operation of the reading transaction is pending.

2 Preliminaries

We briefly describe the transactional memory model [15]. A transaction is a sequence of operations executed by a single process. Each operation either accesses a data item or tries to commit or abort the transaction. Specifically, a read operation specifies the item to read, and returns the value read by the operation; a write operation specifies the item and value to be written; a try-commit operation returns an indication whether

boolean CAS(obj, exp, new) {
  // Atomically
  if obj = exp then
    obj ← new
    return TRUE
  return FALSE
}

boolean kCSS(o[1..k], e[1..k], new) {
  // Atomically
  if o[1] = e[1] and . . . and o[k] = e[k] then
    o[1] ← new
    return TRUE
  return FALSE
}

Fig. 1. The CAS and k-compare-single-swap primitives
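The semantics of these two primitives can be sketched in plain Python; a lock stands in for the hardware atomicity, and the class and function names below are ours, not from the paper:

```python
import threading

_kcss_lock = threading.Lock()  # models the atomicity of the whole kCSS step

class AtomicCell:
    """One memory word supporting READ and CAS."""
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def read(self):
        return self.value

    def cas(self, exp, new):
        # Atomically: if the cell holds exp, install new and report success.
        with self._lock:
            if self.value == exp:
                self.value = new
                return True
            return False

def k_css(cells, expected, new):
    # k-compare-single-swap: compare all k cells, swap only the first.
    # In practice kCSS is built in software from CAS (Luchangco et al. [18]);
    # here a global lock models the required atomicity.
    with _kcss_lock:
        if all(c.value == e for c, e in zip(cells, expected)):
            cells[0].value = new
            return True
        return False
```

Note that kCSS fails if any of the k compared cells differs from its expected value, even though only the first cell is ever written; this is exactly what lets a committing transaction observe all read counters while swapping only its own status.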

the transaction committed or aborted; an abort operation returns an indication that the transaction is aborted. While trying to commit, a transaction might be aborted, e.g., due to a conflict with another transaction.¹ A transaction is forcibly aborted if an invocation of a try-commit returns an indication that the transaction is aborted. Every transaction begins with a sequence of read and write operations. The last operation of a transaction is either an access operation, in which case the transaction is pending, or a try-commit or an abort operation, in which case the transaction is committed or aborted.

A software implementation of transactional memory (STM) provides a data representation for transactions and data items using base objects, and algorithms, specified as primitives on the base objects, which asynchronous processes follow in order to execute the operations of the transactions. An event is a computation step by a process consisting of local computation and the application of a primitive to base objects, followed by a change to the process's state, according to the results of the primitive. We employ the following primitives: READ(o) returns the value in base object o; WRITE(o, v) sets the value of base object o to v; CAS(o, exp, new) writes the value new to base object o if its value is equal to exp, and returns a success or failure indication; kCSS is similar to CAS, but compares the values of k independent base objects (see Figure 1).

2.1 STM Properties

We require the STM to be opaque [11]. Very roughly stated, opacity is similar to requiring strict view serializability applied to all transactions (including aborted ones). Restrictions on spurious aborts are stated by the following two conditions.

Definition 1. A multi-version (MV-)permissive STM [20] forcibly aborts a transaction only if it is an update transaction that has a conflict with another update transaction.

Definition 2. An STM is strongly progressive [12] if a transaction that has no conflicts cannot be forcibly aborted, and if a set of transactions have conflicts on a single item then not all of them are forcibly aborted.

These two properties are incomparable: strong progressiveness allows a read-only transaction to abort, if it has a conflict with an update transaction; on the other hand, MV-permissiveness does not guarantee that at least one transaction is not forcibly aborted in case of a conflict.

¹ Two transactions conflict if they access the same data item; the conflict is nontrivial if at least one of the operations is a write. In the rest of the paper all conflicts are nontrivial conflicts.


Finally, an STM is strictly disjoint-access parallel [10] if two processes, executing transactions T1 and T2 , access the same base object, at least one with a non-trivial primitive, only if the data sets of T1 and T2 intersect.

3 The Design of PermiSTM

The design of PermiSTM is very simple. The first and foremost goal is to ensure that a read-only transaction never aborts, while maintaining only a single version. This suggests that the data returned by a read operation issued by a read-only transaction T should not be overwritten until T completes. A natural way to achieve this goal is to associate a read counter with each item, tracking the number of pending transactions reading from the item. Transactions that write to the data items respect the read counters; an update transaction commits and updates the items in its write set only in a "quiescent" configuration, where no (other) pending transaction is reading an item in its write set. This yields read-only transactions that guarantee consistency without requiring validation and without specifying them as such in advance.

The second goal is to guarantee consistent updates of data items, by using ordinary locks to ensure that only one transaction is modifying a data item at each point. Thus, before writing its changes, at commit time, an update transaction acquires locks.

Having two different mechanisms, locks and counters, in our design requires care in combining them. One question is when, during the execution, a transaction decrements the read counters of the items in its read set. The following simple example demonstrates how a deadlock may happen if an update transaction does not decrement its counters before acquiring locks:

T1: read(a) write(b) try-commit
T2: read(b) write(a) try-commit

T1 and T2 incremented the read counters of a and b, respectively, and later, at commit time, T1 acquires a lock on b, while T2 acquires a lock on a. To commit, T1 has to wait for T2 to complete and decrement the read counter of b, while T2 has to wait for the same to happen with T1 and item a. Since an update transaction first decrements read counters, it must ensure consistency by acquiring locks also for items in its read set. Therefore, an update transaction acquires locks for all items in its data set. Finally, read counters are incremented as the items are encountered during the execution of the transaction.

What happens if read-only transactions wait for locks to be released? The next example demonstrates how this can create a deadlock:

T1: read(a) read(b)
T2: write(b) write(a) try-commit

If T2 acquires a lock on b, then T1 cannot read b until T2 completes; T2 cannot commit as it has to wait for T1 to complete and decrease the read counter of a; MV-permissiveness does not allow both transactions to be forcibly aborted. Thus, read counters get preference over locks, and they can always be incremented. Prior to committing, an update transaction first decrements its read counters, and then acquires locks on all items in its

data set, in a fixed order (while validating the consistency of its read set); this avoids deadlocks due to blocking cycles, and livelocks due to repeated aborts.

Since committing a transaction and committing its updates are not done atomically, a committed transaction that has not yet completed updating all the items in its write set can yield an inconsistent view for a transaction reading one of these items. If a read operation simply reads the value in the item, it might miss the up-to-date value of the item. Therefore, a read operation is required to read the current value of the item, which can be found either in the item, or in the data of the transaction.²

To simplify the exposition of PermiSTM, k-compare-single-swap (kCSS) [18] is applied to commit an update transaction while ensuring that the read counters of the items in its write set are all zero. Section 4 describes how the implementation can be modified to use only CAS; the resulting implementation is (strongly) disjoint-access parallel but is not strictly disjoint-access parallel.

Data Structures. Figure 2 presents the data structures of items and transactions' descriptors used in our algorithm.

Fig. 2. Data structures used in the algorithm: an item (left), with a lock ⟨owner, seq⟩, a read counter (rcounter) and a data field; and a transaction descriptor (right), with a write set (ws) and a read set (rs) whose entries are ⟨item, seq, data⟩ records, and a status field

We associate a lock and a read counter with each item, as follows:
– A lock includes an owner field and an unbounded sequence number, seq, that are accessed atomically. The owner field is set to the id of the update transaction owning the lock, and is 0 if no transaction holds the lock. The seq field holds the sequence number of the data; it is incremented whenever new data is committed to the item, and it is used to assert the consistency of reads.
– A simple read counter, rcounter, tracks how many transactions are reading the item.
– The data field holds the value that was last written to the item, or its initial value if no transaction has yet written to the item.

The descriptor of a transaction consists of the read set, rs, the write set, ws, and the status of the transaction.
The read and write sets are collections of data items.
– A data item in the read set includes a reference to an item, the data read from the item, and the sequence number of this data, seq.²

² This is analogous to the notion of current version of a transactional object in DSTM [13].


– A data item in the write set includes a reference to an item, the data to be written to the item, and the sequence number of the new data, seq, i.e., the sequence number of the current data plus 1.
– A status indicates if the transaction is COMMITTED or ABORTED; it is initially NULL.

The current data and sequence number of an item are defined as follows: if the lock of the item is owned by a committed transaction that writes to this item, then the current data and sequence number of the item appear in the write set of the owner transaction. Otherwise (the owner is 0, or the owner is not committed, or the item is not in the owner's write set), the current data and current sequence number appear in the item.

The Algorithm. Next we give a detailed description of the main methods for handling the operations; the code appears in Pseudocodes 1 and 2. The reserved word self in the pseudocode is a self-reference to the descriptor of the transaction whose code is being executed.

read method: If the item is already in the transaction's read set (line 2), return the value from the read set (line 3). Otherwise, increment the read counter of the item (line 5). Then, the reading transaction adds the item to its read set (line 7) with the current data and sequence number of the item (line 6).

write method: If the item is not already in the transaction's write set (line 11), then add the item to the write set (line 12). Set data of the item in the transaction's write set to the new data to be written (line 13). No lock is acquired at this stage.

tryCommit method: Decrement all the read counters of the items in the transaction's read set (line 16). If the transaction is read-only, i.e., the write set of the transaction is empty (line 17), then commit (line 18); the transaction completes and returns (line 19). Otherwise, this is an update transaction and it continues: acquire locks on all items in the data set (line 20); commit the transaction (line 22) and the changes to the items (lines 23-25); release locks on all items in the data set (line 26). The transaction may abort while acquiring locks due to a conflict with another update transaction (line 21).

acquireLocks method: Acquire locks on all items in the data set of the transaction, by their order (line 30). If the item is in the read set (line 33), check that the sequence number in the read set (line 34) is the same as the current sequence number of the item (line 32). If the sequence number has changed (line 35), then the data read was overwritten by another committed transaction, and the transaction aborts (line 36). Use CAS to acquire the lock: set owner from 0 to the descriptor of the transaction; if the item is in the read set, this is done while asserting that seq is unchanged (line 38). If the CAS fails, then owner is non-zero since there is another owner (or seq has changed), so spin, re-reading the lock (line 38), until owner is 0. If the item is in the write set (line 39), set the sequence number of the item in the transaction's write set, seq, to the sequence number of the current data plus 1 (line 41).

commitTx method: Use kCSS to set status to COMMITTED, while ensuring that all read counters of items in the transaction's write set are 0 (line 47). If the read counter of one of these items is not 0, a pending transaction is reading from this item; spin until all rcounters are 0.
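As a point of reference for the pseudocode, the data structures of Fig. 2 map naturally onto plain records. The following is a hypothetical Python rendering; field names follow the paper, while the Python types and defaults are our assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Lock:
    owner: object = 0   # descriptor of the owning update transaction, or 0
    seq: int = 0        # sequence number of the committed data

@dataclass
class Item:
    lock: Lock = field(default_factory=Lock)
    rcounter: int = 0   # number of pending transactions reading the item
    data: object = None # last committed value (or the initial value)

@dataclass
class DataItem:
    """An entry of a read set or a write set: <item, seq, data>."""
    item: Item
    seq: int
    data: object

@dataclass
class Descriptor:
    rs: dict = field(default_factory=dict)  # read set, keyed by item identity
    ws: dict = field(default_factory=dict)  # write set, keyed by item identity
    status: object = None                   # None / 'COMMITTED' / 'ABORTED'
```

Keying the sets by item identity (e.g., `id(item)`) mirrors the `rs.get(item)` / `ws.get(item)` lookups in the pseudocode.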


Pseudocode 1. Methods for read, write and try-commit operations

 1: Data read(Item item) {
 2:   if item in rs then
 3:     di ← rs.get(item)
 4:   else
 5:     incrementReadCounter(item)
 6:     di ← getAbsVal(item)
 7:     rs.add(item, di)
 8:   return di.data
 9: }

10: write(Item item, Data data) {
11:   if item not in ws then
12:     ws.add(item, ⟨item,0,0⟩)
13:   ws.set(item, ⟨item,0,data⟩)
14: }

15: tryCommit() {
16:   decrementReadCounters()          // decrement read counters
17:   if ws is empty then              // read-only transaction
18:     WRITE(status, COMMITTED)
19:     return
                                       // update transaction
20:   acquireLocks()                   // lock all the data set
21:   if ABORTED = READ(status) then return
22:   commitTx()                       // commit update transaction
23:   for each item in ws do           // commit the changes to the items
24:     di ← ws.get(item)
25:     WRITE(item.data, di.data)
26:   releaseLocks()                   // release locks on all the data set
27: }

28: acquireLocks() {
29:   ds ← ws.add(rs)                  // items in the data set (read and write sets)
30:   for each item in ds by their order do
31:     do
32:       cur ← getAbsVal(item)        // current value
33:       if item in rs then           // check validity of read set
34:         di ← rs.get(item)          // value read by the transaction
35:         if di.seq != cur.seq then  // the data is overwritten
36:           abort()
37:           return
38:     while !CAS(item.lock, ⟨0,cur.seq⟩, ⟨self,cur.seq⟩)
39:     if item in ws then
40:       di ← ws.get(item)
41:       ws.set(item, ⟨item,cur.seq+1,di.data⟩)
42: }

43: commitTx() {
44:   kCompare[0] ← status             // the location to be compared and swapped
45:   for i = 1 to k − 1 do            // k − 1 locations to be compared
46:     kCompare[i] ← ws.get(i).item.rcounter
47:   while !kCSS(kCompare, ⟨NULL,0...0⟩, COMMITTED) do
48:     no-op                          // until no reading transaction is pending
49: }


Pseudocode 2. Additional methods for PermiSTM

50: incrementReadCounter(Item item) {
51:   do m ← READ(item.rcounter)
52:   while !CAS(item.rcounter, m, m + 1)
53: }

54: decrementReadCounters() {
55:   for each item in rs do
56:     do m ← READ(item.rcounter)
57:     while !CAS(item.rcounter, m, m − 1)
58: }

59: releaseLocks() {
60:   ds ← ws.add(rs)
61:   for each item in ds do
62:     di ← ds.get(item)
63:     WRITE(item.lock, ⟨0,di.seq⟩)
64: }

65: DataItem getAbsVal(Item item) {
66:   lck ← READ(item.lock)
67:   dt ← READ(item.data)
68:   di ← ⟨item, lck.seq, dt⟩          // values from the item
69:   if lck.owner != 0 then
70:     sts ← READ(lck.owner.status)
71:     if sts = COMMITTED then
72:       if item in lck.owner.ws then
73:         di ← lck.owner.ws.get(item) // values from the write set of the owner
74:   return di
75: }

76: abort() {
77:   ds ← ws.add(rs)
78:   for each item in ds do
79:     lck ← READ(item.lock)
80:     if lck.owner = self then        // the transaction owns the item
81:       WRITE(item.lock, ⟨0,lck.seq⟩) // release lock
82:   WRITE(status, ABORTED)
83: }
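The counter manipulation in lines 50–58 is a standard CAS retry loop. A Python sketch (the `AtomicCell` wrapper with a `cas` method is our stand-in for a hardware CAS word, not part of the paper):

```python
import threading

class AtomicCell:
    """One memory word supporting READ and CAS (lock models atomicity)."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def read(self):
        return self.value

    def cas(self, exp, new):
        with self._lock:
            if self.value == exp:
                self.value = new
                return True
            return False

def increment_read_counter(rcounter):
    # Retry until our CAS wins: the classic lock-free increment
    # of lines 50-53.
    while True:
        m = rcounter.read()
        if rcounter.cas(m, m + 1):
            return

def decrement_read_counters(read_set_counters):
    # Decrement every counter in the read set, one CAS loop each
    # (lines 54-58); done once, at the start of tryCommit.
    for rcounter in read_set_counters:
        while True:
            m = rcounter.read()
            if rcounter.cas(m, m - 1):
                break
```

The retry loop is why a read counter "can always be incremented": a reader never waits on a lock, it only re-runs the CAS if another reader or writer changed the counter in between.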

Properties of PermiSTM. Since PermiSTM is lock-based, it is easier to argue that it preserves opacity than in implementations that do not use locks. Specifically, an update transaction holds locks on all items in its data set before committing, which allows update transactions to be serialized at their commit point. The algorithm ensures that an update transaction does not commit, leaving the values of the items in its write set unchanged, as long as there is a pending transaction reading one of the items to be written. A read operation reads the current value of the item, after incrementing its read counter. So, if an update transaction commits before the read counter is incremented, but the changes are not yet committed to the items, the reading transaction still maintains a consistent state, as it reads the value from the write set of the committed transaction, which is the up-to-date value of the item. Hence, a read-only transaction is serialized after the update transaction that writes to one of the read items and is last to commit. Since transactions do not decrement read counters until commit time, and since all read operations return the up-to-date value of the item, all transactions maintain a consistent view. As this holds for committed as well as aborted transactions, PermiSTM is opaque.


Next we discuss the progress properties of the algorithm. After an update transaction acquires locks on all items in its data set, it may wait for other transactions reading items in its write set to complete; it may even starve due to a continual stream of readers; thus, our STM is blocking. However, the STM guarantees strong progressiveness, as transactions are forcibly aborted only due to another committed transaction with a read-after-write conflict; since read-only transactions are never forcibly aborted, PermiSTM is MV-permissive. Furthermore, read-only transactions are obstruction-free [13]. A read-only transaction may delay due to contention with concurrent transactions updating the same read counters, but once it is running solo it is guaranteed to commit. Write, try-commit and abort operations only access the descriptor of the transaction and the items in the data set of the transaction; this may result in contention only with non-disjoint transactions. A read operation, in addition to accessing the read counter of the item, also reads the descriptor of the owning transaction, which may result in contention only with non-disjoint transactions; thus, PermiSTM is strictly disjoint-access parallel. Note that disjoint transactions may concurrently read the descriptor of a transaction owning items the transactions read; however, this does not violate strict disjoint-access parallelism. Furthermore, these disjoint transactions read from the same base object only if they all intersect with the owning transaction; this property is called 2-local contention [1] and it implies (strong) disjoint-access parallelism [2].

4 CAS-Based PermiSTM

The kCSS operation can be implemented in software from CAS [18], without sacrificing the properties of PermiSTM. However, this implementation is intricate and incurs a step complexity that can be avoided in our case. This section outlines the modifications of PermiSTM needed to obtain an STM with similar properties using CAS instead of a kCSS primitive; this results in more costly read operations.

We still wish to guarantee that an update transaction commits only in a "quiescent" configuration, in which no other transaction is reading an item in its write set. If the committing update transaction does not use kCSS, then the responsibility of "notifying" the update transaction that it cannot commit is shifted to the read operations, and they pay the extra cost of preventing the update transactions from committing in a non-quiescent configuration.

A transaction commits by changing its status from NULL to COMMITTED; a way to prevent an update transaction from committing is by invalidating its status. For this purpose, we attach a sequence number to the transaction status. Prior to committing, an update transaction reads its status, which now includes the sequence number, and repeats the following for each item in its write set: spin on the item's read counter until the read counter becomes zero, then annotate the zero with a reference to its descriptor and the status sequence number. The transaction changes its status to COMMITTED only if the sequence number of its status has not changed since it read it. Once it completes annotating all zero counters, and unless it is notified by some read operation that one of the counters changed and it is no longer "quiescent", the update transaction can commit, using only a CAS.

A read operation basically increases the read counter, and then reads the current value of the item. The only change is when it encounters a "marked" counter. If the


update transaction annotating the item has already committed, the read operation simply increases the counter. Otherwise, the read operation invalidates the status of the update transaction, by increasing its status sequence number. If more than one transaction is reading an item from the write set of the update transaction, at least one of them prevents the update transaction from committing, by changing its status sequence number.

The changes in the data structures used by the algorithm are as follows: the status of a transaction descriptor now includes the state of the transaction (NULL, COMMITTED, or ABORTED), as well as a sequence number, seq, that is used to invalidate the status; these fields are accessed atomically. The read counter, rcounter, of an item is a tuple including a counter of the number of readers, the owner transaction of the item (holding its lock), and seq, matching the status sequence number of the owner.

We reuse the core implementation of operations from Pseudocodes 1 and 2. The most crucial modification is in the protocol for incrementing the read counter, which invalidates the status of the owner transaction when increasing the item's read counter. Pseudocode 3 presents the main modifications. In order to commit, an update transaction reads the read counter of every item in its write set (lines 87-88), and when the read counter is 0, the update transaction annotates the 0 with its descriptor and status sequence number, using CAS (line 89). Finally, it sets the status to COMMITTED while increasing the status sequence number, using CAS. If the status was invalidated and the last CAS fails, the transaction re-reads the status (line 86) and goes over the procedure again. A successful CAS implies that the transaction committed while no other transaction was reading any item in its write set.

Pseudocode 3. Methods for avoiding kCSS

 84: commitTx() {
 85:   do
 86:     sts ← READ(status)
 87:     for each item in ws do
 88:       do rc ← READ(item.rcounter)  // spin until no readers
 89:       while !CAS(item.rcounter, ⟨0,rc.owner,rc.seq⟩, ⟨0,self,sts.seq⟩)  // annotated 0
       // commit in a "quiescent" configuration
 90:   while !CAS(status, ⟨NULL,sts.seq⟩, ⟨COMMITTED,sts.seq+1⟩)
 91: }

 92: incrementReadCounter(Item item) {
 93:   do
 94:     rc ← READ(item.rcounter)
 95:     if rc.owner != 0 then          // the read counter is "marked"
 96:       CAS(rc.owner.status, ⟨NULL,rc.seq⟩, ⟨NULL,rc.seq+1⟩)  // invalidate status
 97:   while !CAS(item.rcounter, rc, ⟨rc.counter+1,rc.owner,rc.seq⟩)  // increase counter
 98: }

 99: decrementReadCounters() {
100:   for each item in rs do
101:     do rc ← READ(item.rcounter)
102:     while !CAS(item.rcounter, rc, ⟨rc.counter−1,0,0⟩)  // clean and decrease counter
103: }


To ensure that an update transaction only commits in a "quiescent" configuration, a read operation that finds the read counter of the item "marked" (lines 94-95) continues as follows: it uses CAS to invalidate the status of the owner transaction, by increasing its sequence number (line 96); if the status sequence number has changed, either the owner has committed or its status was already invalidated. Finally, the reader transaction simply increases the read counter using CAS (line 97). If increasing the read counter fails, the reader repeats the procedure. While decreasing the read counters, the reader transaction cleans each read counter by setting its owner and seq fields to 0 (line 102). In addition, methods such as tryCommit and abort are adjusted to handle the new structure, for example, accessing the read counter and state indicator through the new rcounter and status fields.

The resulting algorithm is not strictly disjoint-access parallel. Two transactions, T1 and T2, reading items a and b, respectively, may access the same base object when checking and invalidating the status of a third transaction, T3, updating these items. The algorithm, however, has 2-local contention [1] and is (strongly) disjoint-access parallel, as this memory contention is always due to T3, which intersects both T1 and T2.
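The handshake between a committing writer and an arriving reader can be illustrated sequentially in a few lines of Python. This is an illustrative trace under our own modelling assumptions (tuple-valued cells with a lock-based `cas`; the names `status`, `rcounter`, `'writer'` are ours): the reader that finds the annotated zero bumps the status sequence number, so the writer's final commit CAS fails and it must retry.

```python
import threading

class AtomicCell:
    """One word supporting READ and CAS (a lock stands in for atomicity)."""
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def read(self):
        return self.value

    def cas(self, exp, new):
        with self._lock:
            if self.value == exp:
                self.value = new
                return True
            return False

# A writer's status is a (state, seq) pair; an item's counter is
# a (count, owner, seq) tuple, as in the modified data structures.
status = AtomicCell(('NULL', 7))
rcounter = AtomicCell((0, 0, 0))

# Writer, step 1: the counter is zero, so annotate it with the writer's
# identity and its current status sequence number (line 89).
state, seq = status.read()
annotated = rcounter.cas((0, 0, 0), (0, 'writer', seq))

# A reader arrives: the counter is marked, so it first invalidates the
# writer's status by bumping the seq (line 96), then increments the
# counter (line 97).
cnt, owner, s = rcounter.read()
if owner != 0:
    status.cas(('NULL', s), ('NULL', s + 1))  # invalidate the status
incremented = rcounter.cas((cnt, owner, s), (cnt + 1, owner, s))

# Writer, step 2: the commit CAS (line 90) fails, since the status seq
# changed -- the configuration is no longer quiescent, so it must retry.
committed = status.cas(('NULL', seq), ('COMMITTED', seq + 1))
```

Had the reader arrived after the writer's commit CAS, the invalidation CAS would have failed harmlessly (the status would no longer be `('NULL', seq)`), which is exactly the case where the reader "simply increases the counter".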

5 Discussion

This paper presents PermiSTM, a single-version STM that is both MV-permissive and strongly progressive; it is also disjoint-access parallel. PermiSTM has a simple design, based on read counters and locks, that provides consistency without incremental validation. This also simplifies the correctness argument. The first variant of PermiSTM uses a k-compare-single-swap to commit update transactions. No architecture currently provides kCSS in hardware, but it can be supported by best-effort hardware transactional memory (cf. [4]).

In PermiSTM, update transactions are not obstruction-free [13], since they may block due to other conflicting transactions. Indeed, a single-version, obstruction-free STM cannot be strictly disjoint-access parallel [10]. Read-only transactions modify the read counters of all items in their read set. This matches the lower bound for read-only transactions that never abort, for (strongly) disjoint-access parallel STMs [2].

Several design principles of PermiSTM are inspired by TLRW [6], which uses read-write locks. TLRW, however, is not permissive, as read-only transactions may abort due to a timeout while attempting to acquire a lock. We avoid this problem by tracking readers through read counters (somewhat similar to SkySTM [17]) instead of read locks.

Our algorithm improves on the multi-versioned UP-MV STM [20], which is not weakly disjoint-access parallel (nor strictly disjoint-access parallel), as it uses a global transaction set, holding the descriptors of all completed transactions yet to be collected by the garbage collection mechanism. UP-MV STM requires that operations execute atomically; its progress properties depend on the precise manner in which this atomicity is guaranteed, which is not detailed. We remark that simply enforcing atomicity with a global lock or a mechanism similar to TL2 locking [5] could make the algorithm blocking.

Acknowledgements. We thank the anonymous referees for helpful comments.

94

H. Attiya and E. Hillel

References

1. Afek, Y., Merritt, M., Taubenfeld, G., Touitou, D.: Disentangling multi-object operations. In: PODC 1997, pp. 111–120 (1997)
2. Attiya, H., Hillel, E., Milani, A.: Inherent limitations on disjoint-access parallel implementations of transactional memory. In: SPAA 2009, pp. 69–78 (2009)
3. Aydonat, U., Abdelrahman, T.: Serializability of transactions in software transactional memory. In: TRANSACT 2008 (2008)
4. Dice, D., Lev, Y., Marathe, V.J., Moir, M., Nussbaum, D., Olszewski, M.: Simplifying concurrent algorithms by exploiting hardware transactional memory. In: SPAA 2010, pp. 325–334 (2010)
5. Dice, D., Shalev, O., Shavit, N.: Transactional locking II. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 194–208. Springer, Heidelberg (2006)
6. Dice, D., Shavit, N.: TLRW: Return of the read-write lock. In: SPAA 2010, pp. 284–293 (2010)
7. Ennals, R.: Software transactional memory should not be obstruction-free. Technical Report IRC-TR-06-052, Intel Research Cambridge (2006)
8. Gramoli, V., Harmanci, D., Felber, P.: Towards a theory of input acceptance for transactional memories. In: Baker, T.P., Bui, A., Tixeuil, S. (eds.) OPODIS 2008. LNCS, vol. 5401, pp. 527–533. Springer, Heidelberg (2008)
9. Guerraoui, R., Henzinger, T.A., Singh, V.: Permissiveness in transactional memories. In: Taubenfeld, G. (ed.) DISC 2008. LNCS, vol. 5218, pp. 305–319. Springer, Heidelberg (2008)
10. Guerraoui, R., Kapalka, M.: On obstruction-free transactions. In: SPAA 2008, pp. 304–313 (2008)
11. Guerraoui, R., Kapalka, M.: On the correctness of transactional memory. In: PPoPP 2008, pp. 175–184 (2008)
12. Guerraoui, R., Kapalka, M.: The semantics of progress in lock-based transactional memory. In: POPL 2009, pp. 404–415 (2009)
13. Herlihy, M., Luchangco, V., Moir, M., Scherer III, W.N.: Software transactional memory for dynamic-sized data structures. In: PODC 2003, pp. 92–101 (2003)
14. Israeli, A., Rappoport, L.: Disjoint-access-parallel implementations of strong shared memory primitives. In: PODC 1994, pp. 151–160 (1994)
15. Kapalka, M.: Theory of Transactional Memory. PhD thesis, EPFL (2010)
16. Keidar, I., Perelman, D.: On avoiding spare aborts in transactional memory. In: SPAA 2009, pp. 59–68 (2009)
17. Lev, Y., Luchangco, V., Marathe, V.J., Moir, M., Nussbaum, D., Olszewski, M.: Anatomy of a scalable software transactional memory. In: TRANSACT 2009 (2009)
18. Luchangco, V., Moir, M., Shavit, N.: Nonblocking k-compare-single-swap. In: SPAA 2003, pp. 314–323 (2003)
19. Napper, J., Alvisi, L.: Lock-free serializable transactions. Technical Report TR-05-04, The University of Texas at Austin (2005)
20. Perelman, D., Fan, R., Keidar, I.: On maintaining multiple versions in STM. In: PODC 2010, pp. 16–25 (2010)
21. Perelman, D., Keidar, I.: SMV: Selective Multi-Versioning STM. In: TRANSACT 2010 (2010)
22. Riegel, T., Felber, P., Fetzer, C.: A lazy snapshot algorithm with eager validation. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 284–298. Springer, Heidelberg (2006)
23. Saha, B., Adl-Tabatabai, A.-R., Hudson, R.L., Cao Minh, C., Hertzberg, B.: McRT-STM: a high performance software transactional memory system for a multi-core runtime. In: PPoPP 2006, pp. 187–197 (2006)

Correctness of Concurrent Executions of Closed Nested Transactions in Transactional Memory Systems

Sathya Peri1,* and Krishnamurthy Vidyasankar2

1 Indian Institute of Technology Patna, India ([email protected])
2 Memorial University, St John’s, Canada ([email protected])

Abstract. A generally agreed upon requirement for correctness of concurrent executions in Transactional Memory systems is that all transactions, including the aborted ones, read consistent values. Opacity is a recently proposed correctness criterion that satisfies this requirement. Our first contribution in this paper is extending the opacity definition to closed nested transactions. Second, we define conflicts appropriate for the optimistic executions commonly used in Software Transactional Memory systems. Using these conflicts, we define a restricted, conflict-preserving class of opacity for closed nested transactions, membership in which can be tested in polynomial time. As our third contribution, we propose a correctness criterion that defines a class of schedules in which aborted transactions do not affect the consistency of the other transactions. We define a conflict-preserving subclass of this class as well. Both the class definitions and the conflict definition are new for nested transactions.

1 Introduction

In recent years, Software Transactional Memory (STM) has garnered significant interest as an elegant alternative for developing concurrent code. Importantly, transactions provide a very promising approach for composing software components. Composing simple transactions into a larger transaction is an extremely useful property which forms the basis of modular programming. This is achieved through nesting: a transaction is nested if it is invoked by another transaction. STM systems ensure that transactions are executed atomically. That is, each transaction is either executed to completion, in which case it is committed and its effects are visible to other transactions, or aborted, in which case the effects of a partial execution, if any, are rolled back. In a closed nested transaction [2], the commit of a sub-transaction is local; its effects are visible only to its parent. When the top-level transaction (of the nested computation) commits, the effects of the sub-transaction are visible to other top-level transactions. The abort of a sub-transaction is also local; the other sub-transactions and the top-level transaction are not affected by its abort.1 To achieve atomicity, a commonly used approach for software transactions is optimistic synchronisation (term used in [6]). In this approach, each transaction has local

* This work was done when the author was a Post-doctoral Fellow at Memorial University.
1 Apart from Closed nesting, Flat and Open nesting [2] are the other means of nesting in STMs.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 95–106, 2011. c Springer-Verlag Berlin Heidelberg 2011


buffers where it records the values read and written in the course of its execution. When the transaction completes, the contents of its buffers are validated. If the values in the buffers form a consistent view of the memory, then the transaction is committed and the values are merged into the memory. If the validation fails, the transaction is aborted and the buffer contents are ignored. The notion of buffers extends naturally to closed nested transactions. When a sub-transaction is invoked, new buffers are created for all the data items it accesses. The contents of the buffers are merged with its parent's buffers when the sub-transaction commits. A commonly accepted correctness requirement for concurrent executions in STM systems is that all transactions, including aborted ones, read consistent values. The values resulting from any serial execution of transactions are assumed to be consistent. Then, for each transaction in a concurrent execution, there should exist a serial execution of some of the transactions giving rise to the values read by that transaction. Guerraoui and Kapalka [5] captured this requirement as opacity. An implementation of opacity has been given in [8]. On the other hand, the recent understanding (Doherty et al. [3], Imbs et al. [7]) is that opacity is too strong a correctness criterion for STMs. Weaker notions have been proposed: (i) the requirement of a single equivalent serial schedule is replaced by allowing possibly different equivalent serial schedules for committed transactions and for each aborted transaction, and these schedules need not be compatible; and (ii) the effects, namely the read steps, of aborted transactions should not affect the consistency of the transactions executed subsequently. The first point refines the consistency notion for aborted transactions. (All the proposals insist on a single equivalent serial schedule consisting of all committed transactions.)
The second point is a desirable property for transactions in general and a critical point for nested transactions, where the reads of an aborted sub-transaction may prohibit committing the entire top-level transaction. The above proposals in the literature have been made for non-nested transactions. In this paper, we define two notions of correctness and corresponding classes of schedules: Closed Nested Opacity (CNO) and Abort-Shielded Consistency (ASC). In the first notion, read steps of aborted (sub-)transactions are included in the serialization, as in opacity [5, 8]. In the second, they are discarded. These definitions turn out to be non-trivial due to the fact that an aborted sub-transaction may have some (locally) committed descendants and, similarly, some committed ancestors. Checking opacity, like general serializability (for instance, view-serializability), cannot be done efficiently. Just as restricted classes of serializability allow a polynomial membership test and facilitate online scheduling, restricted classes of opacity can also be defined. We define such classes along the lines of conflict-serializability for database transactions: Conflict-Preserving Closed Nested Opacity (CP-CNO) and Conflict-Preserving Abort-Shielded Consistency (CP-ASC). Our conflict notion is tailored for optimistic execution of the sub-transactions, rather than being defined between any two conflicting operations. We give an algorithm for checking membership in CP-CNO, which can easily be modified for CP-ASC as well. The algorithm uses serialization graphs similar to those in [12]. Using this algorithm, an online scheduler implementing these classes can be designed.


We note that all online schedulers (implementing 2PL, timestamp, optimistic approaches, etc.) for database transactions allow only subclasses of conflict-serializable schedules. We believe, similarly, that all STM schedulers can only allow subclasses of conflict-preserving schedules satisfying opacity or any of its variants. Such schedulers are likely to use mechanisms simpler than serialization graphs, as in the database area. An example is the scheduler described by Imbs and Raynal [8]. There have been many implementations of nested transactions in the past few years [2, 10, 1, 9]. However, none of them provides precise correctness criteria for closed nested transactions that can be efficiently verified. In [2], the authors provide correctness criteria for open nested transactions which can be extended to closed nested transactions as well. Their correctness criteria also look for a single equivalent serial schedule of both (read-only) aborted transactions and committed transactions. Roadmap: In Section 2, we describe our model and background. In Section 3, we define CNO and CP-CNO. In Section 4, we present ASC and CP-ASC. Section 5 concludes the paper.

2 Background and System Model

A transaction is a piece of code in execution. In the course of its execution, a nested transaction performs read and write operations on memory and invokes other transactions (also referred to as sub-transactions). A computation of nested transactions constitutes a computation tree. The operations of the computation are classified as simple-memory operations and transactions. Simple-memory operations are reads or writes on memory. In this document, when we refer to a transaction in general, it could be a top-level transaction or a sub-transaction. Collectively, we refer to transactions and simple-memory operations as nodes (of the computation tree) and denote them as nid. If a transaction tX executes successfully to completion, it terminates with a commit operation denoted cX. Otherwise it aborts, denoted aX. Abort and commit operations are called terminal operations.2 By default, all the simple-memory operations are always considered to be (locally) committed. In our model, transactions can interleave at any level. Hence the child sub-transactions of any transaction can execute in an interleaved manner. To perform a write operation on a data item x, a closed-nested transaction tP creates an x-buffer (if it is not already present) and writes to the buffer. A buffer is created for every data item tP accesses. When tP commits, it merges the contents of its local buffers with the buffers of its parent. Any peer (sibling) transaction of tP can read the values written by tP only after tP commits. We assume that there exists a hypothetical root transaction of the computation tree, denoted tR, which invokes all the top-level transactions. On system initialization, we assume that there exists a child transaction tinit of tR, which creates and initializes all the buffers of tR that are written or read by any descendant of tR. Similarly, we also assume that there exists a child transaction tfin of tR, which reads the contents of tR's buffers when the computation terminates.

2 A transaction starts with a begin operation. In our model we assume that the begin operation is superimposed with the first event of the transaction. Hence, we do not explicitly represent it in our schedules.
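The closed-nested buffer discipline described above (writes go to a local buffer, a commit merges the buffers into the parent, reads walk up the ancestor chain) can be sketched as follows. This is our own minimal illustration, not an implementation from the paper; the class and method names are hypothetical.

```python
class Txn:
    """Sketch of closed-nested buffer semantics: each transaction keeps
    local buffers; commit merges them into the parent; a read walks up
    the ancestor chain to the nearest buffer holding the item."""
    def __init__(self, parent=None):
        self.parent = parent
        self.buffers = {}

    def write(self, x, v):
        self.buffers[x] = v          # creates the x-buffer if absent

    def read(self, x):
        t = self
        while t is not None:         # walk up to the nearest x-buffer
            if x in t.buffers:
                return t.buffers[x]
            t = t.parent
        raise KeyError(x)            # cannot happen once t_init seeds the root

    def commit(self):
        self.parent.buffers.update(self.buffers)   # merge into parent

    def abort(self):
        self.buffers.clear()         # local effects are discarded

# usage: a write becomes visible outside the sub-transaction only at commit
root = Txn()
root.buffers["x"] = 0                # as if initialized by t_init
child = Txn(parent=root)
child.write("x", 5)
assert root.read("x") == 0           # not yet visible outside child
child.commit()
assert root.read("x") == 5
```

An aborted sub-transaction simply drops its buffers, so its writes never reach its parent, matching the locality of aborts described in the introduction.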


Coming to reads, a transaction maintains a read set consisting of all its read operations. We assume that for a transaction to read a data item, say x, it has access (unlike for writes) to the buffers of all its ancestors apart from its own. To read x, a nested sub-transaction tN starts with its local buffers. If they do not contain an x-buffer, tN continues to read the buffers of its ancestors, starting from its parent, until it encounters a transaction that contains an x-buffer. Since tR's buffers have been initialized by tinit, tN will eventually read a value for x. When the transaction commits, its read set is merged with its parent's read set. We will revisit read operations a few subsections later.

2.1 Schedules

A schedule is a totally ordered sequence (in real-time order) of simple-memory operations and terminal operations of transactions in a computation. These operations are referred to as events of the schedule. A schedule is represented by the tuple ⟨evts, nodes, ord⟩, where evts is the set of all events in the schedule, nodes is the set of all the nodes (transactions and simple-memory operations) present in the computation, and ord is a function that totally orders all the events in the order of execution. Example 1 shows a schedule, S1. In this schedule, the memory operations r2211(x) and w2212(y) belong to the transaction t221. Transactions t22 and t31 are aborted; all the other transactions are committed. It must be noted that t221 and t222 are committed sub-transactions of the aborted transaction t22.

Example 1
S1: r111(z) w112(y) w12(z) c11 r211(b) r2211(x) w2212(y) c221 w212(y) c21 w13(y) c1 r2221(y) w2222(z) c222 a22 w23(z) r311(y) c2 w312(y) a31 r321(z) w322(z) c32 c3

The events of the schedule are the real-time representation of the leaves of the computation tree. The computation tree for schedule S1 is shown in Figure 1. The order of execution of memory operations is from left to right as shown in the tree.
The dotted edges represent terminal operations. The terminal operations are not part of the computation tree but are represented here for clarity.

Fig. 1. Computation tree for Example 1 (tree not reproduced; its leaves, read left to right, are the events of S1)


For a closed nested transaction, all its write operations are visible to other transactions only after it commits. In S1, w212(y) occurs before w13(y). When t1 commits, it writes w13(y) onto tR's y-buffer. But t2 commits after t1 commits. When t2 commits, it overwrites tR's y-buffer with w212(y). Thus when transaction t31 performs the read operation r311(y), it reads the value written by w212(y) and not w13(y), even though w13(y) occurs after w212(y). To model the effects of commits clearly, we augment a schedule with extra write operations. For each transaction that commits, we introduce a commit-write operation for each data item x that the transaction writes to, or that one of its children commit-writes; it writes the latest value in the transaction's x-buffer. The commit-writes are added just before the commit operation and represent the merging of the local buffers with the parent's buffers. Using this representation (with the commit-write of tX for its write wY denoted wX^Y), the schedule for Figure 1 is:

Example 2
S2: r111(z) w112(y) w12(z) w11^112(y) c11 r211(b) r2211(x) w2212(y) w221^2212(y) c221 w212(y) w21^212(y) c21 w13(y) w1^12(z) w1^13(y) c1 r2221(y) w2222(z) w222^2222(z) c222 a22 w23(z) r311(y) w2^21(y) w2^23(z) c2 w312(y) a31 r321(z) w322(z) w32^322(z) c32 w3^32(z) c3

Some examples of commit-writes in S2 are w11^112(y), w21^212(y), w2^23(z), etc. The commit-write w11^112(y) represents t11's write onto t1's y-buffer with the value written by w112. There are no commit-writes for aborted transactions; hence the writes of aborted transactions are not visible to their peers. Originally, in the computation tree, only the leaf nodes could write. With this augmentation, even non-leaf nodes (i.e., committed transactions) write, via commit-write operations. For the sake of brevity, we do not represent commit-writes in the computation tree. In the rest of this document, we assume that all the schedules we deal with are augmented with commit-writes.
Generalizing the notion of commit-writes to any node of the tree, the commit-write of a simple-memory write is the write itself; it is nil for a read operation and for aborted transactions. Collectively, we refer to simple-memory operations along with commit-write operations as memory operations. With commit-write operations, we extend the definition of an operation, denoted oX, to represent a transaction, a commit-write operation or a simple-memory operation. It can be seen that a schedule partially orders all the transactions and simple-memory operations in the computation. This partial order is called schedule-partial-order and is denoted <S. For a transaction tX in S, we define S.tX.first and S.tX.last as the first and last operations of tX. Thus, S.tX.last denotes the terminal operation of tX. For a simple-memory operation mX, S.mX.first = S.mX.last. For two nodes nX, nY in S: (nX <S nY) ≡ (S.ord(S.nX.last) < S.ord(S.nY.first)).

2.2 Function Definitions

For a commit-write operation wX, we define its holder, S.holder(wX), as the transaction tX to which it belongs. Extending this function to a node (a transaction or simple-memory operation), the holder of a node is itself. For any operation oX, we define S.level(oX) as the distance of S.holder(oX) in the tree from the root. By this definition, tR is at level 0. The level of a transaction and all its commit-write operations are the same. For instance, in Example 2, S2.level(w21^212) = S2.level(t21) = 2.


The functions on a tree, namely parent, children, ancestor, descendant, peer (siblings), can be extended to commit-write operations by defining them for S.holder(oX) over the tree. For instance, in S2 of Example 2, S2.parent(w2^21) = tR and S2.children(w2^21) = {t21, t22, w23(z)}. Thus transactions and simple-memory operations are children of a transaction. Two commit-writes of the same node are not peers of each other since they have the same holder. For a transaction tX in a computation, we define its dSet, denoted S.dSet(tX), as the set consisting of tX, tX's commit-writes, tX's begin and terminal operations, and the dSets of tX's descendants (including its children). This set comprises all the operations in the sub-tree of tX. A simple-memory operation's dSet is itself. A commit-write's dSet is its holder transaction's dSet. In Example 2, S2.dSet(t2) = S2.dSet(w2^23(z)) = {t2, t21, t22, w23(z), r211(b), w212(y), w21^212(y), c21, t221, r2211(x), w2212(y), w221^2212(y), c221, t222, r2221(y), w2222(z), w222^2222(z), c222, a22, w2^21(y), w2^23(z), c2}. Next, we define a boolean function optVis on two operations oX, oY in a schedule S, denoted S.optVis(oY, oX). It is true if oY is a peer of oX or a peer of an ancestor of oX, i.e., oY ∈ (S.peers(oX) ∪ S.peers(S.ansc(oX))); otherwise it is false. This definition implies that if oX ∈ S.dSet(oY), then S.optVis(oY, oX) is false. As a result, for any commit-write of oY, say wY, S.optVis(wY, oX) is false as well. One can see that the optVis function is not symmetric (but not asymmetric either); hence S.optVis(oY, oX) does not imply S.optVis(oX, oY). In S2, S2.optVis(w1^12(z), r211(b)) is true, as w1^12(z) is a peer of t2, which is an ancestor of r211(b). Similarly, S2.optVis(t3, r2221(y)) is true because t3 is a peer of t2, which is an ancestor of r2221(y). But S2.optVis(r2221(y), t3) is false.
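The tree functions and optVis can be made concrete with a small sketch over a fragment of the computation tree of Fig. 1. The encoding (a parent map keyed by node names, a holder callback for commit-writes) is our own illustration, not the paper's notation.

```python
# Parent links for part of the computation tree of Fig. 1 (our encoding).
parent = {
    "t1": "tR", "t2": "tR", "t3": "tR",
    "t21": "t2", "t22": "t2", "w23(z)": "t2",
    "t221": "t22", "t222": "t22",
    "r211(b)": "t21",
    "r2221(y)": "t222",
}

def ancestors(n):
    """Chain of ancestors of n, from its parent up to the root tR."""
    out = []
    while n in parent:
        n = parent[n]
        out.append(n)
    return out

def peers(n):
    """Siblings of n: nodes sharing n's parent, excluding n itself."""
    p = parent.get(n)
    return {m for m, q in parent.items() if q == p and m != n}

def opt_vis(oY, oX, holder=lambda o: o):
    """S.optVis(oY, oX): true iff oY's holder is a peer of oX or of an
    ancestor of oX. For a commit-write, pass its holder transaction."""
    hY = holder(oY)
    return any(hY in peers(n) for n in [oX] + ancestors(oX))
```

The sketch reproduces the examples in the text: optVis(w1^12(z), r211(b)) holds because w1^12(z)'s holder t1 is a peer of t2, an ancestor of r211(b); optVis(t3, r2221(y)) holds; and optVis(r2221(y), t3) does not.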
We denote by S.schOps(tX) the set of operations in S.dSet(tX) which are also present in S.evts. Formally, S.schOps(tX) = S.dSet(tX) ∩ S.evts. We define a few notations based on aborted transactions in a schedule S. For a transaction tX, we denote by S.abort(tX) the set of all aborted transactions in tX's dSet; it includes tX as well, if it is aborted. We define S.prune(tX) as all the events in the schOps of tX after removing the events of all aborted transactions in tX. Formally, S.prune(tX) = S.schOps(tX) − ∪_{tA ∈ S.abort(tX)} S.schOps(tA). If tX has no aborted transaction in its dSet, then S.prune(tX) is the same as S.schOps(tX). If tX itself is an aborted transaction, then its pruned set is nil.

2.3 Writes for Read Operations and Well-Formedness

For a read operation rX(z) belonging to a transaction tP in S, we associate a write wY(z) as its lastWrite3, or S.lastWrite(rX(z)). The read operation retrieves the value written by the lastWrite. We want the lastWrite wY(z) to satisfy the following properties: (1) wY occurs before rX in the schedule; (2) wY is optVis to rX; (3) the value written by wY is in the z-buffer of an ancestor (starting from its parent tP) closest to rX in terms of level; and (4) if there are multiple writes satisfying the above conditions, then wY is the one closest to rX in the schedule S. The lastWrite definition ensures that all transactions read values only from committed nodes, i.e., a committed transaction or a simple-write operation. Having the lastWrite be

3 This term is inspired by [2].


optVis to the read operation ensures that the buffer in which the lastWrite writes is accessible by the read operation. In S2, the lastWrites are: (r111(z): winit(z)), (r211(b): winit(b)), (r2211(x): winit(x)), (r2221(y): w221^2212(y)), (r311(y): w1^13(y)), (r321(z): w2^23(z)). Note that the read r2221(y) reads from w221^2212(y) even though w1^13(y) is closer to r2221(y) in the schedule; this is because w221^2212(y) is closer to it in terms of level. For a node nP with a read operation rX in its dSet, the read is said to be an external-read if its lastWrite is not in nP's dSet. Thus a read operation rX is an external-read of itself. It can be seen that a nested transaction interacts with its peers through external-reads and commit-writes. Thus, a nested transaction can be treated as a non-nested transaction consisting only of its external-reads and commit-writes. The external-reads and commit-writes of a transaction constitute its extOpsSet. A schedule is called well-formed if it satisfies: (1) validity of transaction limits: after a transaction executes a terminal operation, no operation (memory or terminal) belonging to it can execute; and (2) validity of read operations: every read operation reads the value written by its lastWrite operation. We assume that all the schedules we deal with are well-formed.

2.4 Serial Schedules

For the case of non-nested transactions, a serial schedule is a schedule in which all the transactions execute serially (as the name suggests) without any interleaving. For a nested transaction, we define a serial schedule SS as one where, for every transaction tX in SS, its children (both transactions and simple-memory operations) are totally ordered. Formally, ∀tX ∈ SS.trans, ∀{nY, nZ} ⊆ SS.children(tX): (SS.ord(nY.last) < SS.ord(nZ.first)) ∨ (SS.ord(nZ.last) < SS.ord(nY.first)). Thus in a serial schedule, all the events in the dSet of a transaction appear contiguously.
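The lastWrite selection rules of Subsection 2.3 can be sketched with a simplified encoding: among the preceding, optVis writes, prefer the one writing the buffer of the closest ancestor (encoded here by the larger holder level), breaking ties by schedule position. The tuple encoding and function name are ours, and this glosses over the full buffer machinery.

```python
def last_write(read_pos, candidates):
    """Simplified lastWrite selection. `candidates` is a list of
    (name, position, level, opt_vis) tuples (our encoding):
    keep writes that precede the read and are optVis to it, then take
    the maximum by (level, position): deeper buffer first, then the
    write closest to the read in the schedule."""
    eligible = [c for c in candidates if c[1] < read_pos and c[3]]
    return max(eligible, key=lambda c: (c[2], c[1]))[0]

# r2221(y) in S2: w1^13(y) (holder level 1) vs w221^2212(y) (level 3);
# the deeper buffer wins even though w1^13(y) is later in the schedule.
cands = [("winit(y)", 0, 1, True),
         ("w221^2212(y)", 8, 3, True),
         ("w1^13(y)", 14, 1, True)]
print(last_write(16, cands))   # -> w221^2212(y)
```

This matches the example in the text: w221^2212(y) is the lastWrite of r2221(y) because it is closer in terms of level, not schedule distance.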

3 Conflict Preserving Closed Nested Opacity

3.1 Closed Nested Opacity

Guerraoui and Kapalka [5] proposed the notion of opacity as a correctness criterion for software transactions. A schedule, consisting of an execution of transactions, is said to be opaque if there is an equivalent serial schedule that respects the original schedule's real-time ordering of the nodes, and in which the lastWrite of every read operation, including the reads of aborted transactions, is the same as in the original schedule. Opacity ensures that all the reads are consistent. An implementation of opacity for non-nested transactions is given in [8], in which aborted transactions are treated as read-only (with read steps executed before the abort) when looking for an equivalent serial schedule consisting of all the transactions. In the context of nested transactions, an aborted transaction can have a committed sub-transaction whose values are read by other sub-transactions. For instance, in S2, the aborted transaction t22's sub-transactions t221 and t222 are committed. The read operation r2221(y) of t222 reads from t221. This shows that some writes of aborted (sub-)transactions should also be considered for the correctness of other sub-transactions. On


the other hand, a committed transaction can have aborted sub-transactions whose write values should be omitted. In our characterization of schedules, aborted transactions do not have commit-writes. Thus an aborted transaction's writes do not affect any of its peers or ancestors. But committed sub-transactions of an aborted transaction can have commit-writes, and other sub-transactions can read from them. Thus, using our representation, opacity can be extended to closed nested transactions. Formally, we define a class of schedules called Closed Nested Opacity, or CNO, as follows: a schedule S belongs to CNO if there exists a serial schedule SS such that: (1) event equivalence: the operations of S and SS are the same; (2) schedule-partial-order equivalence: for any two nodes nY, nZ that are peers in the computation tree represented by S, if nY occurs before nZ in S, then nY occurs before nZ in SS as well; (3) lastWrite equivalence: the lastWrites of all read operations in S and SS are the same. Even though the definition of CNO is similar to opacity, the lastWrite-equivalence condition captures the intricacies of nested transactions. This class ensures that the reads of all the transactions, including all the sub-transactions of aborted transactions, are consistent.

3.2 Conflict Notion: optConf

Checking opacity, like general serializability (for instance, view-serializability), cannot be done efficiently. Restricted classes of serializability (like conflict-serializability) have been defined based on conflicts, which allow a polynomial-time membership test and facilitate online scheduling. Along the same lines, we define a subclass of CNO, CP-CNO. This subclass is defined based on a new conflict notion, optConf, for closed nested transactions. It is tailored for optimistic execution of sub-transactions. This notion is similar to the idea of conflicts presented in [4] for non-nested transactions.
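Before the formal definition, the r-w / w-r / w-w classification used by optConf can be previewed with a minimal sketch. The dictionary encoding (kind, item, pos) and the function name are our own, not the paper's notation; only pairs on the same buffer, in schedule order, and involving at least one commit-write conflict.

```python
def opt_conf(m1, m2):
    """isOptConf sketch: m1 must precede m2 and touch the same buffer;
    kinds are 'ext-read' or 'commit-write' (our encoding). Returns the
    conflict type, or None if the pair does not conflict."""
    if m1["item"] != m2["item"] or m1["pos"] >= m2["pos"]:
        return None
    k1, k2 = m1["kind"], m2["kind"]
    if k1 == "ext-read" and k2 == "commit-write":
        return "r-w"
    if k1 == "commit-write" and k2 == "ext-read":
        return "w-r"
    if k1 == "commit-write" and k2 == "commit-write":
        return "w-w"
    return None  # two external-reads never conflict

# From the S2 conflict list: r311(y) precedes the commit-write w2^21(y).
r311 = {"kind": "ext-read", "item": "y", "pos": 20}
w2_21 = {"kind": "commit-write", "item": "y", "pos": 22}
print(opt_conf(r311, w2_21))  # -> r-w
```

Restricting conflicts to commit-writes (rather than simple writes) is what makes the notion suitable for optimistic execution, as argued under "Benefits of optConf" below the formal definition.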
The conflict notion optConf is defined only between memory operations in the extOpsSets (defined in Subsection 2.3) of two peer nodes. As explained earlier, a node (or transaction) interacts with its peer nodes through its extOpsSet. Consider two peer nodes nA, nB. For two memory operations mX, mY on the same data buffer in the extOpsSets of nA, nB, S.isOptConf(mX, mY) is true if mX occurs before mY in S and one of the following conditions holds: (1) r-w conflict: mX is an external-read rX of nA and mY is a commit-write wY of nB; or (2) w-r conflict: mX is a commit-write wX of nA and mY is an external-read rY of nB; or (3) w-w conflict: mX is a commit-write wX of nA and mY is a commit-write wY of nB. Consider a read rX that is in optConf with a write wY, and let rX's lastWrite be wL. By defining the conflicts in this manner, we ensure that wL is in w-r conflict with rX; if wY is also in w-r conflict with rX, then the w-w conflict between wL and wY ensures that wY does not become rX's lastWrite in any optConf-equivalent serial schedule. Similarly, if wY is in r-w conflict with rX, then it cannot become rX's lastWrite in the equivalent serial schedule. For S2 in Example 2, we get the set of conflicts: (r111(z), w12(z)), (r111(z), w2^23(z)), (w11^112(y), w13(y)), (w1^12(z), w2^23(z)), (w1^13(y), w221^2212(y)), (w221^2212(y), r2221(y)), (w1^13(y), r311(y)), (r311(y), w2^21(y)), (r311(y), w312(y)), (w1^12(z), r321(z)), (w2^23(z), r321(z)), (r321(z), w322(z)). It must be noted that there is no optConf between w1^13(y)


and r2221(y), or between w21^212(y) and r2221(y), even though w1^13(y) and w21^212(y) are optVis to r2221(y). This is because the level of w221^2212(y) (which is the lastWrite of r2221(y)) is greater than that of w1^13(y) and w21^212(y). Hence r2221(y) is not an external-read of any peer of w1^13(y) or w21^212(y). Using optConf, we define a class of schedules called Conflict-Preserving Closed Nested Opacity, or CP-CNO. It differs from CNO in condition (3) of Subsection 3.1: lastWrite equivalence is replaced by optConf implication: if two memory operations are in optConf in S, then they are also in optConf in SS. Since optConf implication subsumes lastWrite equivalence, we have:

Theorem 1. If a schedule S is in the class CP-CNO, then it is also in CNO.

Benefits of optConf: Traditionally, two memory operations are said to be in conflict if one of them is a (simple) write operation. In STM systems that employ optimistic synchronization, a write of a transaction becomes visible only after it has committed. In this case, for conflicts to be meaningful, two memory operations are said to be in conflict if one of them is a commit-write operation (and not a simple write). Refining the conflict notion further, we define optConf only between an external-read and a commit-write operation (as well as between two commit-write operations). By defining optConf this way, the class CP-CNO is as unrestrictive as possible and yet does not compromise any desired property.

3.3 Membership Verification Algorithm

We now describe the algorithm for testing membership in the class CP-CNO in polynomial time. Our algorithm is based on the graph construction algorithm by Resende and Abbadi [12], adapted to optConf. For a schedule S, the algorithm constructs a conflict graph based on optConfs, denoted S.optGraph, and checks the acyclicity of that graph. We call this the optGraphCons algorithm. The graph S.optGraph is constructed as follows: (1) Vertices: the graph comprises all the nodes in the computation tree; the vertex for a node nX is denoted vX. (2) Edges: consider each transaction tX starting from tR. For each pair of children nP, nQ (other than tinit and tfin) in S.children(tX), we add an edge from vP to vQ as follows: (2.1) completion edges: if nP <S nQ; (2.2) conflict edges: for any two memory operations mY, mZ such that mY is in nP's dSet and mZ is in nQ's dSet, if S.isOptConf(mY, mZ) is true. Since the positions of the transactions tinit and tfin are fixed in the tree and in any schedule, we do not consider them in our graph construction algorithm. We now get the following theorem.
For a schedule S, the graph S.optGraph is acyclic if and only if S is in CP-CNO. It must be noted that in our construction all the edges are between vertices corresponding to peer nodes. There are no edges between vertices that correspond to nodes of different levels. Thus the graph constructed consists of disjoint subgraphs. If the graph is acyclic, then an equivalent serial schedule can be constructed by executing topological sort on all the subgraphs [11]. Using this algorithm, it can be verified that S2 is not in CP-CNO. Further, S2 is also not in CNO.

104

S. Peri and K. Vidyasankar

4 Abort-Shielded Consistency

Shortcoming of CNO: A single serial schedule involving all transactions, as required by CNO (and opacity), allows the reads of an aborted transaction to affect the transactions that follow it. This effect is more pronounced in nested transactions. For instance, in S2, transactions t1 and t2 write to the variables y and z. The aborted sub-transaction t31 reads y from t1, and the sub-transaction t32 reads z from t2. As a result, there is no equivalent serial schedule having the same lastWrites as in S2, and hence S2 is not in CNO. For that matter, any sub-transaction of t3 invoked after t31's invocation (such as t33, t34, etc.) that reads a variable written by t2 that has also been written by t1 will cause this schedule to be not opaque. In the worst case, all the sub-transactions of t3 invoked after t31 may satisfy this property, and a scheduler (implementing CNO) will abort all of them. This effectively aborts t3. Thus, with CNO, an aborted sub-transaction can cause its top-level transaction to abort. This can be avoided if the read operations of the aborted transactions are ignored, as described below.

4.1 ASC Class Definition

Let tA be an aborted transaction in a schedule S. If tA should not affect the transactions following it, then tA should be dropped while considering the correctness of the remaining transactions. Generalizing this idea to all aborted transactions, we construct a sub-schedule consisting of events only from committed transactions (and committed sub-transactions whose ancestors have not been aborted). Thus, the sub-schedule consists of all the events from S.prune(tR) (prune is defined in Subsection 2.2) and is denoted commitSubSchR. The ordering of the events is the same as in the original schedule. We check for the correctness of commitSubSchR.
The sub-schedule commitSubSchR for S2 is: r111(z) w112(y) w12(z) w11^112(y) c11 r211(b) w212(y) w21^212(y) c21 w13(y) w1^12(z) w1^13(y) c1 w23(z) w2^21(y) w2^23(z) c2 r321(z) w322(z) w32^322(z) c32 w3^32(z) c3.

As explained in [5], it is necessary that an aborted transaction tA also reads consistent values. To ensure this, we construct another sub-schedule of S, denoted pprefSubSchA (pruned prefix sub-schedule), for tA. We consider the prefix of all the events until tA's abort operation. From this prefix we construct the sub-schedule by removing (1) events from transactions that aborted earlier and (2) events from any aborted sub-transaction of tA. Thus, the sub-schedule consists of events from transactions that committed before tA, events from pruned sub-transactions of tA, and events from live transactions (i.e., transactions that have not yet terminated) that executed until the abort of tA. The ordering among the events is the same as in the original schedule S. Finally, for each live transaction we add a commit operation after tA's abort operation to the sub-schedule, but we do not add the commit-writes for these transactions. Then we check the correctness of this sub-schedule. In S2, for the aborted transaction t31, pprefSubSch31 is: r111(z) w112(y) w12(z) w11^112(y) c11 r211(b) w212(y) w21^212(y) c21 w13(y) w1^12(z) w1^13(y) c1 w23(z) r311(y) w2^21(y) w2^23(z) c2 w312(y) a31 c3. Similarly, the sub-schedule for every aborted transaction can be constructed. Here all the sub-schedules have events from at most one aborted transaction.

One can see that the sub-schedules commitSubSchR and pprefSubSchA for every aborted transaction tA have the property that if any event is in the sub-schedule, then any other

Correctness of Concurrent Executions of Closed Nested Transactions

105

event that is relevant to it is also in the sub-schedule. We call this property causality completeness. Hence the lastWrite for any read operation in a sub-schedule is the same as the lastWrite in the original schedule S. It can also be seen that the events of these sub-schedules form a valid sub-tree of the original computation tree represented by S. We verify the correctness of each of these sub-schedules by looking for an equivalent serial sub-schedule which has the same lastWrite for every read operation.

Based on these sub-schedules, Abort-Shielded Consistency or ASC is defined. A schedule S belongs to the class ASC if there exists a set of sub-schedules of S, denoted subSchSet, such that the sub-schedules commitSubSchR and pprefSubSchA, for every aborted transaction tA in S, are in subSchSet, and for every sub-schedule subS in subSchSet there exists a serial sub-schedule ssubS such that: (1) Sub-Schedule Event Equivalence: the operations of subS and ssubS are the same. (2) Schedule-Partial-Order Equivalence: for any two peer nodes nY, nZ in the computation tree represented by subS, if nY occurs before nZ in subS then nY occurs before nZ in ssubS as well. (3) lastWrite Equivalence: for all the read operations in ssubS, the lastWrites are the same as in subS. From this definition we get that CNO is a subset of ASC. The schedule S2 is in ASC.

Using optConfs with pprefSubSch, we define a class of schedules, Conflict-Preserving Abort-Shielded Consistency or CP-ASC. It differs from the definition of the class ASC only in condition (3), which becomes optConf Implication: if two memory operations in subS are in optConf then they are also in optConf in ssubS. Using the optGraphCons algorithm we can verify whether there exists an equivalent serial sub-schedule for each sub-schedule in subSchSet. Thus, checking whether a schedule is in CP-ASC can be done in polynomial time [11]. Further, it can also be proved that the class CP-CNO is a subset of CP-ASC.
The schedule S2 is in CP-ASC. Using the optGraphCons algorithm, an elegant online scheduler implementing CP-ASC can be designed [11]. The scheduler can be implemented in a completely distributed manner: the serialization graph has separate components for each (parent) sub-transaction. Each component can be maintained at a different site (the process executing the sub-transaction) autonomously, and the checking can be done in a distributed manner.

5 Conclusion

Concurrent executions of transactions in Transactional Memory are expected to ensure that aborted transactions, like the committed ones, read consistent values. In addition, it is desirable that aborted transactions do not affect the consistency of the other transactions. Incorporating these simple-sounding criteria has been non-trivial even for non-nested transactions, as can be seen in recent publications [5, 8, 3]. In this paper, we have considered these requirements for closed nested transactions. We have also defined new conflict-preserving classes that allow a polynomial-time membership test, by means of constructing conflict graphs and checking acyclicity. Further, a completely distributed STM scheduler can be designed using these conflict-preserving classes. Our future work includes the study of how the above two properties manifest in executions with open nested transactions and with non-transactional steps.


References

[1] Agrawal, K., Fineman, J.T., Sukha, J.: Nested parallelism in transactional memory. In: PPoPP 2008: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 163–174. ACM, New York (2008)
[2] Agrawal, K., Leiserson, C.E., Sukha, J.: Memory models for open-nested transactions. In: MSPC 2006: Proceedings of the 2006 Workshop on Memory System Performance and Correctness, pp. 70–81. ACM, New York (2006)
[3] Doherty, S., Groves, L., Luchangco, V., Moir, M.: Towards formally specifying and verifying transactional memory. In: REFINE (2009)
[4] Guerraoui, R., Henzinger, T., Singh, V.: Permissiveness in transactional memories. In: Taubenfeld, G. (ed.) DISC 2008. LNCS, vol. 5218, pp. 305–319. Springer, Heidelberg (2008)
[5] Guerraoui, R., Kapalka, M.: On the correctness of transactional memory. In: PPoPP 2008: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 175–184. ACM, New York (2008)
[6] Harris, T., Marlow, S., Peyton-Jones, S., Herlihy, M.: Composable memory transactions. In: PPoPP 2005: Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 48–60. ACM, New York (2005)
[7] Imbs, D., de Mendivil, J.R., Raynal, M.: Brief announcement: virtual world consistency: a new condition for STM systems. In: PODC 2009: Proceedings of the 28th ACM Symposium on Principles of Distributed Computing, pp. 280–281. ACM, New York (2009)
[8] Imbs, D., Raynal, M.: A lock-based STM protocol that satisfies opacity and progressiveness. In: Baker, T.P., Bui, A., Tixeuil, S. (eds.) OPODIS 2008. LNCS, vol. 5401, pp. 226–245. Springer, Heidelberg (2008)
[9] Moss, J.E.B.: Open nested transactions: semantics and support. In: Workshop on Memory Performance Issues (2006)
[10] Ni, Y., Menon, V.S., Adl-Tabatabai, A.-R., Hosking, A.L., Hudson, R.L., Moss, J.E.B., Saha, B., Shpeisman, T.: Open nesting in software transactional memory. In: PPoPP 2007: Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 68–78. ACM, New York (2007)
[11] Peri, S., Vidyasankar, K.: Correctness criteria for closed nested transactions (in preparation). Technical report, Memorial University of Newfoundland (2010)
[12] Resende, R.F., El Abbadi, A.: On the serializability theorem for nested transactions. Inf. Process. Lett. 50(4), 177–183 (1994)

Locality-Conscious Lock-Free Linked Lists Anastasia Braginsky and Erez Petrank Dept. of Computer Science, Technion - Israel Institute of Technology {anastas,erez}@cs.technion.ac.il

Abstract. We extend state-of-the-art lock-free linked lists by building linked lists with special care for locality of traversals. These linked lists are built of sequences of entries that reside on consecutive chunks of memory. When traversing such lists, subsequent entries typically reside on the same chunk and are thus close to each other, e.g., in the same cache line or on the same virtual memory page. Such cache-conscious implementations of linked lists are frequently used in practice, but making them lock-free requires care. The basic component of this construction is a chunk of entries in the list that maintains a minimum and a maximum number of entries. This basic chunk component is an interesting tool on its own and may be used to build other lock-free data structures as well.

1 Introduction

Lock-free (a.k.a. non-blocking) data structures provide a progress guarantee: if several threads attempt to concurrently apply an operation on the structure, it is guaranteed that one of the threads will make progress in finite time [7]. Many lock-free data structures have been developed since the original notion was presented [11]. Lock-free algorithms are error-prone, and modifying existing algorithms requires care. In this paper we study lock-free linked lists and propose a design for a cache-conscious linked list. The first design of lock-free linked lists was presented by Valois [12]. He maintained auxiliary nodes in between the list's normal nodes in order to resolve interference problems between concurrent operations. A different lock-free implementation of linked lists was given by Harris [6]. His main idea was to mark a node before deleting it, in order to prevent concurrent operations from changing its next-entry pointer. Harris' algorithm is simpler than Valois's, and it generally also performs better in experiments. Michael [8,10] proposed an extension to Harris' algorithm that does not assume garbage collection but reclaims entries of the list explicitly. To this end, he developed an underlying mechanism of hazard pointers that was later used for explicit reclamation in other data structures as well. An improvement in complexity was achieved by Fomitchev and Ruppert [3]: they use a smart retreat upon CAS failure, rather than the standard restart from scratch. In this paper we further extend Michael's design to allow cache-conscious linked lists. Our implementation partitions the linked list into sub-lists that

Supported by THE ISRAEL SCIENCE FOUNDATION (grant No. 845/06).

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 107–118, 2011. c Springer-Verlag Berlin Heidelberg 2011


reside on consecutive areas in memory, denoted chunks. Each chunk contains several consecutive list entries. For example, setting each chunk to be one virtual page causes list traversals to form a page-oriented memory access pattern. This partition of the list into sub-lists, each residing on a small chunk of memory, is often used in practice (e.g., [1,5]), but there has been no lock-free implementation of such a list. Breaking the list into chunks is trivial if there is no restriction on the chunk size. In particular, if the size of each chunk can decrease to a single element, then each chunk can trivially reside in a single memory block and Michael's implementation will do, but no locality improvement is obtained for list traversals. The sub-list chunk that our design provides maintains upper and lower bounds on the number of elements it holds. The upper bound simply follows from the size of the memory block on which the chunk is located, and the lower bound is provided by the user. If a chunk grows too much and cannot be held in its memory block, it is split (in a lock-free manner), creating two chunks, each residing at a separate location. Conversely, if a chunk shrinks below the lower bound, it is merged (in a lock-free manner) with the previous chunk in the list. In order for a split to create acceptable chunks, it is required that the lower bound (on the number of entries in a chunk) does not exceed half of the maximum number of entries in the chunk; otherwise, a split would create two chunks that violate the lower bound. A natural optimization when searching such a list is to jump quickly to the next chunk (without traversing all its entries) if the desired key is not within the key range of the current chunk. This gives an additional performance improvement, since the search progresses in skips, where the size of each skip is at least the chunk's minimal bound.
Furthermore, the retreat upon a CAS failure is, in the majority of cases, done by returning to the beginning of the chunk, rather than the standard restart from the beginning of the list. To summarize, the contribution of this paper is the presentation of a lock-free linked list, based on single-word CAS instructions, where the keys are unique and ordered. The algorithm does not assume a lock-free garbage collector. The list design is locality conscious. The design poses a restriction on the key and data lengths: for a 64-bit architecture the key is limited to 31 bits, and the data is limited to 32 bits.

Organization. In Section 2 we specify the underlying structure we use to implement the chunked linked list. In Section 3 we introduce the freeze mechanism that serves the split and join operations. In Section 4 we provide the implementation of the linked list functions. A closer look at the details of the freezing mechanism appears in Section 5, and we conclude in Section 6. More detailed explanations and pseudo-code can be found in the full version of this article [2].

2 Preliminaries and Data Structure

A linked list is a data structure that consists of a sequence of data records. Each data record contains a key by which the linked list is ordered. We refer to each data record as an entry. We think of the linked list as representing a set of keys,


Fig. 1. The entry structure

each associated with a data part. Following previous work [4,6], a key cannot appear twice in the list; thus, an attempt to insert a key that already exists in the list fails. Each entry holds the key and the data associated with it. Generally, this data is a pointer, or a mapping from the key to a larger piece of data associated with it. Next, we present the underlying data structure employed in the construction. We assume a 64-bit platform in this description. A 32-bit implementation can be derived by cutting each field in half, or by keeping the same structure but using a wide compare-and-swap, which writes atomically to two consecutive words.

The structure of an entry. A list entry consists of key and data fields, and a next pointer (pointing to the next entry). These fields are arranged in two words, where the key and data reside in the first word and the next pointer in the second. Three more bits are embedded in these two words. First, we embed the delete bit in the least significant bit of the next pointer, following Harris [6]. The delete bit is set to mark the logical deletion of the entry. The freeze bits are new in this design. They take one bit from each of the entry's words, and their purpose is to indicate that the entire chunk holding the entry is about to be retired. These three flags consume one bit of the key and two bits of the next pointer. Notice that the three LSBs of a pointer do not hold meaningful information on a 64-bit architecture. The entry structure is depicted in Figure 1. In what follows, we refer to the first word as the keyData word, and to the second word as the nextEntry word. We further reserve one key value, denoted by ⊥, to signify that the entry is currently not allocated. This value is not allowed as a key in the data structure. As will be discussed in Section 4, an entry is available for allocation if its key is ⊥ and its other fields are zeroed.

The structure of a chunk.
The main support for locality stems from the fact that consecutive entries are kept in a chunk, so that traversals of the list exhibit better locality. In order to keep a substantial number of entries in each chunk, the linked list makes sure that the number of entries in a chunk is always between the parameters min and max. The main part of a chunk is an array that holds the entries of the chunk and may hold up to max entries of the linked list. In addition, the chunk holds some fields that help manage the chunk. First, we keep one special entry that serves as a dummy header entry, whose next pointer points to the first entry in the chunk. The dummy header is not a must, but it simplifies the algorithm's code. To identify chunks that are too sparse, each chunk has a counter of the number of entries currently allocated in it. In the presence of concurrent mutations, this counter will not always be accurate, but it will always hold a lower bound on the number of allocated

Fig. 2. The chunk structure

entries in the chunk. When an attempt is made to insert too many entries into a chunk, the chunk is split. When it becomes too small due to deletions, it is merged with a neighboring chunk. We require max > 2·min+1, since splitting a large chunk must create two well-formed new chunks. In practice, max will be substantially larger than 2·min, to avoid frequent splits and merges. Additional fields (new, mergeBuddy and freezeState) are needed for running the splits and the merges and are discussed in Section 5. The chunk structure is depicted in Figure 2.

The structure of the entire list. The entire list consists of a list of chunks. Initially we have a head pointer pointing to an empty first chunk. We let the first chunk's min bound be 0, to allow small lists. The list grows and shrinks due to the splitting and merging of the chunks. Every chunk has a pointer nextChunk to the next chunk, or null if it is the last chunk of the list. The keys of the entries in the chunks never overlap, i.e., each chunk contains a consecutive subset of the keys in the set, and a pointer to the next chunk, which contains the next subset (with strictly higher keys) in the set. The entire list structure is depicted in Figure 3. We set the first key in a chunk as its lowest possible key. Any smaller key is inserted in the previous chunk (except for the first chunk, which can also get keys smaller than its first one).

Hazard pointers. Whole chunks and entries inside a chunk are reclaimed manually. Note that garbage collectors do not typically reclaim entries inside an array. To allow safe (and lock-free) manual reclamation of entries, we employ Michael's hazard-pointer methodology [8,10]. While a thread is processing an entry whose concurrent reclamation could foil its actions, the thread registers the location of this entry in a special pointer called a hazard pointer. Reclamation of entries that have hazard pointers referencing them is avoided.
Following Michael’s list implementation [10], each thread has two hazard pointers, denoted hp0 and hp1 that aid the processing of entries in a chunk. We further add four more hazard pointers hp2, hp3, hp4, and hp5, to handle the operations of the chunk list. Each thread only updates its own hazard pointers, though it can read the other threads’ hazard pointers.

Fig. 3. The list structure

3 Using a Freeze to Retire a Chunk

In order to maintain the minimum and maximum number of entries in a chunk, we devised a mechanism for splitting dense chunks, and for merging a sparse chunk with its predecessor. The main idea in the design of the split and merge lock-free mechanisms is the freezing of chunks. When a chunk needs to be split or merged, it is first frozen. No insertions or deletions can be executed on a frozen chunk. To split a frozen chunk, two new chunks are created and the entries of the frozen chunk are copied into them. To merge a frozen chunk with a neighbor, the neighbor is first frozen, and then one or two new chunks are allocated and the relevant entries from the two merging chunks are copied into them. Details of the freezing mechanism appear in Section 5. We now review this mechanism in order to allow the presentation of the list operations. The freezing of a chunk comprises three phases:

Initiate Freeze. When a thread decides a chunk should be frozen, it starts setting the freeze bits in all its entries one by one. During the time it takes to set all these bits, other threads may still modify the entries not yet marked as frozen. During this phase, only part of the chunk is marked as frozen, but this freezing procedure cannot be reversed, and frozen entries cannot be reused.

Stabilizing. Once all entries in a chunk are frozen, allocations and deletions can no longer be executed. At this point, we link the non-deleted entries into a list. This includes entries that were allocated but not yet connected to the list. All entries that are marked as deleted are disconnected from the list.

Recovery. The number of entries in the stabilized list is counted and a decision is made whether to split this chunk or merge it with a neighbor. Sometimes, due to changes that happen during the first phase, the frozen chunk becomes a good one that does not require a split or a join. Nevertheless, the retired chunk is never resurrected.
We always allocate a new chunk to replace it and copy the appropriate values into the new chunk. Whatever action is decided upon (split, join, or copy chunk) must be carried through. Any thread that fails to insert or delete a key due to the progress of a freeze joins in helping the freezing of the chunk. However, threads that perform a search continue to search in frozen chunks with no interference.

4 The List Operations: Search, Insert and Delete

We now turn to describe the basic linked-list operations. The high-level code for an insertion, deletion, or search of a key is very simple. Each of these operations starts by invoking the FindChunk method to find the relevant chunk. It then calls SearchInChunk, InsertToChunk, or DeleteInChunk, according to the desired operation; finally, the hazard pointers hp2, hp3, hp4, and hp5 are nullified, to release the hazard pointers set by the FindChunk method and allow future reclamation. The main challenge is in the work inside the chunk and the handling of the freeze process, on which we elaborate below. More details appear in [2]. Turning to the operations inside the chunks, the delete and search methods are close to the previous design [10], except for the special treatment of the chunk bounds and the freeze status. For lack of space they are not specified in this short paper; the details appear in [2]. However, the insert method is quite different, because it must allocate an entry in shared memory (on the chunk), whereas previously it was assumed that the insert allocates local space for a new entry and privately prepares it for insertion into the list. For the purpose of handling the entries list in the chunk, we maintain five variables that are global and appear in all the code below. These variables are global for each thread's code, but are not shared between threads; all of them follow Michael's design [10]. The first three per-thread variables are (entry** prev), (entry* cur), and (entry* next). The other two are the pointers (entry** hp0) and (entry** hp1) that point to the two hazard pointers of the thread. All other variables are local to the method that mentions them.

4.1 The Insert Operation

The InsertToChunk method inserts a key with its associated data into a chunk. It first attempts to find an available entry and allocate it with the given key. If no available entry exists, a split is executed and the operation is retried. If an entry is obtained, the InsertEntry method is invoked to insert the entry into the list. The insertion fails if the key already exists in the chunk; in that case InsertToChunk clears the entry, freeing it for future allocations. The InsertToChunk code is presented in Algorithm 1. It starts with an attempt to find an available entry for allocation. A failure occurs when all entries are in use, and in this case a freeze is initiated. The Freeze method gets the key and data as input, and also an input indicating that it is invoked by an insertion operation. This allows the Freeze method to try to insert the key into the newly created chunk. When successful, it returns a null pointer to indicate the completion of the insertion. It also sets a local variable result to indicate whether the completed insertion actually inserted the key or completed by finding that the key already existed in the list (which is also a legitimate completion of the insertion operation). If the insertion is not completed by the Freeze method, then it returns a pointer to the chunk on which the insertion should be retried. Connecting the entry to the list is done by InsertEntry. If the entry gets allocated and linked to the list, then the chunk counter is incremented only by


Algorithm 1. Insert a key and its associated data into a chunk

Bool InsertToChunk(chunk* chunk, key, data) {
 1:   current = AllocateEntry(chunk, key, data);      // Find an available entry
 2:   while ( current == null ) {       // No available entry. Freeze and try again
 3:     chunk = Freeze(chunk, key, data, insert, &result);
 4:     if ( chunk == null ) return result;    // Freeze completed the insertion
 5:     current = AllocateEntry(chunk, key, data);    // Otherwise, retry allocation
 6:   }
 7:   returnCode = InsertEntry(chunk, current, key);
 8:   switch ( returnCode ) {
 9:     case success_this:
10:       IncCount(chunk); result = true; break;      // Increase the chunk's counter
11:     case success_other:             // Entry was inserted by another thread
12:       result = true; break;         // due to help in freeze
13:     case existed:                   // This key exists in the list. Reclaim entry
14:       if ( ClearEntry(chunk, current) )           // Attempt to clear the entry
15:         result = false;
16:       else              // Failure to clear the entry implies that a freeze thread
17:         result = true;  // eventually inserts the entry
18:       break;
19:   }  // end of switch
20:   *hp0 = *hp1 = null; return result;    // Clear all hazard pointers and return
}

the thread that linked the entry itself. If the key already existed in the list, then ClearEntry attempts to clear the entry for future reuse. However, a rare scenario may foil the clearing of the entry: the other occurrence of the key (which existed previously in the list) gets deleted before our entry gets cleared, and furthermore a freeze occurs, in which the semi-allocated entry gets linked by other threads into the new chunk's list. At this point, clearing this entry is avoided, and ClearEntry returns false. In such a scenario, clearing the entry fails and the insert operation succeeds. At the end of InsertToChunk, all hazard pointers are cleared and we return with a code specifying whether the insert was successful or the key previously existed in the list.

The allocation of an available entry is executed using the AllocateEntry method (depicted in [2]). An available entry contains ⊥ as a key and zeros elsewhere. An available entry is allocated by assigning the key and data values to the keyData word in a single atomic compare-and-swap (CAS) that assumes this word holds the ⊥ symbol and zeros. An entry whose keyData word has the freeze bit set cannot be allocated, as it is not properly zeroed. Note also that once an entry is allocated, all the information required for linking it to the list is available to all threads. Thus, if a freeze starts, all threads may create a stabilized list of the allocated entries in a chunk. The AllocateEntry method searches for an available entry; if no free entry can be found, null is returned.


Algorithm 2. Connecting an allocated entry into the list

returnCode InsertEntry(chunk* chunk, entry* entry, key) {
 1:   while ( true ) {
 2:     savedNext = entry→next;
 3:     // Find insert location and pointers to previous and current entries (prev, cur)
 4:     if ( Find(chunk, key) )                 // This key existed in the list
 5:       if ( entry == cur ) return success_other; else return existed;
 6:     // If neighborhood is frozen, keep it frozen
 7:     if ( isFrozen(savedNext) ) markFrozen(cur);    // cur will replace savedNext
 8:     if ( isFrozen(cur) ) markFrozen(entry);        // entry will replace cur
 9:     // Attempt linking into the list. First attempt setting next field
10:     if ( !CAS(&(entry→next), savedNext, cur) ) continue;
11:     if ( !CAS(prev, cur, entry) ) continue;        // Attempt linking
12:     return success_this;                           // both CASes were successful
13:   }
}

Next comes the InsertEntry method, which takes an allocated entry and attempts to link it to the linked list. The InsertEntry code is presented in Algorithm 2. The input parameter entry is a pointer to an entry that should be inserted; it is already allocated and initialized with the key and data. Before searching for the location to which to connect this entry, we memorize the entry's next pointer. Normally, this is null, but in the presence of concurrent executions of InsertEntry (which may happen during a freeze), we must make sure later that the entry's next pointer was not modified before we atomically write it in Line 10. After saving the current next pointer, we search for the entry's location via the Find method. If the key already exists in the list, InsertEntry checks whether the returned entry is the same as the one it is trying to insert (by address comparison). The result determines the return code: either the key existed and we failed, or the key was inserted, but not by the current thread. (This can happen during a freeze, when all threads attempt to stabilize the frozen list.) Otherwise, the key does not exist, and Find sets the global variable cur to point to the entry that should follow our entry in the list, and the global variable prev to the pointer that should reference our entry. The Find method protects the entries referenced by prev and cur with the hazard pointers hp1 and hp0, respectively. There is no need to protect the newly allocated entry, because it cannot be reclaimed by a different thread. If any to-be-modified pointer is marked as frozen, we make sure that its replacement is marked as frozen as well. An allocation of an entry can never occur on a frozen entry. However, once the allocation is successful, the new entry may become frozen, and InsertEntry should still connect it to the list. Finally, two CASes are used to link the entry to the list. Whenever a CAS fails, the insertion starts from scratch.

Locality-Conscious Lock-Free Linked Lists


Algorithm 3. The main freeze method

chunk* Freeze(chunk* chunk, key, data, triggerType tgr, Bool* res) {
 1:  CAS(&(chunk→freezeState), no_freeze, internal_freeze);
 2:  // At this point, the freeze state is either internal_freeze or external_freeze
 3:  MarkChunkFrozen(chunk);
 4:  StabilizeChunk(chunk);
 5:  if ( chunk→freezeState == external_freeze ) {
 6:    // This chunk was marked external_freeze before Line 1 executed.
 7:    master = chunk→mergeBuddy;      // Get the master chunk
 8:    // Fix the buddy's mergeBuddy pointer.
 9:    masterOldBuddy = combine(null, internal_freeze);
10:    masterNewBuddy = combine(chunk, internal_freeze);
11:    CAS(&(master→mergeBuddy), masterOldBuddy, masterNewBuddy);
12:    return FreezeRecovery(chunk→mergeBuddy, key, data, merge, chunk, tgr, res);
13:  }
14:  decision = FreezeDecision(chunk);  // The freeze state is internal_freeze
15:  if ( decision == merge ) mergePartner = FindMergeSlave(chunk);
16:  return FreezeRecovery(chunk, key, data, decision, mergePartner, tgr, res);
}

5 The Freeze Procedure

We now provide more details about the freeze procedure; the full description is presented in [2]. The freezing process occurs when the number of entries in a chunk exceeds its bounds. At this point, splitting or merging happens by copying the relevant keys (and data) into a newly allocated chunk (or chunks). The process comprises three phases: initiation, stabilization, and recovery.

The code for the Freeze method is presented in Algorithm 3. The input parameters are the chunk that needs to be frozen, the key, the data, and the event that triggered the freeze: insert, delete, enslave (if the freeze was called to prepare the chunk for a merge with a neighboring chunk), or none (if the freeze is called while clearing an entry). The freeze will attempt to execute the insertion, deletion, or enslaving and will return a null pointer when successful. It will also set an input Boolean flag to indicate the return code of the relevant operation. When unsuccessful, it will return a pointer to the new chunk on which the operation should be retried.

The Freeze method starts with an attempt to atomically change the freeze state from no_freeze to internal_freeze. The freeze state of a chunk is normally no_freeze and is switched to internal_freeze when a freeze process of the chunk begins. But it can also be external_freeze, when a neighbor has requested a freeze on this chunk to allow a merge between the two. Thus, an external freeze can start even when no size violation is detected in this chunk. Whether or not the modification succeeds, we know that the freeze state can no longer be no_freeze: it is either internal_freeze or external_freeze. The Freeze method then calls MarkChunkFrozen to mark each


A. Braginsky and E. Petrank

entry in the chunk as frozen, and StabilizeChunk to finish stabilizing the entries list in the chunk. At this point, the entries in the chunk cannot be modified anymore. Freeze then checks whether the freeze is external or internal. An external freeze can occur when a freeze is concurrently executed on the next chunk and it has already enslaved the current chunk as its merge buddy. In this case, we cooperate with the joint freeze and joint recovery. When the state of the freeze is external, the current chunk must have its mergeBuddy pointer already pointing to the chunk that initiated the merge, denoted the master. To finish this freeze, we make sure that the master has its merge buddy properly pointing back at the current chunk. The master chunk's mergeBuddy pointer must be either null or already pointing to the buddy we found; thus it is enough to use one CAS command to verify that it is not null. Finally, we execute the recovery phase on the master chunk and return its output. We do not need to check the decision about the freeze of the buddy: it must be a merge.

If the freeze is internal, then we invoke FreezeDecision to see what should be done next (Line 14). If the decision is to merge, then we find the previous chunk and "enslave" it for a joint merge using the FindMergeSlave method. Finally, the FreezeRecovery method is called to complete the freeze process. Next, we explain each of the stages. The full details, including pseudo-code, appear in [2].

Marking the chunk as frozen. The MarkChunkFrozen method simply goes over the entries one by one and marks each one as frozen. The setting of the freeze flags is atomic and is retried repeatedly until successful. By the end of this process all entries (including the free ones) are marked as frozen.

Stabilizing the chunk. After all the entries in the chunk are marked as frozen, new entries cannot be allocated and existing entries cannot be marked as deleted.
However, the frozen chunk may contain allocated entries that were not yet linked, and entries that were marked as deleted but not yet disconnected. The StabilizeChunk method disconnects all deleted entries and links all allocated ones. It uses the Find method to disconnect all entries that are marked as deleted. Such entries do not need to be reclaimed (when marked as frozen), but they should not be copied to the new chunk. Next, StabilizeChunk attempts to connect entries: it goes over all entries and searches for ones that are disconnected but neither reclaimed nor deleted. Each such entry is linked to the list by invoking InsertEntry, which will fail only if the key already exists in a different entry in the chunk's list; in that case, the entry should indeed not be connected to the stabilized list.

Reaching a decision. After stabilizing the chunk, everything is frozen, the list is completely connected, and nothing changes in the chunk anymore. At this point, we need to decide whether splitting or merging is required. To that end, a count of the entries is performed and a decision is made by comparing it to min and max. It may happen that the resulting count is higher than min and lower than max, in which case no split or merge is required. Nevertheless, the frozen chunk is never resurrected; instead, we copy it to a new chunk in the (upcoming) recovery stage.
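The per-entry marking step described above (an atomic flag set, retried until it sticks) can be sketched with a CAS retry loop. The bit layout and names below are illustrative, not the paper's exact encoding: we assume each entry's next word carries DELETED and FROZEN flags in otherwise-unused low bits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag layout inside an entry's next word. */
#define DELETED_BIT ((uintptr_t)1)
#define FROZEN_BIT  ((uintptr_t)2)

/* Set the frozen bit, retrying until it is visible, without
 * disturbing whatever else the word holds. */
static void mark_frozen(_Atomic uintptr_t *word) {
    uintptr_t old = atomic_load(word);
    while (!(old & FROZEN_BIT)) {
        if (atomic_compare_exchange_weak(word, &old, old | FROZEN_BIT))
            return;
        /* CAS failure reloaded 'old' with the current value; retry. */
    }
    /* Already frozen: nothing to do (the operation is idempotent). */
}
```

The CAS (rather than a plain OR-store on machines without one) guarantees that a concurrent update of the word can never erase an already-set freeze flag.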


Making the recovery. Once a decision is reached, the recovery starts. The recovery procedure allocates a chunk (or two) and copies the relevant information into the new chunk (or chunks). If a merge is involved, the previous chunk in the list is first frozen (under an external freeze) and both chunks bring entries to the merge. Several threads may perform the freeze procedure concurrently, but all of them will make the same recovery decision about the freeze, as the frozen stabilized chunk looks the same to all threads.

A thread that performs the recovery creates a local chunk (or chunks) into which it copies the relevant entries. At this point all threads create the same new chunk (or chunks). But now, each thread performs the operation with which it initiated the freeze on the new chunks. This can be an insert, a delete, or an enslave. Performing the operation is easy because the new chunks are local to this thread and no race can occur. (Enslaving a chunk is simply done by modifying its freeze state from no_freeze to external_freeze and registering the merge buddy.) But whether the local operation becomes visible in the data structure is determined by whether the thread succeeds in creating a link to its new chunks in the frozen chunk, as explained next.

After creating the new chunks locally and executing the original operation on them, the thread attempts to atomically install the address of its local chunk into a dedicated pointer (new) in the frozen chunk. When two chunks are created, the second one is locally linked to the first one by the nextChunk field. If the installation is successful, then this thread has also completed the operation it was performing (insert, delete, or enslave). If the installation is unsuccessful, then a different thread has already completed the installation of new chunks, and this thread's local new chunks will not be used (i.e., they can be reclaimed). In this case, the thread must try its operation again from scratch.
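The install-or-discard race on the frozen chunk's dedicated pointer reduces to a single CAS from null. The sketch below is illustrative: the struct layout and the names new_chunk and install_recovery are assumptions, not the paper's chunk definition.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct chunk {
    _Atomic(struct chunk *) new_chunk;  /* dedicated install pointer ('new') */
} chunk;

/* Each recovering thread offers its locally built chunk 'mine'.
 * Returns the chunk that actually got installed: 'mine' if this thread
 * won the race, or the winner's chunk if it lost (in which case 'mine'
 * can be reclaimed and the original operation must be retried). */
static chunk *install_recovery(chunk *frozen, chunk *mine) {
    chunk *expected = NULL;
    if (atomic_compare_exchange_strong(&frozen->new_chunk, &expected, mine))
        return mine;        /* won: this thread's operation is now visible */
    return expected;        /* lost: 'expected' now holds the winner       */
}
```

Because every thread builds its replacement from the same stabilized frozen chunk, losers lose nothing semantically except the operation they piggybacked on the copy, which is exactly what they retry.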
Depending on the number of live entries in the frozen chunk, there are three ways to recover from the freeze.

Case I: min < count < max. In this case, the required action is to allocate a new chunk and copy all of the entries from the frozen chunk to the new chunk. Next, we perform the insert, delete, or enslave operation on the local new chunk and attempt to link it to the frozen one.

Case II: count == min. In this case we need to merge the frozen chunk with its previous chunk. We assume that the previous chunk has already been frozen by an external freeze before the recovery is executed, and that the freeze states in both chunks are properly set so that no thread can interfere with the freeze process. We start by checking the overall number of entries in these two chunks, to decide whether the merged entries will fit into one or two chunks. We then allocate a second new chunk, if needed, and perform the (local) copy into the new chunk or chunks. When copying into two new chunks, we split the entries evenly and return the smallest key of the second chunk as the separating key. As before, we perform the original operation that started the freeze and try to create a link from the old chunk to the new chunk or chunks.


Case III: count == max. In this case we need to split the old chunk into two new chunks. The basic operations of this case resemble those of the previous cases. We allocate two new chunks, perform the split locally, perform the original operation, and attempt to link the new chunks to the old one.
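The three-way case split above can be captured in a few lines. This is a sketch of the decision logic only; the enum names are illustrative, and min/max are the chunk's fixed bounds from the paper (a stabilized chunk's count always satisfies min <= count <= max).

```c
#include <assert.h>

typedef enum { COPY, MERGE, SPLIT } decision;

/* Decide the recovery action from the live-entry count of a
 * stabilized frozen chunk. */
static decision freeze_decision(int count, int min, int max) {
    if (count == min) return MERGE;  /* Case II: merge with previous chunk  */
    if (count == max) return SPLIT;  /* Case III: split into two new chunks */
    return COPY;  /* Case I: in range, but the frozen chunk is still copied */
}
```

Note that COPY still allocates a fresh chunk: as the text stresses, a frozen chunk is never resurrected.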

6 Conclusion

We have presented chunking and freezing mechanisms that build a cache-conscious lock-free linked list. Our list consists of chunks, each containing consecutive list entries. Thus, a traversal of the list stays mostly within a chunk's boundary (a virtual page or a cache line), and therefore the traversal incurs fewer page faults (or cache misses) than a traversal of randomly allocated nodes, each containing a single entry. Maintaining a linked list in chunks is often used in practice (e.g., [1,5]), but a lock-free implementation of a cache-conscious linked list has not been available heretofore. We believe that the building blocks of this list, i.e., the chunks and the freeze operation, can be used for building additional data structures, such as lock-free hash tables, and others.

References

1. Unrolled Linked Lists, http://blogs.msdn.com/devdev/archive/2005/08/22/454887.aspx
2. Full Version of Locality-Conscious Lock-Free Linked Lists, http://www.cs.technion.ac.il/~erez/Papers/lf-linked-list-full.pdf
3. Fomitchev, M., Ruppert, E.: Lock-free linked lists and skip lists. In: Proc. PODC (2004)
4. Fraser, K.: Practical lock-freedom. Technical Report UCAM-CL-TR-579, University of Cambridge, Computer Laboratory (February 2004)
5. Frias, L., Petit, J., Roura, S.: Lists revisited: Cache-conscious STL lists. J. Exp. Algorithmics 14, 3.5-3.27 (2009)
6. Harris, T.L.: A pragmatic implementation of non-blocking linked-lists. In: Proc. DISC (2001)
7. Herlihy, M.: Wait-free synchronization. ACM TOPLAS (1991)
8. Michael, M.M.: High performance dynamic lock-free hash tables and list-based sets. In: Proc. SPAA (2002)
9. Michael, M.M.: Safe memory reclamation for dynamic lock-free objects using atomic reads and writes. In: Proc. PODC (2002)
10. Michael, M.M.: Hazard pointers: Safe memory reclamation for lock-free objects. IEEE TPDS (June 2004)
11. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann, San Francisco (2008)
12. Valois, J.D.: Lock-free linked lists using compare-and-swap. In: Proc. PODC (1995)
13. Treiber, R.K.: Systems programming: Coping with parallelism. Research Report RJ 5118, IBM Almaden Research Center (1986)

Specification and Constant RMR Algorithm for Phase-Fair Reader-Writer Lock

Vibhor Bhatt and Prasad Jayanti
Department of Computer Science, Dartmouth College, NH, USA

Abstract. Brandenburg and Anderson [1,2] recently introduced a phase-fair readers/writers lock, where read and write phases alternate: when the writer leaves the CS, any waiting reader will enter the CS before the next writer enters the CS; similarly, if a reader is in the CS and a writer is waiting, any new reader that now enters the Try section will not enter the CS before some writer enters the CS. Thus, neither class of processes—readers or writers—has priority over the other, and no process starves. Brandenburg and Anderson [1,2] informally specify a phase-fair lock and present an algorithm to implement it with O(n) remote memory reference (RMR) complexity, where n is the number of processes in the system. In this work we give a rigorous specification of a phase-fair lock and present an algorithm that implements it with O(1) RMR complexity.

1 Introduction

Mutual exclusion [3] is a well-studied, fundamental problem in distributed computing. Here processes repeatedly cycle through four sections of code—Remainder Section, Try Section, Critical Section (CS), and Exit Section—in that order, and the problem consists of designing the code for the Try and Exit sections so that the mutual exclusion property—at most one process is in the CS at any time—is satisfied.

Readers/Writers Exclusion [4] is a well-known variant of Mutual Exclusion, commonly used in operating systems and in parallel applications to implement shared data structures. In Readers/Writers Exclusion, processes are divided into two classes—readers and writers—and the exclusion property is revised to allow for more concurrency: multiple readers can be in the CS at the same time, although no process may be in the CS at the same time as a writer. Starting from the earliest paper [4], most works on Readers/Writers Exclusion studied the problem in three natural variants—one in which readers have priority over writers, one in which the writers have priority, and one in which neither class of processes—readers or writers—has priority over the other. This work deals with the third variant.

When neither class has priority, Brandenburg and Anderson [1,2] suggested a desirable property, which they called the phase-fairness property, that requires read and write phases to alternate: when the writer leaves the CS, any waiting reader will enter the CS before the next writer enters the CS; similarly, if a reader is in the CS and a writer is waiting, any new reader that now enters the Try section will not enter the CS before some writer enters the CS. Their algorithm to realize this property has O(n) remote memory reference complexity (RMR complexity), where n is the number of

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 119-130, 2011.
© Springer-Verlag Berlin Heidelberg 2011


processes in the system. (A memory reference to a shared variable X by a processor p is considered remote in Cache-Coherent (CC) machines if X is not in p's cache; it is considered remote in Distributed Shared Memory (DSM) machines if X is at a memory module of a different processor. The RMR complexity of an algorithm is the worst-case number of remote memory references that a process makes in order to execute the Try and Exit sections once.)

Our paper makes two contributions. Brandenburg and Anderson stated the phase-fairness property only informally, and did not include all of the elements that one would intuitively associate with phase-fairness. Our first contribution is a more comprehensive and rigorous specification of the phase-fairness property. Our second contribution is an algorithm that achieves this property with O(1) RMR complexity on CC machines (see footnote 1), in contrast to their O(n) RMR complexity algorithm.

The rest of the paper is organized as follows. Section 2 describes the model and states the basic properties required by the Reader-Writer Exclusion problem, followed by a rigorous formulation of the phase-fairness properties. Section 3 describes related work. The last two sections contain the algorithms. Section 4 describes the single-writer algorithm satisfying the phase-fairness properties; this algorithm and its description are taken almost verbatim from [5]. Section 5 shows how to transform the single-writer algorithm into a multi-writer algorithm satisfying the basic and phase-fairness properties. Proofs are omitted due to space constraints.

2 Model and Specification of the Reader-Writer Problem

The system consists of a set of n asynchronous processes {p0, ..., pn−1}, communicating with each other through shared variables that support the atomic operations read, write, and fetch&add (F&A). Each process is labeled a reader or a writer (see footnote 2), and its program is a loop that consists of two sections of code—Try section and Exit section. We say a process is in the Remainder section if its program counter is at the first statement of the Try section; and it is in the Critical Section (CS) if its program counter is at the first statement of the Exit section. The Try section, in turn, consists of two code fragments—a doorway, followed by a waiting room—with the requirement that the doorway is bounded "straight line" code [6]. Intuitively, a process "registers" its request for the CS by executing the doorway, and then busy-waits in the waiting room until it has the "permission" to enter the CS. Initially, all processes are in their Remainder section.

Each execution of the Try and Exit sections by a process is called an attempt; it is a read attempt (respectively, write attempt) if the process is a reader (respectively, writer). An attempt by a process p spans from a time t when p executes the first statement of its Try section to the earliest time t′ > t when p completes the Exit section. An attempt A is active in a configuration C of a run if A starts before C and does not complete before C. The following definitions of "precedence" and "enabled" will be useful for defining fairness properties in the next section. If A is an attempt by a process p, henceforth we write "A completes the doorway at time t" as a shorthand for "p completes the doorway at time t during the attempt A."

Footnote 1: For DSM machines, Danek and Hadzilacos' lower bound proof for 2-Session Group Mutual Exclusion implies that there is no O(1) RMR complexity algorithm for Readers/Writers Exclusion.
Footnote 2: Our algorithms work even when this labeling is not static, i.e., when the same process performs read attempts sometimes and write attempts some other times. We assume static labeling only for simplicity.

Definition 1. If A and A′ are any two attempts in a run (possibly by different processes), A doorway precedes A′ if A completes the doorway before A′ begins the doorway. A and A′ are doorway concurrent if neither doorway precedes the other.

Definition 2. A process p is enabled to enter the CS in configuration C if p is in the Try section in C and there is an integer b such that, in all runs from C, p enters the CS in at most b of its own steps.

The Phase-Fair Reader-Writer Problem is to design the code for the Try and the Exit sections so that properties P1 through P5 stated below and the properties PF1 and PF2 of Subsection 2.2 hold in all runs of the algorithm.

– (P1) Mutual Exclusion: If a writer is in the CS at any time, then no other process is in the CS at that time.
– (P2) Bounded Exit: There is an integer b such that in every run, every process completes the Exit section in at most b of its steps.
– (P3) First-Come-First-Served (FCFS) among writers: If w and w′ are any two write attempts in a run and w doorway precedes w′, then w′ does not enter the CS before w.
– (P4) First-In-First-Enabled (FIFE) among readers: Let r and r′ be any two read attempts in a run such that r doorway precedes r′. If r′ enters the CS before r, then r is enabled to enter the CS at the time r′ enters the CS.
– (P5) Concurrent Entering: Informally, if all writers are in the Remainder section, readers should be able to enter the CS in a bounded number of steps.
More precisely: there is an integer b such that, if σ is any run from a reachable configuration such that all writers are in the Remainder section in every configuration in σ, then every read attempt in σ executes at most b steps of the Try section before entering the CS.

Finally, we state the liveness property. When one class of processes (e.g., readers) has priority over the other class (e.g., writers), the starvation of processes belonging to the lower-priority class is unavoidable. Therefore, instead of starvation-freedom, we require the weaker livelock-freedom property, which is appropriate in all three cases of reader-priority, writer-priority, and no-priority. Livelock-freedom guarantees that, under the standard assumption that no process crashes in the middle of the Try, CS, or Exit section, some process in the Try section will eventually enter the CS and some process in the Exit section will eventually enter the Remainder section.

– (P6) Livelock-freedom: If no process crashes in an infinite run, then infinitely many attempts complete in that run.


2.1 Reader-Priority and Writer-Priority Formulations

In a recent paper [5] we presented the reader- and writer-priority formulations and presented constant RMR algorithms for those cases. This submission studies the no-priority formulation, described next.

2.2 Phase-Fairness Properties

When neither readers nor writers have priority over the other, a most natural additional property to require is starvation-freedom—that no reader or writer gets stuck forever in the Try or Exit section. However, Brandenburg and Anderson [1,2] have pointed out that we could demand more—that readers and writers take turns fairly, while still allowing for concurrency (by enabling multiple readers to cohabit the CS). Specifically, if readers are waiting when a writer leaves the CS, then all such waiting readers should be allowed to enter the CS before the next writer may enter the CS, i.e., the "session" should switch from being a "write session" to a "read session." Likewise, if a read session is in progress and one or more writers are waiting, then no new readers should be allowed into the CS. These "fair switching" properties were stated informally in Brandenburg and Anderson's work, and we formulate them rigorously below.

– (PF1) Fair switch from writer to readers: If at some time in a run a write attempt w is in the CS and a read attempt r is in the waiting room, then r enters the CS before any write attempt w′ ≠ w enters the CS in the future.
– (PF2) Fair switch from readers to writer: If at time t a read attempt is in the CS and a write attempt is in the waiting room, then some write attempt enters the CS in the future before any read attempt initiated after t enters the CS.

Our quest is to identify properties that are desirable in any algorithm that aims to ensure fairness between readers and writers. In this quest, the two properties stated above may be considered necessary, but they are surely not sufficiently strong.
To see this, consider a scenario where a writer w is in the CS while a set W of writers and a set R of readers are in the waiting room. When w leaves the CS, the first property blocks writers from entering the CS until all readers in R enter the CS, but it makes no guarantee about how quickly these waiting readers enter the CS. In particular, even after w completes the Exit section and goes back to the Remainder section, the writers in W may temporarily block the readers in R from entering the CS without violating the above properties. So we state a stronger property below that guarantees that, once w's writing session is over, every reader in R will be able to enter the CS in a bounded number of its own steps. We consider w's writing session to be over as soon as either of the following two events happens: (i) w goes back to the Remainder section, or (ii) some reader or writer enters the CS after w leaves the CS.

– (PF3) Fast switch from writer to readers: Suppose that at some time t a write attempt w is in the CS and a read attempt r is in the waiting room, and t′ > t is the earliest time when w is completed or some attempt a ≠ w is in the CS. Then, at time t′, either r is in the CS or r is enabled to enter the CS.


The next lemma states that this property is stronger than the "Fair switch from writer to readers" property stated earlier.

Lemma 1. If an algorithm satisfies Mutual Exclusion and Fast switch from writer to readers, then it satisfies Fair switch from writer to readers.

3 Related Work

Courtois et al. first posed and solved the Readers/Writers problem [4]. Mellor-Crummey and Scott's algorithms [7] and their variants [8] are queue-based; they have constant RMR complexity, but do not satisfy Concurrent Entering. Anderson's algorithm [1] and Danek and Hadzilacos' algorithm [9] satisfy Concurrent Entering, but they have O(n) and O(log n) RMR complexity, respectively, where n is the number of processes. The first O(1) RMR Reader-Writer lock algorithm satisfying Concurrent Entering (P5) was designed by Bhatt and Jayanti [5]. That work studied all three variants of the problem—the reader-priority, writer-priority, and starvation-free cases.

Brandenburg and Anderson's recent work [1,2], which is most closely related to this paper, noted that Mellor-Crummey and Scott's queue-based algorithm [7] limits concurrency because of its strict adherence to a FIFO order among all waiting readers and writers. For instance, if the queue contains a writer between two readers, then the two readers cannot be in the CS together. To overcome this drawback, Brandenburg and Anderson proposed "phase-fairness," which requires readers and writers to take turns in a fair manner. The fair and fast switch properties (PF1-PF3) stated in Section 2 are intended to rigorously capture the informal requirements stated in their work. Their algorithm in [1] has O(n) RMR complexity and satisfies PF3 (and hence also PF1), but not PF2. Their phase-fair queue-based algorithm in [2] has constant RMR complexity and satisfies PF1, but not PF2 or PF3; it also fails to satisfy Concurrent Entering (P5). An algorithm in [5] has constant RMR complexity and satisfies all of PF1, PF2, and PF3 (and all of P1-P5), but it supports only a single writer. We use this algorithm as a building block to design a constant RMR algorithm that supports multiple writers and readers and satisfies all the properties from Section 2.
As with Brandenburg and Anderson's algorithm in [1], our algorithm also uses fetch&add primitives (see footnote 3).

4 Single-Writer Algorithm Satisfying Phase-Fair Properties

In this section, we present an algorithm that supports only a single writer and multiple readers, and satisfies the phase-fair properties ((P1)-(P6) and (PF2), (PF3)) and an additional property called writer priority, defined as follows.

– (WP1) Writer Priority: If a write attempt w doorway precedes a read attempt r, then r does not enter the CS before w (see footnote 4).

4

Without the use of synchronization instructions like fetch&add, it is well known that constant RMR complexity even for the standard mutual exclusion problem is not possible [10,11,12]. This property by itself implies PF2.

124

V. Bhatt and P. Jayanti

This algorithm is very similar to the algorithm presented in Section 5 of [5], and the description given in this section is taken almost verbatim from Section 5.1 of [5]. We encourage a full reading of Section 5 of [5] to understand the final algorithm given in Section 5 of this paper.

The overall idea is as follows. The writer can enter the CS from two sides, 0 and 1. It never changes its side during one attempt of the CS, and it toggles its side for every new attempt. To enter from a certain side, say 1, the writer sets the shared variable D to 1. Then it waits for the readers from the previous side (in this case side 0) to exit the Critical and the Exit sections. The last reader to exit from side 0 lets the writer into the CS. Once the writer is done with the CS from side 1, it lets the readers waiting on side 1 into the CS, using the variable Gate described later.

The readers in their Try section set their side d equal to D. Then they increment their count in side d and attempt to enter the CS from side d. To enter the CS from side d, they busy-wait on the variable Gate[d] until it is true. When the readers are exiting, they decrement their count from side d, and the last exiting reader wakes up the writer. Now we describe the shared variables used in the algorithm.

procedure Write-lock()
    REMAINDER SECTION
 1. prevD ← D
 2. currD ← ¬prevD              // toggle the side for this attempt
 3. D ← currD
 4. Permit[prevD] ← false
 5. if ( F&A(C[prevD], [1, 0]) ≠ [0, 0] )
 6.     wait till Permit[prevD]
 7. F&A(C[prevD], [−1, 0])
 8. Gate[prevD] ← false
 9. ExitPermit ← false
10. if ( F&A(EC, [1, 0]) ≠ [0, 0] )
11.     wait till ExitPermit
12. F&A(EC, [−1, 0])
    CRITICAL SECTION
13. Gate[currD] ← true

procedure Read-lock()
    REMAINDER SECTION
14. d ← D
15. F&A(C[d], [0, 1])
16. d′ ← D
17. if ( d ≠ d′ ) {
18.     F&A(C[d′], [0, 1])
19.     if ( F&A(C[d], [0, −1]) = [1, 1] )
20.         Permit[d] ← true
21.     d ← d′ }
22. wait till Gate[d]
    CRITICAL SECTION
23. F&A(EC, [0, 1])
24. if ( F&A(C[d], [0, −1]) = [1, 1] )
25.     Permit[d] ← true
26. if ( F&A(EC, [0, −1]) = [1, 1] )
27.     ExitPermit ← true

Fig. 1. Single-Writer Multi-Reader algorithm satisfying Starvation-Freedom and Writer-Priority. The doorway of Write-lock comprises Lines 1-3. The doorway of Read-lock comprises Lines 14-21.

4.1 Shared Variables and Their Purpose

All the shared variable names start with an upper-case letter and the local variables start with a lower-case letter.

D: A single-bit read/write variable written only by the writer. This variable denotes the side from which the writer wants to attempt the CS.


Gate[d] (see footnote 5): Boolean read/write variable written only by the writer. Gate[d] denotes whether side d is open for the readers to enter the CS. Before entering the CS, a reader has to wait till Gate[d] = true (the gate of side d is open), where d is the side from which the reader is attempting the CS.

Permit[d]: Boolean read/write variable written and read by both readers and the writer. The writer busy-waits on Permit[d] to get the permission from the readers (from side d) to enter the CS. The idea is that the last reader to exit side d will wake up the writer using Permit[d].

ExitPermit: Boolean read/write variable written and read by both readers and the writer. It is similar to Permit, with the difference that it is used by the writer to wait for all the readers to leave the Exit section.

C[d], d ∈ {0, 1}: A fetch&add variable read and updated by both the writer and readers. C[d] has two components [writer-waiting, reader-count] (see footnote 6). writer-waiting ∈ {0, 1} denotes whether the writer is waiting for the readers from side d to leave the CS; it is updated only by the writer. reader-count denotes the number of readers currently registered in side d.

EC: A fetch&add variable read and updated by both the writer and readers. Similar to C[d], it has two components [writer-waiting, reader-count]. writer-waiting ∈ {0, 1} denotes whether the writer is waiting for the readers to complete the Exit section. reader-count denotes the number of readers currently in the Exit section.

The following theorem summarizes the properties of this algorithm.

Theorem 1 (Single-Writer Multi-Reader Phase-Fair Lock). The algorithm in Figure 1 implements a Single-Writer Multi-Reader lock satisfying the properties (P1)-(P6) and (PF2), (PF3). The RMR complexity of the algorithm in the CC model is O(1). The algorithm uses O(1) shared variables that support read, write, and fetch&add.
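Since both components of C[d] (and EC) live in a single word (footnote 6), a two-component F&A such as F&A(C, [1, 0]) is just an ordinary fetch&add with a suitably shifted constant. The sketch below assumes an illustrative layout (writer-waiting in the upper 32 bits, reader-count in the lower 32 bits) and hypothetical names; the paper does not fix a particular encoding.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define WRITER_UNIT ((uint64_t)1 << 32)           /* the delta [1, 0] */
#define READER_UNIT ((uint64_t)1)                 /* the delta [0, 1] */
#define WRITER(x)   ((uint32_t)((x) >> 32))       /* writer-waiting   */
#define READERS(x)  ((uint32_t)((x) & 0xffffffffu)) /* reader-count   */

/* F&A returning the OLD value, like the paper's F&A(C, [a, b]).
 * A negative delta (e.g. [0, -1]) wraps correctly in unsigned
 * arithmetic as long as the fields never underflow. */
static uint64_t faa(_Atomic uint64_t *c, int64_t delta) {
    return atomic_fetch_add(c, (uint64_t)delta);
}
```

With this encoding, the "last reader" test of Lines 24-25 is simply checking that the old value decodes to [1, 1]: the writer's bit is set and this reader's registration was the only one left.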

5 Multi-Writer Multi-Reader Phase-Fair Algorithm

In this section we describe how we construct our multi-writer multi-reader phase-fair lock using the single-writer writer-priority lock given in Figure 1. We denote this single-writer lock by SW in the rest of this section. In all the algorithms we discuss in this section, the code of the Read-lock() procedure is identical to that of SW. The writers, on the other hand, use a Mutual Exclusion lock to ensure that only one writer accesses the underlying SW. More precisely, a writer first needs to enter the CS of the Mutual Exclusion lock and then compete with the readers in the single-writer protocol from Figure 1. The Mutual Exclusion lock we use was designed by T. Anderson [13]. It is a constant RMR Mutual Exclusion lock satisfying P3 and P6. We use the procedures acquire(M) and release(M) to denote the Try and the Exit sections of this lock.


Footnote: The algorithm given in Section 5 of [5] had only one Gate ∈ {0, 1} variable; the value of Gate at any time denoted the side open for the readers. The change to per-side Gate[d] variables is required to make the final algorithm in Section 5 of this paper (which transforms this single-writer algorithm into a multi-writer algorithm) work correctly.

Footnote: Both components of C[d] (and EC) are stored in a single word.


V. Bhatt and P. Jayanti

Notations used in this section: We write SW-Write-try (respectively, SW-Read-try) for the Try section code of the writer (respectively, reader) in the single-writer algorithm given in Figure 1. Similarly, we write SW-Write-exit (respectively, SW-Read-exit) for the Exit section code of the writer (respectively, reader). We first present a simple but incorrect multi-writer algorithm in Figure 2. This algorithm is exactly the multi-writer starvation-free algorithm from [5]. The readers simply execute SW-Read-try followed by SW-Read-exit. The writers first obtain a mutual exclusion lock M, then execute SW-Write-try followed by SW-Write-exit, and finally release M. As far as the underlying single-writer protocol is concerned, there is only one writer executing at any time, and it executes exactly the same steps as in the multi-writer version. Hence one can easily see that the algorithm satisfies (P1)-(P6). In fact, it also satisfies fast switch from writer to readers (PF3): say a writer is in the CS (with current direction d); all the readers in the waiting room are waiting for Gate[d] to open, so when the writer opens Gate[d] in SW-Write-exit(), all the waiting readers become enabled.

procedure Write-lock()
    Remainder Section
 1. acquire(M)
 2. SW-Write-try()
    Critical Section
 3. SW-Write-exit()
 4. release(M)

procedure Read-lock()
    Remainder Section
 5. SW-Read-try()
    Critical Section
 6. SW-Read-exit()

Fig. 2. Simple but incorrect Phase-Fair Multi-Writer Multi-Reader algorithm
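The composition in Figure 2 can be sketched in Python. The single-writer core below is a plain condition-variable reader-writer protocol standing in for Figure 1 (it is not the constant-RMR algorithm), and all class and method names are ours; the sketch also exhibits the (PF2) failure discussed next, since a writer still parked inside M.acquire() has not yet touched the core and newly arriving readers slip past it.

```python
import threading

# Sketch of the Figure 2 wrapper. SimpleSW is a condition-variable stand-in
# for the single-writer protocol of Figure 1; names are our own assumptions.
class SimpleSW:
    """Single-writer core: correct only if writers are already serialized."""
    def __init__(self):
        self.cv = threading.Condition()
        self.readers = 0
        self.writer = False
    def read_try(self):
        with self.cv:
            while self.writer:
                self.cv.wait()
            self.readers += 1
    def read_exit(self):
        with self.cv:
            self.readers -= 1
            self.cv.notify_all()
    def write_try(self):
        with self.cv:
            self.writer = True          # closes the "gate" for new readers
            while self.readers > 0:     # wait out readers already inside
                self.cv.wait()
    def write_exit(self):
        with self.cv:
            self.writer = False
            self.cv.notify_all()

class MultiWriterLock:
    """Figure 2: writers serialize through M, then run the single-writer core."""
    def __init__(self):
        self.M = threading.Lock()
        self.sw = SimpleSW()
    def write_lock(self):
        self.M.acquire()        # 1. acquire(M)
        self.sw.write_try()     # 2. SW-Write-try()
    def write_unlock(self):
        self.sw.write_exit()    # 3. SW-Write-exit()
        self.M.release()        # 4. release(M)
    def read_lock(self):
        self.sw.read_try()      # 5. SW-Read-try()
    def read_unlock(self):
        self.sw.read_exit()     # 6. SW-Read-exit()

# A writer blocked inside M.acquire() has not executed write_try() yet, so a
# newly arriving reader sails straight through read_try(): the (PF2) failure.
```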

But this algorithm does not satisfy fair switch from readers to writer (PF2). To see this, consider the following scenario. Say a reader is in the CS and no other processes are active. Then a writer w enters the Try section and executes the doorway of the lock M; hence w is in the waiting room of the multi-writer algorithm. At this point all the writers are still in the Remainder section of SW, because the only active writer w has not even started SW-Write-lock. This means that any new reader who begins its Try section now can go past w into the CS, thus violating (PF2). As mentioned in the previous section, SW satisfies the (WP1) property: if the writer completes the doorway before a reader starts its doorway, then the reader does not enter the CS before the writer. Also note that in the doorway of SW, the writer just flips the direction variable D, and at that point Gate[D] is closed. So a tempting idea to overcome the troubling scenario described above is to make the incoming writer execute the doorway (toggle D) before it enters the waiting room of the multi-writer algorithm (essentially the waiting room in acquire(M)). One obvious problem with this approach is that the direction variable D will keep flipping as new writers enter the waiting room, thus violating the invariants and correctness of SW. Another idea is that the exiting writer, say w, before exiting SW (which comprises just opening Gate[currD]), flips the direction variable D. This way only the readers currently waiting are enabled, but any new readers starting their Try section will be blocked. This idea works when there is some writer in the Try section while w is exiting. If w is the only active writer in the system, and it flips the direction
variable D in the Exit section, then Gate[D] will remain closed till the next writer flips D again. Hence, starvation freedom and concurrent entry might be in danger. One way to prevent this might be for w, before opening Gate[d], to check whether there are any writers in the Try section, and to flip the direction only if it sees a writer present. One inherent difficulty with this idea is the following: if w does not see any writers in the Try section and is poised to open Gate[d] for the waiting readers, and just then a bunch of writers enters the waiting room, and w opens the Gate for the waiting readers, property (PF2) might be in jeopardy. In this case, one might be tempted to think that one of these writers in the Try section should flip the direction. But which of these writers should flip the direction? Should the writer with the best priority (the one with the smallest token in the lock M), say w∗, flip the direction? But if w∗ sleeps before flipping the direction and many other writers enter the waiting room while some reader is in the CS, again (PF2) is in danger.

From the discussion above we can see the challenges in designing a multi-writer algorithm satisfying all of the phase-fairness properties. In particular, property (PF2) seems hard to achieve. Before we describe our correct phase-fair multi-writer algorithm, presented in Figure 3, we lay down some invariants essential for any phase-fair algorithm in which the readers simply execute the Read-lock procedure of SW and the writers obtain a mutual exclusion lock M before executing the Write-lock procedure of SW:

i. If no writers are active, then Gate[D] = true. (Required for starvation freedom and concurrent entry.)
ii. If a reader is in the CS and a writer is in the waiting room, then Gate[D] = false. (Required for (PF2).)

Now we are ready to describe our correct phase-fair multi-writer algorithm given in Figure 3. First we give the overall idea.
5.1 Informal Description of the Algorithm in Figure 3
The algorithm given in Figure 3 follows the lines of the discussion above. Both readers and writers use the underlying single-writer protocol SW. The Read-lock procedure is the same as in SW. The writers first obtain a Mutual Exclusion lock M and then execute the Write-lock procedure of SW. We make one slight but crucial change in the way processes execute SW: we replace the direction variable D with a fetch&add variable Y. When a process wants to know the current direction, it calls the procedure GetD(), which in turn accesses Y to determine the appropriate direction. The crux of the algorithm lies in the use of Y, and we will explain that later. We denote Lines 4-12 of SW, i.e., the waiting room of Write-lock, by SW-w-waiting. Similarly, SW-r-exit corresponds to the Exit section of the reader in SW, i.e., Lines 23-27 of SW.

5.2 Shared Variables and Their Purpose
The shared variables Gate, C, EC, Permit, ExitPermit and the writer's local variables currD and prevD have the same type and purpose as in SW. As mentioned earlier, the direction


Shared Variables:
Y is a fetch&add variable with two components [dr ∈ {0, 1}, wcount ∈ N], initialized to [0, 0]
ExitPermit is a Boolean read/write variable
∀d ∈ {0, 1}, Permit[d] is a Boolean read/write variable
∀d ∈ {0, 1}, Gate[d] is a Boolean read/write variable; initially Gate[0] is true and Gate[1] is false
EC is a fetch&add variable with two components [writer-waiting ∈ {0, 1}, reader-count ∈ N], initially EC = [0, 0]
∀d ∈ {0, 1}, C[d] is a fetch&add variable with two components [writer-waiting ∈ {0, 1}, reader-count ∈ N], initialized to [0, 0]

procedure Write-lock()
    Remainder Section
 1. F&A(Y, [0, 1])
 2. acquire(M)
 3. currD ← GetD()
 4. prevD ← currD
 5. SW-w-waiting()
    Critical Section
 6. F&A(Y, [1, −1])
 7. Gate[currD] ← true
 8. release(M)

procedure Read-lock()
    Remainder Section
 9. d ← GetD()
10. F&A(C[d], [0, 1])
11. d′ ← GetD()
12. if (d ≠ d′)
13.     F&A(C[d′], [0, 1])
14.     d ← GetD()
15.     if (F&A(C[d], [0, −1]) = [1, 1])
16.         Permit[d] ← true
17. wait till Gate[d]
    Critical Section
18. SW-r-exit()

procedure GetD()
19. (dr, wc) ← Y
20. if (wc = 0)
21.     return dr
22. return 1 − dr

Fig. 3. Phase-Fair Multi-Writer Multi-Reader Algorithm. The doorway of Write-lock comprises Line 1 and the doorway of acquire(M). The doorway of Read-lock comprises Lines 9-16.

variable D from SW has been replaced by a new fetch&add variable Y. We now describe this variable in detail.

Y: fetch&add variable updated only by the writers and read by both the readers and the writers. Y has two components [dr, wcount]. The component dr ∈ {0, 1} indicates the direction of the writers (corresponding to the replaced direction variable D); intuitively, dr is the direction of the last writer to be in the CS. The wcount component counts the writers in the Try section and the CS. A writer increments the wcount component at the beginning of its Try section, and in its Exit section it decrements the wcount component and flips the dr component. We assume the dr bit is the most significant bit of the word, so the writer only has to add 1 at the most significant bit to flip it.

How is Y able to give the appropriate direction? When a writer flips the dr component in the Exit section, there are two possibilities: either no writer is in the Try section, or some writer is present there. In the former case, the flip should not change the "real" direction; in the latter case it should. Here is where the wcount component comes into play. If no writer is present and some reader r reads Y to determine the direction (Lines 9, 11 or 14), r will notice that the wcount component of Y is zero, hence it will infer the direction to be dr, i.e., the direction of the last writer to be in the CS. On the other hand, if some writer is present in the Try section or CS,
then wcount ≠ 0, so r can infer that some writer has started an attempt since the last writer exited; hence it takes the direction to be the complement of dr. This mapping of the value of Y to the appropriate direction of the writer is done by the procedure GetD() (Lines 19-22). Now we explain the algorithm in detail line by line.

5.3 Line by Line Commentary
The Read-lock procedure is exactly the same as in SW, with the only difference that instead of reading D at Lines 9, 11 or 14, the reader takes the return value of the procedure GetD(), which in turn extracts the direction from the value of Y as described above. The doorway of the reader comprises Lines 9-16. Now we explain the Write-lock procedure in detail. In the Write-lock procedure, a writer w first increments the wcount component of Y (Line 1). Note that if wcount of Y was zero just before w executed Line 1, then w has implicitly flipped the direction, and if it was already non-zero then the direction is unchanged. Now w tries to acquire the lock M (Line 2) and proceeds to Line 3 when it is in the CS of lock M. Also note that the configuration of SW at this point is as if w has already executed its doorway (of SW); we will return to this detail when we explain the code of the Exit section of Write-lock (Lines 6-8). w sets its local variables currD and prevD appropriately (Lines 3-4). Then w executes the waiting room of SW (Lines 4-12 of Figure 1) to compete with the readers (Line 5). Once out of this waiting room, w enters the CS, as it is assured that no process (reader or writer) is present in the CS. Before we describe the Exit section, note that all the readers currently in the waiting room are waiting on Gate[currD], and at this point both Gate[0] and Gate[1] are closed.
In the first statement of the Exit section, the writer flips the dr component of Y and at the same time decrements the wcount component (Line 6). Note that at this point (when PCw = 7), if there are no active writers other than w in the system, wcount is zero and the direction is exactly the same as currD. Hence invariant i. mentioned in the previous subsection holds. Similarly, if there is some writer present in the Try section (Y.wcount > 0), then w has flipped the direction to the complement of currD; essentially, w has executed the doorway of SW for the next writer. At Line 7, w enables the waiting readers by opening Gate[currD]. Note that at this point, if there is a writer in the waiting room, the direction equals the complement of currD and the Gate on that side is closed; hence invariant ii. from the previous subsection holds. Finally, w releases the lock M (Line 8). The following theorem summarizes the properties of this algorithm.

Theorem 2 (Multi-Writer Multi-Reader Phase-Fair Lock). The algorithm in Figure 3 implements a Multi-Writer Multi-Reader lock satisfying the properties (P1)-(P6) and (PF2), (PF3), using the lock M from [13] and the algorithm in Figure 1. The RMR complexity of the algorithm in the CC model is O(1). The algorithm uses O(m) shared variables that support read, write, and fetch&add, where m is the number of writers in the system.

Footnote: We assume that the dr bit is the most significant bit of the word storing Y, and any overflow bit is simply dropped. Hence w only has to fetch&add [1, −1] to Y to atomically decrement wcount and flip dr.
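Under this packing assumption, the behavior of Y and GetD() can be sketched as follows (the bit positions and names are our own choices; atomic fetch&add is simulated with a lock):

```python
import threading

# Sketch of the packed Y variable and GetD() from Figure 3. Assumptions (ours):
# wcount lives in the low 16 bits and dr in bit 16, so flipping dr is a single
# add of (1 << 16), with any overflow out of the word dropped.
DR_BIT = 1 << 16
WORD_MASK = (1 << 17) - 1          # overflow past the dr bit is discarded

class Y:
    """Fetch&add word; the lock simulates the hardware atomic."""
    def __init__(self):
        self._word = 0
        self._lock = threading.Lock()
    def fetch_and_add(self, delta):
        with self._lock:
            old = self._word
            self._word = (self._word + delta) & WORD_MASK
            return old
    def read(self):
        with self._lock:
            return self._word

def get_d(y):
    """GetD(): direction is dr if no writer is active, else its complement."""
    word = y.read()
    dr, wc = (word >> 16) & 1, word & 0xFFFF
    return dr if wc == 0 else 1 - dr

y = Y()
assert get_d(y) == 0                  # no writers: direction is dr = 0
y.fetch_and_add(1)                    # writer doorway: wcount 0 -> 1 flips direction
assert get_d(y) == 1
y.fetch_and_add(DR_BIT - 1)           # writer exit: flip dr, decrement wcount
assert get_d(y) == 1                  # wcount back to 0: direction is the new dr
```

The single fetch&add of [1, −1] at the writer's exit is what makes the flip-and-decrement atomic with respect to readers calling GetD().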


References

1. Brandenburg, B.B., Anderson, J.H.: Reader-writer synchronization for shared-memory multiprocessor real-time systems. In: ECRTS 2009: Proceedings of the 21st Euromicro Conference on Real-Time Systems, pp. 184–193. IEEE Computer Society, Washington, DC (2009)
2. Brandenburg, B.B., Anderson, J.H.: Spin-based reader-writer synchronization for multiprocessor real-time systems. Submitted to Real-Time Systems (December 2009), http://www.cs.unc.edu/~anderson/papers/rtj09-for-web.pdf
3. Dijkstra, E.W.: Solution of a problem in concurrent programming control. Commun. ACM 8(9), 569 (1965)
4. Courtois, P.J., Heymans, F., Parnas, D.L.: Concurrent control with "readers" and "writers". Commun. ACM 14(10), 667–668 (1971)
5. Bhatt, V., Jayanti, P.: Constant RMR solutions to reader writer synchronization. In: PODC 2010: Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, pp. 468–477. ACM, New York (2010)
6. Lamport, L.: A new solution of Dijkstra's concurrent programming problem. Commun. ACM 17(8), 453–455 (1974)
7. Mellor-Crummey, J.M., Scott, M.L.: Scalable reader-writer synchronization for shared-memory multiprocessors. In: PPOPP 1991: Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 106–113. ACM, New York (1991)
8. Lev, Y., Luchangco, V., Olszewski, M.: Scalable reader-writer locks. In: SPAA 2009: Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, pp. 101–110. ACM, New York (2009)
9. Hadzilacos, V., Danek, R.: Local-spin group mutual exclusion algorithms. In: Liu, H. (ed.) DISC 2004. LNCS, vol. 3274, pp. 71–85. Springer, Heidelberg (2004)
10. Cypher, R.: The communication requirements of mutual exclusion. In: SPAA 1995: Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 147–156. ACM, New York (1995)
11. Attiya, H., Hendler, D., Woelfel, P.: Tight RMR lower bounds for mutual exclusion and other problems. In: STOC 2008: Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pp. 217–226. ACM, New York (2008)
12. Anderson, J.H., Kim, Y.-J.: An improved lower bound for the time complexity of mutual exclusion. Distrib. Comput. 15(4), 221–253 (2002)
13. Anderson, T.E.: The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst. 1(1), 6–16 (1990)

On the Performance of Distributed Lock-Based Synchronization

Yuval Lubowich and Gadi Taubenfeld
The Interdisciplinary Center, P.O. Box 167, Herzliya 46150, Israel
{yuval,tgadi}@idc.ac.il

Abstract. We study the relation between two classical types of distributed locking mechanisms, called token-based locking and permission-based locking, and several distributed data structures which use locking for synchronization. We have proposed, implemented and tested several lock-based distributed data structures, namely, two different types of counters called find&increment and increment&publish, a queue, a stack and a linked list. For each of them we have determined the preferred type of lock to use as the underlying locking mechanism. Furthermore, we have determined which of the two proposed counters is better, both as a stand-alone data structure and as a building block for implementing other high-level data structures.

Keywords: Locking, synchronization, distributed mutual exclusion, distributed data structures, message passing, performance analysis.

1 Introduction

1.1 Motivation and Objectives
Simultaneous access to a data structure shared among several processes, in a distributed message passing system, must be synchronized in order to avoid interference between conflicting operations. Distributed mutual exclusion locks are the de facto mechanism for concurrency control on distributed data structures. A process accesses the data structure only while holding the lock, and hence the process is guaranteed exclusive access. Over the years a variety of techniques have been proposed for implementing distributed mutual exclusion locks. These locks can be grouped into two main classes: token-based locks and permission-based locks. In token-based locks, a single token is shared by all the processes, and a process acquires the lock (i.e., is allowed to enter its critical section) only when it possesses the token. Permission-based locks are based on the principle that a process acquires the lock only after having received "enough" permissions from other processes. Our first objective is:

Objective one. To determine which of the two locking techniques – token-based locking or permission-based locking – is more efficient.

Our strategy to achieve this objective is to implement one classical token-based lock (Suzuki-Kasami's lock [28]) and two classical permission-based locks (Maekawa's lock [10] and Ricart-Agrawala's lock [22]), and to compare their performance.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 131–142, 2011. © Springer-Verlag Berlin Heidelberg 2011


It is possible to trivially implement a lock by letting a single pre-defined process (machine) act as an "arbiter", or even to let all the data structures reside in the local memory of a single process and let this process impose a definite order between concurrent operations. Such a centralized solution might be preferred in some situations, although it limits the degree of concurrency, imposes an extra load on the arbiter, and is less robust. In this work, we focus on fully distributed implementations of locks. Locks are just a tool used when implementing various distributed applications; thus, our second objective has to do with implementing lock-based data structures.

Objective two. To propose, implement and test several distributed data structures, namely, two different types of counters, a queue, a stack and a linked list; and for each of the data structures to determine the preferred mutual exclusion lock to use as the underlying locking mechanism.

Furthermore, the worst-case message complexity of one of the permission-based locks (i.e., Maekawa's lock) is better than the worst-case message complexity of the other two locks. It would be interesting to find out whether this theoretical result would be reflected in our performance analysis results. In a shared memory implementation of a data structure, the shared data is usually stored in the shared memory. But who should hold the data in a distributed message passing system? One process? All of them? In particular, when implementing a distributed counter, who should hold the current value of the counter? To address this question, for the case of a shared counter, we have implemented and compared two types of shared counters: a find&increment counter, where only the last process to update the counter needs to know its value; and an increment&publish counter, where everybody should know the value of the counter after each time the counter is updated.
We notice that in the find&increment counter, once the counter is updated there is no need to "tell" its new value to everybody, but in order to update such a counter one has to find its current value first. In the increment&publish counter the situation is the other way around.

Objective three. To determine which of the two proposed counters is better, either as a stand-alone data structure or as a building block for implementing other high-level data structures (such as a queue, a stack or a linked list).

We point out that, in our implementations of a queue, a stack, and a linked list, the shared data is distributed among all the processes; that is, all the items inserted by a process are kept in the local memory of that process.

1.2 Experimental Framework and Performance Analysis
In order to measure and analyze the performance of the five proposed data structures and of the three locks, we have implemented and run each data structure with each of the three implemented distributed mutual exclusion locks as the underlying locking mechanism. We have measured each data structure's performance, when using each of the locks, on a network with one, five, ten, fifteen and twenty processes, where
each process runs in a different node. A typical simulation scenario for a shared counter looked like this: use 15 processes to count up to 15 million by using a find&increment counter that employs Maekawa's lock as its underlying locking mechanism. The queue, stack, and linked list were also tested using each of the two counters as a building block, in order to determine which of the two counters performs better when used as a building block for implementing other high-level data structures. Special care was taken to make the experiments more realistic by preventing runs which would display an overly optimistic performance; for example, preventing runs where a process completes several operations while acquiring and holding the lock once. Our testing environment consisted of 20 Intel XEON 2.4 GHz machines running the Windows XP OS with 2 GB of RAM and using JRE version 1.4.2_08. All the machines were located inside the same LAN and were connected using a 20-port Cisco switch.

1.3 Our Findings
The experiments, as reported in Section 5, lead to the following conclusions:

1. Permission-based locking always outperforms token-based locking. That is, each of the two permission-based locks always outperforms the token-based lock. This result about locks may suggest that, in general, when implementing distributed data structures it is better to take the initiative and search for information (i.e., ask for permissions) when needed, instead of waiting for your turn.

2. Maekawa's permission-based lock always outperforms the Ricart-Agrawala and Suzuki-Kasami locks when used as the underlying locking mechanism for implementing the find&increment counter, the increment&publish counter, a queue, a stack, and a linked list. Put another way, for each of the five data structures, the preferred lock to use as the underlying locking mechanism is always Maekawa's lock.
The worst-case message complexity of Maekawa's lock is better than the worst-case message complexity of both the Ricart-Agrawala and Suzuki-Kasami locks; thus, the performance analysis supports and confirms the theoretical analysis.

3. The find&increment counter always outperforms the increment&publish counter, either as a stand-alone data structure or when used as a building block for implementing a distributed queue and a distributed stack. This result about counters may suggest that, in general, when implementing distributed data structures it is more efficient to actively search for information only when it is needed, instead of distributing it in advance.

As expected, all the data structures exhibit performance degradation as the number of processes grows.

2 Distributed Mutual Exclusion Algorithms
We consider a system which is made up of n reliable processes, denoted p1, . . . , pn, which communicate via message passing. We assume that the reader is familiar with the definition of the mutual exclusion problem [4, 29]. The three mutual exclusion algorithms (i.e., locks) implemented in this work satisfy the mutual exclusion and starvation freedom requirements.


The first published distributed mutual exclusion algorithm, due to Lamport [8], is based on the notion of logical clocks. Over the years a variety of techniques have been proposed for implementing distributed mutual exclusion locking algorithms [19, 20, 25, 26]. These algorithms can be grouped into two main classes: token-based algorithms [13, 18, 28, 27] and permission-based algorithms [3, 8, 10, 12, 22, 24].

– In token-based algorithms, a single token is shared by all the processes, and a process is allowed to enter its critical section only when it possesses the token. A process continues to hold the token until its execution of the critical section is over, and then it may pass it to some other process.

– Permission-based algorithms are based on the principle that a process may enter its critical section only after having received "enough" permissions from other processes. Some permission-based algorithms require that a process receive permissions from all of the other processes, whereas other, more efficient algorithms require a process to receive permissions from a smaller group.

Below we describe the basic principles of the three known distributed mutual exclusion algorithms that we have implemented.

2.1 Suzuki-Kasami's Token-Based Algorithm
In Suzuki and Kasami's algorithm [28], the privilege to enter a critical section is granted to the process that holds the PRIVILEGE token (which is always held by exactly one process). Initially, process p1 has the privilege. A process requesting the privilege sends a REQUEST message to all other processes. A process receiving a PRIVILEGE message (i.e., the token) is allowed to enter its critical section repeatedly until it passes the PRIVILEGE to some other process. A REQUEST message of process pj has the form REQUEST(j, m), where j is the process identifier and m is a sequence number which indicates that pj is requesting its (m + 1)-th critical section invocation.
Each process has an array RN of size n, where n is the number of processes. This array is used to record the largest sequence number received from each of the other processes. When a REQUEST(j, m) message is received by pi, the process updates RN by executing RN[j] = max(RN[j], m). A PRIVILEGE message has the form PRIVILEGE(Q, LN), where Q is a queue of requesting processes and LN is an array of size n such that LN[j] is the sequence number of the most recently granted request of pj. When pi finishes executing its critical section, the array LN contained in the last PRIVILEGE message received by pi is updated by executing LN[i] = RN[i], indicating that the current request of pi has been granted. Next, every process pj such that RN[j] = LN[j] + 1 is appended to Q, provided that pj is not already in Q. When these updates are completed, if Q is not empty then PRIVILEGE(tail(Q), LN) is sent to the process at the head of Q. If Q is empty then pi retains the privilege until some process requests it. The algorithm requires, in the worst case, n message exchanges per mutual exclusion invocation: (n − 1) REQUEST messages and one PRIVILEGE message. In the best case, when the process requesting to enter its critical section already holds the privilege token, the algorithm requires no messages at all.
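The token bookkeeping described above can be sketched as follows (message passing is elided and the helper names are ours; only the RN/LN/queue logic is shown):

```python
from collections import deque

# Sketch of the bookkeeping in Suzuki-Kasami's algorithm: the RN update on
# receiving a REQUEST, and the LN/queue update performed by the token holder
# on leaving its critical section. Names are our own assumptions.
def on_request(RN, j, m):
    """Any process records pj's latest request number."""
    RN[j] = max(RN[j], m)

def on_cs_exit(i, RN, LN, Q):
    """Token holder pi updates the token (LN, Q); returns the next holder."""
    LN[i] = RN[i]                        # pi's current request is now granted
    for j in range(len(RN)):
        if RN[j] == LN[j] + 1 and j not in Q:
            Q.append(j)                  # pj has a pending, ungranted request
    return Q.popleft() if Q else None    # pass the token on, or keep it

n = 3
RN = [0] * n; LN = [0] * n; Q = deque()
on_request(RN, 1, 1)                     # p1 asks for its 1st CS entry
on_request(RN, 2, 1)                     # so does p2
nxt = on_cs_exit(0, RN, LN, Q)
assert nxt == 1 and list(Q) == [2]       # token goes to p1; p2 queued behind it
```

The test RN[j] = LN[j] + 1 is exactly the condition "pj has requested an entry that has not yet been granted", which keeps each process in the token's queue at most once.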


2.2 Ricart-Agrawala's Permission-Based Algorithm
The first permission-based algorithm, due to Lamport [8], has 3(n − 1) message complexity. Ricart and Agrawala modified Lamport's algorithm to achieve 2(n − 1) message complexity [22]. In this algorithm, when a process, say pi, wants to enter its critical section, it sends a REQUEST(m, i) message to all other processes. This message contains a sequence number m and the process's identifier i, which are then used to define a priority among requests. Process pj, upon receipt of a REQUEST message from process pi, sends an immediate REPLY message to pi if either pj itself has not requested to enter its critical section, or pj's request has a lower priority than that of pi. Otherwise, process pj defers its REPLY (to pi) until its own (higher priority) request is granted. Process pi enters its critical section when it receives REPLY messages from all the other n − 1 processes. When pi releases the critical section, it sends a REPLY message to all deferred requests. Thus, a REPLY message from process pi implies that pi has finished executing its critical section. This algorithm requires only 2(n − 1) messages per critical section invocation: n − 1 REQUEST messages and n − 1 REPLY messages.

2.3 Maekawa's Permission-Based Algorithm
In Maekawa's algorithm [10], process pi acquires permission to enter its critical section from a set of processes, denoted Si, consisting of O(√n) processes that act as arbiters. The algorithm uses only c√n messages per critical section invocation, where c is a constant between 3 for light traffic and 5 for heavy traffic. Each process can issue a request at any time. In order to arbitrate requests, any two requests from different processes must be known to at least one arbitrator process.
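The deferral rule at the heart of Ricart-Agrawala's algorithm can be sketched as a single comparison (the helper name is ours; lower sequence number wins, with ties broken by process identifier):

```python
# Sketch of Ricart-Agrawala's deferral rule: on receiving REQUEST(m, i),
# process pj replies immediately unless it has an outstanding request of
# higher priority. Priority is the pair (sequence number, process id),
# compared lexicographically; the helper name is our own.
def should_reply_now(req, own):
    """req = (m, i) from the requester; own = (m', j), or None if pj is idle."""
    if own is None:
        return True            # pj is not competing, so it replies at once
    return req < own           # smaller (m, id) pair has the higher priority

assert should_reply_now((3, 1), None) is True     # pj idle: immediate REPLY
assert should_reply_now((3, 1), (5, 2)) is True   # pi's request is older
assert should_reply_now((5, 2), (3, 1)) is False  # pj defers its REPLY
assert should_reply_now((3, 2), (3, 1)) is False  # tie on m: smaller id wins
```

Because the (m, id) order is total, any two conflicting requests agree on who waits, which is what rules out both deadlock and mutual-exclusion violations in this scheme.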
Since process pi must obtain permission to enter its critical section from every process in Si, the intersection of every pair of sets Si and Sj must not be empty, so that processes in Si ∩ Sj can serve as arbitrators between conflicting requests of pi and pj. There are many efficient constructions of the sets S1, ..., Sn (see for example [7, 23, 15]). The construction used in our implementation is as follows. Assume that √n is a positive integer (if not, a few dummy processes can be added). Consider a matrix of size √n × √n, where the value of entry (i, j) is (i − 1)√n + j. Clearly, for every k ∈ {1, ..., n} there is exactly one entry, denoted (ik, jk), whose value is k; this entry is (⌈k/√n⌉, ((k − 1) mod √n) + 1). For each k ∈ {1, ..., n}, the subset Sk is defined to be the set of values on the row and the column passing through (ik, jk). Clearly, Si ∩ Sj ≠ ∅ for all pairs i and j (and the size of each set is 2√n − 1). Thus, whenever two processes pi and pj try to enter their critical sections, the arbiter processes in Si ∩ Sj will grant access to only one of them at a time, and thus the mutual exclusion property is satisfied. By carefully designing the algorithm, deadlock is also avoided.
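The grid construction described above can be sketched directly (a hypothetical helper, assuming √n is an integer as in the text):

```python
import math

# Sketch of the grid quorum construction: processes 1..n are laid out in a
# sqrt(n) x sqrt(n) matrix and S_k is the row plus the column through the
# entry holding k. The helper name is our own; sqrt(n) must be an integer.
def maekawa_quorums(n):
    r = math.isqrt(n)
    assert r * r == n, "sqrt(n) must be an integer (add dummy processes)"
    quorums = {}
    for k in range(1, n + 1):
        i, j = (k - 1) // r, (k - 1) % r       # 0-based row/column of entry k
        row = {i * r + c + 1 for c in range(r)}
        col = {q * r + j + 1 for q in range(r)}
        quorums[k] = row | col                 # |S_k| = 2*sqrt(n) - 1
    return quorums

S = maekawa_quorums(9)
assert all(len(S[k]) == 2 * 3 - 1 for k in S)
# every pair of quorums intersects, so conflicting requests share an arbiter
assert all(S[a] & S[b] for a in S for b in S)
```

Any row-plus-column pair in a √n × √n grid shares at least one cell, which is exactly the common arbiter the mutual exclusion argument relies on.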

3 Distributed Data Structures
We have proposed and implemented five distributed data structures: two types of counters, a queue, a stack and a linked list. Each of these data-structure implementations
makes use of an underlying locking mechanism. As already mentioned, we have implemented the three mutual exclusion algorithms described in the previous section and, for each of the five data structures, determined the preferred mutual exclusion algorithm to use for locking. Various lock-based data structures have been proposed in the literature, mainly for use in databases; see for example [2, 5, 9]. A distributed dictionary structure is studied in [16]. Below we describe the five distributed data structures that we have studied. All the data structures are linearizable. Linearizability means that, although several processes may concurrently invoke operations on a linearizable data structure, each operation appears to take place instantaneously at some point in time, and the relative order of non-concurrent operations is preserved [6].

Two counters. A shared counter is a linearizable data structure that supports a single operation: incrementing the value of the counter by one and returning its previous value. We have implemented and compared two types of shared counters:

1. A find&increment counter. In this type of counter, only the last process to update the counter needs to know its value. In the implementation, a single lock is used, and only the last process to increment the counter knows its current value. A process p that tries to increment the shared counter first acquires the lock. Then p sends a FIND message to all other processes. When the process that knows the value of the counter receives a FIND message, it replies by sending a message with the value of the counter to p. When p receives the message, it increments the counter and releases the lock. (We notice that p can keep on incrementing the counter's value until it gets a FIND message.)

2. An increment&publish counter. In this counter, everybody should know the value of the counter each time it is updated. In the implementation, a single lock is used.
A process that tries to increment the shared counter first acquires the lock. Then it increases the counter's value, sends messages to all other processes informing them of the new counter value, collects acknowledgements, and releases the lock.

A queue. A distributed queue is a linearizable data structure that supports enqueue and dequeue operations, by several processes, with the usual queue semantics. We have implemented a distributed queue that consists of local queues residing in the individual processes participating in the distributed queue. A single lock and a shared counter are used for the implementation. Each element in the queue has a timestamp that is generated using the shared counter. An ENQUEUE operation is carried out by increasing the counter's value by one and enqueuing the element in the local queue along with the counter's value. A DEQUEUE operation is carried out by first acquiring the lock, locating the process that holds the element with the lowest timestamp, removing this element from that process' local queue, and releasing the lock.

A stack. A distributed stack is a linearizable data structure that supports push and pop operations, by several processes, with the usual stack semantics. We have implemented a distributed stack which is similar to the distributed queue. A single lock and a shared counter are used for the implementation. It consists of local stacks residing in the individual processes participating in the distributed stack. Each element in the stack has a timestamp that is generated by the shared counter. A PUSH operation is carried out by incrementing the counter value by one and pushing the element onto the local stack along with the counter's value. A POP operation is carried out by acquiring the lock, locating the process that contains the element with the highest timestamp, removing this element from its local stack, and releasing the lock.

A linked list. A distributed linked list is a linearizable data structure that supports insertion and deletion of elements at any point in the list. We have implemented a list which consists of a sequence of elements, each containing a data field and two references ("links") pointing to the next and previous elements. Each element can reside in any process. The distributed list also supports the operations "traverse list" and "size of list". The list maintains head and tail "pointers" that can be sent to requesting processes. Each pointer maintains a reference to a certain process and a pointer to a real element stored in that process. Manipulating the list requires that a lock be acquired. A process that needs to insert an element at the head of the list acquires the lock and sends a request for the "head pointer" to the rest of the processes. When the process that holds the "head pointer" receives the message, it immediately replies by sending the pointer to the requesting process. Once the requesting process has the "head pointer", inserting the new element is purely a matter of storing it locally and modifying the "head pointer" to point to the new element (the new element, of course, now points to the element previously pointed to by the "head pointer"). Deleting an element from the head of the list is done in much the same way. Inserting or deleting elements elsewhere in the list requires a process to acquire the (single) lock, traverse the list, and manipulate the list's elements.
A process is able to measure the size of the list by acquiring the lock and then querying the other processes about the size of their local lists.
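The difference between the two counters can be sketched as follows. This is a minimal single-machine sketch, not the authors' code: real message passing is replaced by a message count so the communication cost of each design is visible, a local `threading.Lock` stands in for the distributed lock, and the names `n_procs`, `holder`, and `messages` are invented for the illustration.

```python
import threading

class FindIncrementCounter:
    """Sketch: only the last incrementer knows the value; others FIND it."""
    def __init__(self, n_procs):
        self.lock = threading.Lock()   # stands in for the distributed lock
        self.holder = None             # pid of the process knowing the value
        self.value = 0
        self.messages = 0              # messages sent, excluding lock traffic
        self.n = n_procs

    def increment(self, pid):
        with self.lock:
            if self.holder is not None and self.holder != pid:
                # FIND broadcast to the other n-1 processes, plus one reply
                self.messages += (self.n - 1) + 1
            prev, self.value = self.value, self.value + 1
            self.holder = pid
            return prev

class IncrementPublishCounter:
    """Sketch: every increment is broadcast to all others, who acknowledge."""
    def __init__(self, n_procs):
        self.lock = threading.Lock()
        self.value = 0
        self.messages = 0
        self.n = n_procs

    def increment(self, pid):
        with self.lock:
            prev, self.value = self.value, self.value + 1
            # publish to n-1 processes and collect n-1 acknowledgements
            self.messages += 2 * (self.n - 1)
            return prev
```

Under this model, a process that increments repeatedly pays no messages in the find&increment counter (it already knows the value), while the increment&publish counter pays 2(n−1) messages on every increment, which matches the paper's observation that searching for information on demand beats distributing it in advance.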

4 The Experimental Framework

Our testing environment consisted of 20 Intel Xeon 2.4 GHz machines with 2 GB of RAM, running the Windows XP OS and JRE version 1.4.2_08. All the machines were located inside the same LAN and were connected using a 20-port Cisco switch. We measured each data structure's performance, using each one of the distributed mutual exclusion algorithms, by running each data structure on a network with one, five, ten, fifteen and twenty processes, where each process runs on a different node of the network. For example, a typical simulation scenario for a shared counter looked like this: use 15 processes to count up to 15 million with a find&increment counter that employs Maekawa's algorithm as its underlying locking mechanism.

All tests were implemented using Coridan's messaging middleware technology, MantaRay. MantaRay is a fully distributed, server-less architecture in which processes running in the network are aware of one another and, as a result, are able to send messages back and forth directly. We tested each of the implementations in executions lasting hours, and sometimes days, on various numbers of processes (machines).

5 Performance Analysis and Results

All the experiments on the data structures we have implemented start with an initially empty data structure (queue, stack, etc.) on which the processes perform a series of operations. For example, in the case of a queue, the processes performed a series of enqueue/dequeue operations: each process enqueued an element, did "something else", and repeated this a million times; after that, the process dequeued an element, did "something else", and again repeated this a million times. The "something else" consisted of approximately 30 milliseconds of doing nothing and waiting. As with the tests done on the locking algorithms, this served to make the experiments more realistic by preventing long runs by the same process, which would display overly optimistic performance, since a process may complete several operations while holding the lock. The time a process took to complete the "something else" is not included in our figures.

The experiments, as reported below, lead to the following conclusions:

– Maekawa's permission-based algorithm always outperforms the Ricart-Agrawala and Suzuki-Kasami algorithms when used as the underlying locking mechanism for implementing the find&increment counter, the increment&publish counter, a queue, a stack, and a linked list;
– The find&increment counter always outperforms the increment&publish counter, either as a stand-alone data structure or when used as a building block for implementing a distributed queue and a distributed stack.

As expected, the data structures exhibit performance degradation as the number of processes grows.

5.1 Counters

The two graphs in Figure 1 show the time one process spends performing a single count up operation, averaged over one million operations for each process, using each of the three locking algorithms implemented. As can be seen, the counters perform worst when using the Ricart-Agrawala algorithm and best when using Maekawa's algorithm.
As for comparing the two counters, it is clear that the find&increment counter behaves and scales better than the increment&publish counter as the number of processes grows. The observation that the find&increment counter is better than the increment&publish counter will also become clear when examining the results for the queue and stack implementations, which use shared counters as building blocks.

5.2 A Queue

The two graphs in Figure 2 show the time one process spends performing a single enqueue operation, averaged over one million operations for each process, using each of the three locks. Similar to the performance analysis of the two counters, the queue performs worst when using the Ricart-Agrawala algorithm and best when using Maekawa's algorithm. It is clear that the queue performs better when using the find&increment counter than when using the increment&publish counter.
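The per-operation averages reported in these figures can be obtained with a measurement loop of the following shape. This is a sketch of the structure implied by the experiment description above, not the authors' harness: each iteration performs one operation, then idles for the roughly 30 ms "something else", and only the time spent inside the operation is accumulated, since the think time is excluded from the figures.

```python
import time

def average_op_time(op, iterations, think_time=0.0):
    """Average the wall-clock time of `op` over `iterations` runs,
    excluding the idle think time between operations."""
    total = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        op()                            # e.g. one enqueue or count-up
        total += time.perf_counter() - start
        time.sleep(think_time)          # the "something else": prevents one
                                        # process from hogging the lock
    return total / iterations           # average seconds per operation
```

A run in the paper's setup would correspond to something like `average_op_time(counter_increment, iterations=10**6, think_time=0.03)`.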

[Figure: two panels, "Find&increment Counter" and "Increment&publish Counter", plotting milliseconds per operation (0-450) against the number of processes (1-20) for the Ricart-Agrawala, Suzuki-Kasami, and Maekawa locks.]

Fig. 1. The time one process spends performing a single count up operation averaged over one million operations per process, in the find&increment counter and in the increment&publish counter

[Figure: two panels, "Queue - Enqueue Operation Employing Find&increment Counter" and "Queue - Enqueue Operation Employing Increment&publish Counter", plotting milliseconds per operation (0-450) against the number of processes (1-20) for the three locks.]

Fig. 2. The time one process spends performing an enqueue operation averaged over one million operations per process, in a queue employing the find&increment counter and in a queue employing the increment&publish counter

The dequeue operation does not make use of a shared counter. Figure 3 shows the time one process spends performing a single dequeue operation, averaged over one million operations for each process, using each of the three locks. Similar to the performance analysis of the enqueue operation, the dequeue operation is slowest when using the Ricart-Agrawala algorithm and fastest when using Maekawa's algorithm.

5.3 A Stack

As expected, the graphs of the performance analysis results for a stack are almost the same as those presented in the previous subsection for a queue, and are hence omitted from this abstract. As in all previous examples, the stack performs worst when using the Ricart-Agrawala algorithm and best when using Maekawa's algorithm. As for comparing the two counters, it is clear that the stack performs better when using the find&increment counter than when using the increment&publish counter.

[Figure: one panel, "Queue - Dequeue Operation", plotting milliseconds per operation (0-450) against the number of processes (1-20) for the three locks.]

Fig. 3. The time one process spends performing a dequeue operation averaged over one million operations per process

5.4 A Linked List

The linked list we have implemented does not make use of a shared counter. Rather, it uses the locking algorithm directly to acquire a lock before manipulating the list itself. The graphs in Figure 4 show the time one process spends performing a single insert operation or a single delete operation, averaged over one million operations for each process, using each of the three locking algorithms implemented. As in all previous examples, the linked list performs worst when using the Ricart-Agrawala algorithm and best when using Maekawa's algorithm as the underlying locking mechanism.

[Figure: two panels, "Linked List Insert Operation" and "Linked List Delete Operation", plotting milliseconds per operation (0-350) against the number of processes (1-20) for the three locks.]

Fig. 4. The time one process spends performing an insert operation or delete operation averaged over one million operations per process in a linked list

6 Discussion

Data structures such as shared counters, queues and stacks are ubiquitous in programming concurrent and distributed systems, and hence their performance is a matter of concern. While data structures have been a very active research topic in recent years in the context of concurrent (shared-memory) systems, this is not the case for distributed (message-passing) systems. In this work, we have studied the relation between classical locks and specific distributed data structures which use locking for synchronization. The experiments consistently revealed that the implementation of Maekawa's lock is more efficient than that of the other two locks, and that the implementation of the find&increment counter is consistently more efficient than that of the increment&publish counter. The fact that Maekawa's lock performs better is, in part, due to the fact that its worst-case message complexity is lower. The results suggest that, in general, it is more efficient to actively search for information (or ask for permissions) only when it is needed, instead of distributing it to everybody in advance. Thus, we expect to find similar results for different experimental setups.

For our investigation, it is important to implement and use the locks as completely independent building blocks, so that we can compare their performance. In practice, various optimizations are possible. For example, when implementing the find&increment counter using a lock, a simple optimization would be to store the value of the counter along with the lock. Thus, when a process requests and obtains the lock, it obtains the current value of the counter along with the lock, thereby eliminating the need for any FIND messages.

Future work would entail implementing and evaluating other locking algorithms [3, 14], and fault-tolerant locking algorithms that do not assume an error-free network [1, 11, 17, 18, 21]. It would also be interesting to consider additional distributed lock-based data structures, and different experimental setups. When using locks, the granularity of synchronization is important. Our implementations are examples of coarse-grained synchronization, as they allow only one process at a time to access the data structure.
It would be interesting to consider data structures which use fine-grained synchronization, in which it is possible to lock "small pieces" of a data structure, allowing several processes with non-interfering operations to access it concurrently. Coarse-grained synchronization is easier to program, but compared to fine-grained synchronization it is less efficient and less fault-tolerant.
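The optimization mentioned in the discussion, storing the counter's value with the lock, can be sketched as follows. The class and method names (`LockWithValue`, `acquire`, `release`) are invented for this illustration, and a local `threading.Lock` stands in for the distributed lock: the point is only that the lock grant carries the current value, so no FIND messages are ever needed.

```python
import threading

class LockWithValue:
    """Sketch of a lock whose grant piggybacks the counter's value."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def acquire(self):
        self._lock.acquire()
        return self._value            # value delivered with the lock grant

    def release(self, new_value):
        self._value = new_value       # value travels with the lock
        self._lock.release()

def increment(counter_lock):
    """find&increment without FIND: lock and value arrive together."""
    prev = counter_lock.acquire()
    counter_lock.release(prev + 1)
    return prev
```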

References

1. Agrawal, D., El-Abbadi, A.: An efficient and fault-tolerant solution for distributed mutual exclusion. ACM Transactions on Computer Systems 9(1), 1–20 (1991)
2. Bayer, R., Schkolnick, M.: Concurrency of operations on B-trees. Acta Informatica 9(1), 1–21 (1977)
3. Carvalho, O.S.F., Roucairol, G.: On mutual exclusion in computer networks. Communications of the ACM 26(2), 146–147 (1983)
4. Dijkstra, E.W.: Solution of a problem in concurrent programming control. Communications of the ACM 8(9), 569 (1965)
5. Ellis, C.S.: Distributed data structures: A case study. IEEE Transactions on Computers C-34(12), 1178–1185 (1985)
6. Herlihy, M.P., Wing, J.M.: Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems 12(3), 463–492 (1990)
7. Ibaraki, T., Kameda, T.: A theory of coteries: Mutual exclusion in distributed systems. IEEE Transactions on Parallel and Distributed Systems 4(7), 779–794 (1993)


8. Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21(7), 558–565 (1978)
9. Lehman, P.L., Yao, S.B.: Efficient locking for concurrent operations on B-trees. ACM Transactions on Database Systems 6(4), 650–670 (1981)
10. Maekawa, M.: A √N algorithm for mutual exclusion in decentralized systems. ACM Transactions on Computer Systems 3(2), 145–159 (1985)
11. Mishra, S., Srimani, P.K.: Fault-tolerant mutual exclusion algorithms. Journal of Systems and Software 11(2), 111–129 (1990)
12. Mizuno, M., Nesterenko, M., Kakugawa, H.: Lock-based self-stabilizing distributed mutual exclusion algorithm. In: Proc. 16th Inter. Conf. on Dist. Comp. Systems, pp. 708–716 (1996)
13. Naimi, M., Trehel, M.: An improvement of the log n distributed algorithm for mutual exclusion. In: Proc. 7th Inter. Conf. on Dist. Comp. Systems, pp. 371–375 (1987)
14. Neilsen, M.L., Mizuno, M.: A DAG-based algorithm for distributed mutual exclusion. In: Proc. 11th Inter. Conf. on Dist. Comp. Systems, pp. 354–360 (1991)
15. Neilsen, M.L., Mizuno, M.: Coterie join algorithm. IEEE Transactions on Parallel and Distributed Systems 3(5), 582–590 (1992)
16. Peleg, D.: Distributed data structures: A complexity-oriented view. In: van Leeuwen, J., Santoro, N. (eds.) WDAG 1990. LNCS, vol. 486, pp. 71–89. Springer, Heidelberg (1991)
17. Rangarajan, S., Tripathi, S.K.: A robust distributed mutual exclusion algorithm. In: Toueg, S., Kirousis, L.M., Spirakis, P.G. (eds.) WDAG 1991. LNCS, vol. 579, pp. 295–308. Springer, Heidelberg (1992)
18. Raymond, K.: A tree-based algorithm for distributed mutual exclusion. ACM Transactions on Computer Systems 7(1), 61–77 (1989)
19. Raynal, M.: Algorithms for Mutual Exclusion. The MIT Press, Cambridge (1986); translation of: Algorithmique du parallélisme (1984)
20. Raynal, M.: A simple taxonomy for distributed mutual exclusion algorithms. Operating Systems Review (ACM) 25(2), 47–50 (1991)
21. Reddy, R.L.N., Gupta, B., Srimani, P.K.: A new fault-tolerant distributed mutual exclusion algorithm. In: Proc. of the ACM/SIGAPP Symposium on Applied Computing, pp. 831–839 (1992)
22. Ricart, G., Agrawala, A.K.: An optimal algorithm for mutual exclusion in computer networks. Communications of the ACM 24(1), 9–17 (1981); corrigendum in CACM 24(9), 578 (1981)
23. Shou, D., Wang, S.D.: A new transformation method for nondominated coterie design. Information Sciences 74(3), 223–246 (1993)
24. Singhal, M.: A dynamic information-structure mutual exclusion algorithm for distributed systems. IEEE Transactions on Parallel and Distributed Systems 3(1), 121–125 (1992)
25. Singhal, M.: A taxonomy of distributed mutual exclusion. Journal of Parallel and Distributed Computing 18(1), 94–101 (1993)
26. Singhal, M., Shivaratri, N.G.: Advanced Concepts in Operating Systems: Distributed, Database and Multiprocessor Operating Systems. McGraw-Hill, New York (1994)
27. van de Snepscheut, J.L.A.: Fair mutual exclusion on a graph of processes. Distributed Computing 2, 113–115 (1987)
28. Suzuki, I., Kasami, T.: A distributed mutual exclusion algorithm. ACM Transactions on Computer Systems 3(4), 344–349 (1985)
29. Taubenfeld, G.: Synchronization Algorithms and Concurrent Programming. Pearson/Prentice-Hall (2006), 423 pages, ISBN 0-131-97259-6

Distributed Generalized Dynamic Barrier Synchronization

Shivali Agarwal¹, Saurabh Joshi², and Rudrapatna K. Shyamasundar³

¹ IBM Research, India
² Indian Institute of Technology, Kanpur
³ Tata Institute of Fundamental Research, Mumbai

Abstract. Barrier synchronization is widely used in shared-memory parallel programs to synchronize between phases of data-parallel algorithms. With the proliferation of many-core processors, barrier synchronization has been adapted for higher-level language abstractions in new languages such as X10, wherein the processes participating in barrier synchronization are not known a priori, and the processes in distinct "places" do not share memory. Thus, the challenge is not only to achieve barrier synchronization in a distributed setting without any centralized controller, but also to deal with the dynamic nature of such synchronization, as processes are free to join and drop out at any synchronization phase. In this paper, we describe a solution for generalized distributed barrier synchronization wherein processes can dynamically join or drop out of barrier synchronization; that is, the participating processes are not known a priori. Using the policy of permitting a process to join only at the beginning of a phase, we arrive at a solution that ensures (i) Progress: a process executing phase k will enter phase k + 1 unless it wants to drop out of synchronization (assuming the phase executions of the processes terminate), and (ii) Starvation Freedom: a new process that wants to join a phase synchronization group that has already started does so in a finite number of phases. The above protocol is further generalized to multiple (possibly non-disjoint) groups of processes engaged in barrier synchronization.

1 Introduction

Synchronization and coordination play an important role in parallel computation. Language constructs for efficient coordination of computation on shared-memory multiprocessors and multi-core processors are of growing interest. There is a plethora of language constructs for realizing mutual exclusion, point-to-point synchronization, termination detection, collective barrier synchronization, etc. A barrier [8] is one of the important busy-wait primitives used to ensure that none of the processes proceed beyond a particular point in a computation until all have arrived at that point. A software implementation of the barrier using shared variables is also referred to as phase synchronization [1,7]. The issue of remote references while realizing barriers has been treated exhaustively in the seminal work [3]. Barrier synchronization protocols, either centralized or distributed, have been proposed earlier for the case when the processes that have to synchronize are given a priori [5,6,7,13,14]. With the proliferation of many-core processors, barrier synchronization has been adapted for higher-level language abstractions in new distributed-shared-memory-based languages such as X10 [16], wherein the processes participating in barrier synchronization are not known a priori. Some recent works that address a dynamic number of processes for barrier synchronization are [18,20,21]. More details on existing work on barrier synchronization can be found in Section 6. Surprisingly, a distributed solution to the phase synchronization problem in such dynamic environments has not yet been proposed.

In this paper, we describe a distributed solution to the problem of barrier synchronization used as an underlying synchronization mechanism for achieving phase synchronization where processes are dynamically created (in the context of nested parallelism). The challenge lies in arriving at common knowledge of the processes that want to participate in phase synchronization, for every phase, in a decentralized manner, such that there are guarantees on the progress and starvation-freedom properties of the processes, in addition to the basic correctness property.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 143–154, 2011. © Springer-Verlag Berlin Heidelberg 2011

1.1 Phase Synchronization Problem

The problem of phase synchronization [7] is described below. Consider a set of asynchronous processes, where each process executes a sequence of phases; a process begins its next phase only upon completion of its previous phase (for the moment, let us ignore what constitutes a phase). The problem is to design a synchronization scheme that guarantees the following properties:

1. No process begins its (k + 1)th phase until all processes have completed their kth phase, k ≥ 0.
2. No process will be permanently blocked from executing its (k + 1)th phase if all processes have completed their kth phase, k ≥ 0.

The set of processes that have to synchronize can either be given a priori and remain unchanged, or be a dynamic set that new processes may join as and when they want to phase-synchronize, and that existing processes may drop out of.

In this paper, we describe a distributed solution for dynamic barrier synchronization in the context of phase synchronization, wherein processes can dynamically join or drop out of phase synchronization. Using the policy of permitting a process to join in a phase subsequent to the phase of registration, we arrive at a solution that ensures (i) Progress: a process executing phase k will enter phase k + 1 unless it wants to drop out of synchronization (assuming the phase executions of the processes terminate), and


(ii) Starvation Freedom: a new process that wants to join a phase synchronization group that has already started does so in a finite number of phases. Our protocol establishes a bound of at most two phases from the phase in which it registered its intention to join¹ the phase synchronization; the lower bound is one phase. The correctness of the solution is formally established. The dynamic barrier synchronization algorithm is further generalized to cater to groups of barrier-synchronizing processes.
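For a static set of processes, the two properties of the phase synchronization problem can be illustrated with a minimal centralized barrier. The class and variable names below are ours, and the shared condition variable is used purely for illustration; the paper's contribution is achieving the same behavior for a dynamic process set without any central controller.

```python
import threading

class PhaseBarrier:
    """Static-membership phase synchronization for n processes."""
    def __init__(self, n):
        self.n = n
        self.arrived = 0
        self.phase = 0
        self.cv = threading.Condition()

    def next(self):
        with self.cv:
            my_phase = self.phase
            self.arrived += 1
            if self.arrived == self.n:      # last arrival opens phase k+1
                self.arrived = 0
                self.phase += 1
                self.cv.notify_all()
            else:
                # property 1: block until all have completed phase k;
                # property 2: once they have, the notify releases everyone
                while self.phase == my_phase:
                    self.cv.wait()
```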

2 Barrier Synchronization with Dynamic Set of Processes

We consider a distributed system which gets initialized with a non-empty set of processes. New processes can join the system at will, and existing processes may drop out of the system when they are done with their work. The processes carry out individual computations in phases and synchronize with each other at the end of each phase. Since it is a distributed system with no centralized control and no a priori knowledge of the number of processes in the system, each process has to dynamically discover the new processes that have joined the system, in such a manner that a new process can start synchronizing with the others within a finite amount of time. The distributed barrier synchronization protocol described below deals with this issue of including new processes in the ongoing phase synchronization in a manner that ensures the progress of existing as well as newly joined processes. It also handles the processes that drop out of the system, so that the existing processes know that they do not have to wait on these before commencing the next phase. Note that there is no a priori limit on the number of processes. The abstract linguistic constructs for registration and synchronization of processes are described in the following.

2.1 Abstract Language

We base our abstract language for the barrier synchronization protocol on X10. The relevant syntax is shown in Figure 1 and explained below:

<Program>    ::= <async Proc> || <async Proc>
<Proc>       ::= <clockDec>; <stmtseq> | clocked <clock-id> <stmtseq>
<clockDec>   ::= new clock c1, c2..
<stmtseq>    ::= <basic-stmt> | <basic-stmt> <stmtseq>
<basic-stmt> ::= async Proc | atomic stmt | seq stmt | c.register | c.drop | next
clock-id     ::= c1, c2, ...

Fig. 1. Abstract Clock Language for Barrier Synchronization

¹ Starvation Freedom is guaranteed only for processes that are registered.


• Asynchronous activities: The keyword denoting asynchronous processes is async. An async is used with an optional place expression and a mandatory code block. A process that creates another process is said to be the parent of the process it creates.

• Clock synchronization: Special variables of type clock are used for barrier synchronization of processes. A clock corresponds to the notion of a barrier. A set of processes registered with a clock synchronize with each other w.r.t. that clock. A barrier synchronization point in a process is denoted by next. If a process is registered on multiple clocks, then next denotes synchronization on all of them. This makes the barrier synchronization deadlock-free. The abstraction of phase synchronization through clocks makes it possible to form groups of processes such that groups can merge or split dynamically for synchronization.

Some important points regarding the dynamic joining rule for phase synchronization are:
– A process registered on clock c can create a child process synchronizing on c via async clocked c {body}. The child process joins the phase synchronization in phase k + 1 if the parent is in phase k while executing the async.
– A process can register with a clock c using c.register. It will join the phase synchronization from phase k + 1 or k + 2 if the clock is in phase k at the time of registration.

Some important points regarding dropping out of phase synchronization are:
– A process that drops in phase k is dropped out in the same phase and is not allowed to create child processes that want to join phase synchronization in that phase. Note that this does not restrict the expressiveness of the language in any way and ensures a clean way of dropping out.
– A process that registers after dropping loses its parent information and is treated as a process whose parent is not clocked on c.
– An implicit c.drop is assumed when a process registered on clock c terminates.
We now provide a solution, in the form of a protocol, for the distributed dynamic barrier synchronization problem that provably obeys the above-mentioned dynamic joining rules.
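The joining and dropping rules above can be illustrated with a small centralized sketch. The class, its single monitor object, and the names `members`, `joining`, and `arrived` are our inventions for illustration only; registration takes effect from a subsequent phase and a drop takes effect in the current phase, matching the rules, while the paper's protocol realizes the same behavior without a central controller.

```python
import threading

class DynamicClock:
    """Centralized sketch of a clock with dynamic join/drop rules."""
    def __init__(self, initial_members):
        self.cv = threading.Condition()
        self.phase = 0
        self.members = set(initial_members)
        self.joining = set()            # registered; active from phase+1
        self.arrived = set()

    def register(self, pid):
        with self.cv:
            self.joining.add(pid)       # joins from a subsequent phase

    def drop(self, pid):
        with self.cv:
            self.members.discard(pid)   # drop takes effect in this phase
            self._maybe_advance()

    def next(self, pid):
        with self.cv:
            while pid not in self.members:   # wait until the join took effect
                self.cv.wait()
            my_phase = self.phase
            self.arrived.add(pid)
            self._maybe_advance()
            while self.phase == my_phase:    # barrier wait
                self.cv.wait()

    def _maybe_advance(self):
        if self.members and self.arrived >= self.members:
            self.members |= self.joining     # new processes enter at k+1
            self.joining = set()
            self.arrived = set()
            self.phase += 1
            self.cv.notify_all()
```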

3 Distributed Barrier Synchronization Solution

The distributed barrier synchronization protocol for a single clock is given in figure 2, which describes the protocol for the barrier operations: initialization, synchronization and drop. The notation used in the solution is described below.

Notation:
– We denote the ith process by Ai (also referred to as process i).
– Phases are tracked in terms of k − 1, k, k + 1, · · · .
– We use the guarded command notation [2] for describing our algorithm, as it makes it easy to capture interleaving execution and termination in a structured manner.


Assumption: The processes do not have random failures and will always call c.drop when they want to leave the phase synchronization.

3.1 Correspondence between Protocol Steps and Clock Operations

The correspondence of the clock operations with the protocol operations is given below: • new clock c: Creation of a clock(barrier) corresponds to creation of a special process Ac that executes as follows where the code blocks INIT c, SYNC and CONSOLIDATE are shown in ﬁgure 2. INIT_c; while (true) { SYNC; CONSOLIDATE;}

Note on Ac: It is a special process that exists until the program terminates. It acts as a point of contact for processes in the case of explicit registration through c.register, as seen below, without introducing any centralized control.

• next: A process Ai already in phase synchronization performs a next for barrier synchronization. A next corresponds to:

SYNC; CONSOLIDATE;

• Registration through clocked: A process Ai can register through clocked at the time of its creation, in which case it gets into the list Aj.registered of its parent process Aj. In this case, Ai joins from the next phase. The specific code that gets executed in the parent for a clocked process is:

INIT_i; A_j.registered:=A_j.registered+A_i;

The code that gets executed in Ai is:

while (!A_i.proceed);

• Registration through c.register: If Ai registers this way, it may join the phase synchronization within at most the next two phases. The following code gets executed in Ai:

INIT_i; A_c.registered:=A_c.registered+A_i; while (!A_i.proceed);

• c.drop: Process Ai drops out of phase synchronization through c.drop (see fig. 2). The code that gets executed is:

DROP;

Note: 1) Though we assume Ac to exist throughout the program execution, our algorithm is robust with respect to graceful termination of Ac; that is, Ac may terminate after completing CONSOLIDATE when there are no processes in Ac.registered upon consolidation. The only impact on phase synchronization is that no new processes can then register through c.register. 2) The assignments are done atomically.


3.2 How the Protocol Works

The solution achieves phase synchronization by ensuring that the set of processes that enter a phase is common knowledge to all the processes. Attaining common knowledge of the existence of new processes and the non-existence of dropped processes in every phase is the non-trivial part of the phase synchronization protocol in a dynamic environment. The machinery built to solve this problem is shown in figure 2 and described below in detail.

Protocol Variables:
Ac – the special clock process for clock c.
Ai.executing – the current phase that process i is executing.
Ai.proceed – used for allowing new processes to join the active phase.
Ai.next – the next phase that process i wants to execute.
Ai.Iconcurrent – the set of processes executing the phase Ai.executing.
Ai.newIconcurrent – the set of new processes that will be part of the next phase.
Ai.registered – the set of new processes that want to enter phase synchronization with Ai.
Ai.newsynchproc – the set of new processes registered with process i that will synchronize from the next phase.
Ai.drop – when a process wants to drop (or terminates), it sets Ai.drop to true and exits.
Ai.checklist – the subset of Ai.Iconcurrent carried to the next phase for synchronization.

Ai.Iconcurrent denotes the set of processes that Ai is synchronizing with in a phase.

This set may shrink or expand after each phase, depending on whether an existing process drops or a new process joins. The variable Ai.newsynchproc is assigned the set of processes that Ai wants the other processes to include for synchronization from the next phase onwards. The variable Ai.newIconcurrent is used to accumulate the processes that will form Ai.Iconcurrent in the next phase. Ai.executing denotes the current phase of Ai, and Ai.next denotes the next phase that the process will move to.

INIT c: This initializes the special clock process Ac that is started at the creation of a clock c. Note that the clock process is initialized to synchronize with itself as the initial set of processes. Ac.proceed is set to true to start the synchronization.

INIT i: When a process registers with a clock, Ai.proceed is set to false. The newly registered process waits for Ai.proceed to be made true, which is done in the CONSOLIDATE block of the process that contains Ai in its registered set. The remaining variables are also set appropriately in this CONSOLIDATE block. In the following, we explain the protocol for SYNC and CONSOLIDATE.

SYNC: This is the barrier synchronization stage of a process and performs the following main functions: 1) it checks whether all the processes in the phase are ready to move to the next phase; Ai.next is used to signal completion of the phase and to check for completion by the others in the phase; 2) it informs the other processes about the new processes that have to join from the next phase; 3) it establishes whether the processes have exchanged the relevant information, so that it can consolidate the information required for the execution of the next phase.
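As a reading aid, the per-process state can be transcribed almost verbatim into a record-like Java class (a sketch: the representation of process identities as plain ints, and the use of HashSet for the set-valued variables, are our assumptions, not part of the protocol):

```java
import java.util.HashSet;
import java.util.Set;

// Per-process state of the single-clock protocol, transcribed from the
// variable list above. Process identities are modelled as plain ints here.
class ProcState {
    int executing = 0;                       // current phase being executed
    int next = 1;                            // next phase the process wants to enter
    volatile boolean proceed = false;        // set true to release a newly joined process
    volatile boolean drop = false;           // set true on c.drop
    Set<Integer> Iconcurrent = new HashSet<>();     // processes synchronized with in this phase
    Set<Integer> newIconcurrent = new HashSet<>();  // accumulates members of the next phase
    Set<Integer> registered = new HashSet<>();      // newly registered, not yet admitted
    Set<Integer> newsynchproc = new HashSet<>();    // snapshot of 'registered' taken in SYNC
    Set<Integer> checklist = new HashSet<>();       // members still to confirm phase completion

    // INIT_c from Fig. 2: the clock process starts ready, synchronizing with itself.
    static ProcState initClock(int clockId) {
        ProcState c = new ProcState();
        c.Iconcurrent.add(clockId);
        c.proceed = true;
        return c;
    }
}

public class StateDemo {
    public static void main(String[] args) {
        ProcState c = ProcState.initClock(0);
        System.out.println(c.executing + " " + c.next + " " + c.proceed + " " + c.Iconcurrent.size());
    }
}
```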

Distributed Generalized Dynamic Barrier Synchronization


The new processes that are registered with Ai form the set Ai.newsynchproc. This step is required to capture the local snapshot. Note that for processes other than the clock process, Ai.registered will be the same as Ai.newsynchproc. However, for the special clock process Ac, Ac.registered may keep changing during the SYNC execution. Therefore, we need to take a snapshot so that a consistent set of processes that have to be included from the next phase can be conveyed to the other processes present in the synchronization. The increment of Ai.next denotes that the process has effectively completed the phase and is preparing to move to the next one. Note that after this operation the difference between Ai.next and Ai.executing becomes 2, denoting the transition. The second part of SYNC is a do-od loop that forms the crux of barrier synchronization. There are three guarded commands in this loop, explained below.
1. The first guarded command checks whether there exists a process j in Ai.Iconcurrent that has also reached barrier synchronization. If the value of Aj.next is greater than or equal to Ai.next, then Aj has also reached the barrier point. If this guard evaluates to true, that process is removed from Ai.Iconcurrent, and the new processes that registered with Aj are added to the set Ai.newIconcurrent.
2. The second guard checks whether any process in Ai.Iconcurrent has dropped out of synchronization, and the set Ai.newIconcurrent is updated accordingly.
3. The third guard is true if process j has not yet reached the barrier synchronization point. The statement associated with this guard is a no-op; it is this statement that forms the waiting part of barrier synchronization.
By the end of this loop, Ai.Iconcurrent contains only Ai. The current phase, denoted by Ai.executing, is incremented to denote that the process can start the next phase.
However, to ensure that the local snapshot captured in Ai.newsynchproc is properly conveyed to the other processes participating in phase synchronization, another do-od loop is executed that checks whether the processes have indeed moved to the next phase by incrementing Ai.executing.

CONSOLIDATE: After ensuring that Ai has synchronized on the barrier, a final round of consolidation is performed to prepare Ai for executing the next phase. This phase consolidation is described under the label CONSOLIDATE. The processes with which Ai needs to phase synchronize are in Ai.newIconcurrent; therefore, Ai.Iconcurrent is assigned the value of Ai.newIconcurrent. All the new processes that will join from phase Ai.executing are signalled to proceed after being properly initialized. The set Ai.registered is updated so that it contains only those new processes that registered after the value of Ai.registered was last read in SYNC. This is possible because of the explicit registration that is allowed through the special clock process.

DROP: Ai.drop is set to true so that the corresponding guarded command in SYNC can become true. The restriction placed on the drop command ensures that Ai.registered will be empty, and thus the starvation-freedom guarantee is preserved.


Protocol for Process i

INIT c: (* Initialization of clock process *)
  Ac.executing, Ac.next, Ac.Iconcurrent, Ac.registered, Ac.proceed, Ac.drop := 0, 1, {Ac}, ∅, true, false;

INIT i: (* Initialization of Ai that performs a registration *)
  Ai.proceed := false;

SYNC: (* Check completion of Ai.executing by other members *)
  Ai.newsynchproc := Ai.registered;
  Ai.newIconcurrent := Ai.newsynchproc + {Ai};
  Ai.next := Ai.next + 1;
  Ai.checklist := ∅;
  do
      Ai.Iconcurrent ≠ ∅ ∧ Aj ∈ Ai.Iconcurrent ∧ i ≠ j ∧ Ai.next ≤ Aj.next →
          Ai.Iconcurrent := Ai.Iconcurrent − {Aj};
          Ai.newIconcurrent := Ai.newIconcurrent + Aj.newsynchproc + {Aj};
          Ai.checklist := Ai.checklist + {Aj}
   [] Ai.Iconcurrent ≠ ∅ ∧ Aj.drop →
          Ai.Iconcurrent := Ai.Iconcurrent − {Aj}
   [] Ai.Iconcurrent ≠ ∅ ∧ Aj ∈ Ai.Iconcurrent ∧ Ai.next > Aj.next (* no need to check i ≠ j *) →
          skip
  od;
  Ai.executing := Ai.executing + 1; (* Set the current phase *)
  do (* Check for completion of phase in other processes *)
      Ai.checklist ≠ ∅ ∧ Aj ∈ Ai.checklist ∧ Ai.executing == Aj.executing →
          Ai.checklist := Ai.checklist − {Aj}
  od;

CONSOLIDATE: (* Consolidate processes for the next phase *)
  Ai.Iconcurrent := Ai.newIconcurrent;
  Ai.registered := Ai.registered − Ai.newsynchproc;
  for all Aj ∈ Ai.newsynchproc do
      Aj.executing, Aj.next, Aj.Iconcurrent, Aj.registered, Aj.drop := Ai.executing, Ai.next, Ai.Iconcurrent, ∅, false;
      Aj.proceed := true;

DROP: (* Code when process Ai calls c.drop *)
  Ai.proceed := false;
  Ai.drop := true;

Fig. 2. Action of processes in phase synchronization

4 Correctness of the Solution

The proof obligations for synchronization and progress are given below. We have provided the proof in a semi-formal way in the style of [1]; it is available in the longer version of the paper [22]. The symbol '→' denotes leads-to.

– Synchronization: We need to show that the postcondition of SYNC; CONSOLIDATE; (corresponding to barrier synchronization) for processes that have proceed set to true is:
{∀i, j ((Ai.proceed = true ∧ Aj.proceed = true) ⇒ Ai.executing = Aj.executing)}


– Progress

Property 1: The progress property for processes already in phase synchronization is given by the following (k denotes the current phase): if all the processes have completed phase k, then each process that does not drop out moves to a phase greater than k.
P1: {∀i ((Ai.drop = false ∧ ∀j (Aj.drop = false ⇒ Aj.executing ≥ k)) → Ai.executing ≥ k + 1)}

Property 2: The progress property for new processes that want to join the phase synchronization is given by the following: a process that registers with a process involved in phase synchronization will also join the phase synchronization.
P2: {∀i ((Ai.proceed = false ∧ ∃j (i ∈ Aj.registered)) → Ai.proceed = true)}

Complexity Analysis: In its simplest form, the protocol has a remote message complexity of O(n²), where n is an upper bound on the number of processes that can participate in the barrier synchronization in any phase. This bound can be improved in practice by optimizing how each participating process tests for the completion of a phase. The optimization is briefly explained in the following. When a process Ai checks for completion of the phase in another process, say Aj, and finds that Ai.executing < Aj.executing, it can leave the do-od loop immediately by copying Aj.newIconcurrent, which holds the complete information about the processes participating in the next phase. This optimization has a best case of O(n) messages and is straightforward to embed in the proposed protocol. Note that in any case, at least n messages are always required to propagate the information.

5 Generalization: Multi-clock Phase Synchronization

In this section, we outline the generalization of distributed dynamic barrier synchronization to multiple clocks. 1) There is a special clock process for each clock. 2) The processes maintain protocol variables for each of the clocks that they register with. 3) A process can register with multiple clocks through C.register, where C denotes a set of clocks, as the first operation before starting phase-synchronized computation. The notation C.register denotes that c.register is performed for every clock c ∈ C. The corresponding code is:
  for each c in C
    INIT_i_c;
    A_c.registered := A_c.registered + A_i;
    while (!A_i_c.proceed);

Some important restrictions to avoid deadlock scenarios are: i) C.register, when C contains more than one clock, can only be done by the process that creates the clocks contained in C.


ii) If a process wants to register with a single clock c that is in use for phase synchronization by other processes, it has to drop all its clocks, after which it can use c.register to synchronize on the desired clock. Note that the clock c need not be re-created.
iii) Subsequent child processes should use clocked to register with any subset of the multiple clocks that the parent is registered with. Combined with (iv) below, this avoids deadlock scenarios of the kind encountered with mobile barriers [20].
iv) For synchronization, the process increments the value of Ai.next for each registered clock, and then executes the guarded loop for each of the clocks before it can move to the CONSOLIDATE stage.

The SYNC part of the protocol for multiple clocks is very similar to the single-clock case, except for an extra loop that runs the guarded-command loop for each of the clocks. A process clocked on multiple clocks causes synchronization of all the processes that are registered with these clocks. This is evident from the second do-od loop in the SYNC part of the barrier synchronization protocol. For example, if a process A1 is synchronizing on c1, A2 on c1 and c2, and A3 on c2, then A1 and A3 also get synchronized as long as A2 does not drop one or both of the clocks. These clocks can thus be thought of as forming a group, and A2 can be thought of as a pivot process. In the following, we state the synchronization guarantees provided by the protocol: 1) A process that synchronizes on multiple clocks can move to the next phase only when all the processes in the group formed by the clocks have also completed their current phase. 2) Two clock groups that do not have a common pivot process but have a common clock may differ in phase by at most one. The difference cannot exceed one, because that would imply improper synchronization between processes clocked on the same clock, which has been proved above to be impossible in our protocol.
3) A new process registered with multiple clocks starts in the next phase (from the perspective of a local observer) w.r.t. each of the clocks individually.
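The pivot example (A1 on c1, A2 on c1 and c2, A3 on c2) can again be mimicked with java.util.concurrent.Phaser as a shared-memory stand-in for clocks; this illustrates only the grouping effect, not the distributed protocol itself, and the phase count is an invented parameter. Because the pivot must complete both barriers in every phase, the two clock groups stay within one phase of each other, matching guarantee 2 above.

```java
import java.util.concurrent.Phaser;

public class PivotDemo {
    public static void main(String[] args) throws InterruptedException {
        Phaser c1 = new Phaser(2);   // parties: A1 and the pivot A2
        Phaser c2 = new Phaser(2);   // parties: A3 and the pivot A2
        final int PHASES = 5;

        Thread a1 = new Thread(() -> {
            for (int p = 0; p < PHASES; p++) c1.arriveAndAwaitAdvance();
        });
        Thread a3 = new Thread(() -> {
            for (int p = 0; p < PHASES; p++) c2.arriveAndAwaitAdvance();
        });
        // The pivot synchronizes on both clocks each phase, so the two
        // clock groups advance together as long as the pivot stays registered.
        Thread a2 = new Thread(() -> {
            for (int p = 0; p < PHASES; p++) {
                c1.arriveAndAwaitAdvance();
                c2.arriveAndAwaitAdvance();
            }
        });
        a1.start(); a2.start(); a3.start();
        a1.join(); a2.join(); a3.join();
        System.out.println("c1 phase=" + c1.getPhase() + ", c2 phase=" + c2.getPhase());
    }
}
```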

6 Comparing with Other Dynamic Barrier Schemes

The clock syntax resembles that of X10 but differs in the joining policy. An X10 activity that registers with a clock in some phase starts the synchronization from that same phase. The advantage of our dynamic joining policy (starting from the next phase) is that when a process starts a phase, it knows exactly which processes it is synchronizing with in that phase. This makes it simpler to detect the completion of a phase in a distributed set-up. Whether to join in the same phase or the next phase is more a matter of semantics than of expressiveness. If there is a centralized manager process to manage phase synchronization, then the semantics of starting a newly registered activity in the same phase is feasible. However, for a distributed phase synchronization protocol with dynamically joining processes, the semantics of starting from the next phase is more efficient.


Other clock-related works [19], [18] are directed more towards efficient implementations of X10-like clocks than towards synchronization in a distributed setting. Barriers in JCSP [21] and occam-pi [20] do allow processes to dynamically join and resign from barrier synchronization. Because the synchronization is barrier-specific in both JCSP (using barrier.sync()) and occam-pi (using SYNC barrier), it is a burden on the programmer to write a deadlock-free program; this is not the case here, as the use of next achieves synchronization over all registered clocks. JCSP and occam-pi barriers achieve linear-time synchronization due to centralized control of barriers, which is also possible in the optimized version of our protocol. Previous work on barrier implementation has focused on algorithms that work on a pre-specified number of processes or processors. The butterfly barrier [9], the dissemination algorithm [10], [5] and the tournament algorithm [5], [4] are some of the earlier algorithms. Most of them emphasized how to reduce the number of messages that need to be exchanged in order to learn that all the processes have reached the barrier. Some of the more recent work on barrier algorithms in software is described in [6], [11], [12], [15], [14], [3]. In contrast to this literature, our focus has been on developing algorithms for barrier synchronization where processes dynamically join and drop out; thus, the processes that can take part in a barrier synchronization need not be known a priori.

7 Conclusions

In this paper, we have described a solution for distributed dynamic phase synchronization that is shown to satisfy the properties of progress and starvation freedom. To our knowledge, this is the first dynamic distributed multi-processor synchronization algorithm for which the properties of progress and starvation freedom have been established, and for which the dependence of progress on the entry strategies (captured through process registration) has been shown. A future direction is to consider fault tolerance in the context of distributed barrier synchronization for a dynamic number of processes.

References
1. Chandy, K.M., Misra, J.: Parallel Program Design: A Foundation. Addison-Wesley, Reading (1988)
2. Dijkstra, E.W.: Guarded commands, non-determinacy and formal derivation of programs. Communications of the ACM 18(8) (August 1975)
3. Mellor-Crummey, J.M., Scott, M.L.: Algorithms for scalable synchronization on shared-memory multiprocessors. ACM TOCS 9(1), 21–65 (1991)
4. Feldmann, A., Gross, T., O'Hallaron, D., Stricker, T.M.: Subset barrier synchronization on a private-memory parallel system. In: SPAA (1992)
5. Hensgen, D., Finkel, R., Manber, U.: Two algorithms for barrier synchronization. International Journal of Parallel Programming 17(1), 1–17 (1988)
6. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann, San Francisco (2008)


7. Misra, J.: Phase Synchronization. Notes on UNITY 12-90, The University of Texas at Austin (1990)
8. Tang, P., Yew, P.C.: Processor self-scheduling for multiple-nested parallel loops. In: Proc. ICPP, pp. 528–535 (August 1986)
9. Brooks III, E.D.: The butterfly barrier. International Journal of Parallel Programming 15(4) (1986)
10. Han, Y., Finkel, R.: An optimal scheme for disseminating information. In: Proc. 17th International Conference on Parallel Processing (1988)
11. Scott, M.L., Michael, M.M.: The Topological Barrier: A Synchronization Abstraction for Regularly-Structured Parallel Applications. Tech. Report TR605, Univ. of Rochester (1996)
12. Gupta, R., Hill, C.R.: A scalable implementation of barrier synchronization using an adaptive combining tree. International Journal of Parallel Programming 18(3) (1990)
13. Livesey, M.: A Network Model of Barrier Synchronization Algorithms. International Journal of Parallel Programming 20(1) (February 1991)
14. Xu, H., McKinley, P.K., Ni, L.M.: Efficient implementation of barrier synchronization in wormhole-routed hypercube multicomputers. In: Proc. 12th International Conference on Distributed Computing Systems (June 1992)
15. Yang, J.-S., King, C.-T.: Designing Tree-Based Barrier Synchronization on 2D Mesh Networks. IEEE Transactions on Parallel and Distributed Systems 9(6) (1998)
16. Saraswat, V., Jagadeesan, R.: Concurrent clustered programming. In: Abadi, M., de Alfaro, L. (eds.) CONCUR 2005. LNCS, vol. 3653, pp. 353–367. Springer, Heidelberg (2005)
17. Unified Parallel C Language, http://www.gwu.edu/~upc/
18. Shirako, J., Peixotto, M.D., Sarkar, V., Scherer, W.N.: Phasers: a unified deadlock-free construct for collective and point-to-point synchronization. In: ICS, pp. 277–288 (2008)
19. Vasudevan, N., Tardieu, O., Dolby, J., Edwards, S.A.: Compile-Time Analysis and Specialization of Clocks in Concurrent Programs. In: CC, pp. 48–62 (2009)
20. Welch, P., Barnes, F.: Mobile Barriers for occam-pi: Semantics, Implementation and Application. In: Communicating Process Architectures (2005)
21. Welch, P., Brown, N., Moores, J., Chalmers, K., Sputh, B.H.C.: Integrating and Extending JCSP. In: CPA, pp. 349–370 (2007)
22. Agarwal, S., Joshi, S., Shyamasundar, R.K.: Distributed Generalized Dynamic Barrier Synchronization, longer version, http://www.tcs.tifr.res.in/~shyam/Papers/dynamicbarrier.pdf

A High-Level Framework for Distributed Processing of Large-Scale Graphs Elzbieta Krepska, Thilo Kielmann, Wan Fokkink, and Henri Bal VU University Amsterdam {ekr,kielmann,wanf,bal}@cs.vu.nl

Abstract. Distributed processing of real-world graphs is challenging due to their size and the inherent irregular structure of graph computations. We present HipG, a distributed framework that facilitates high-level programming of parallel graph algorithms by expressing them as a hierarchy of distributed computations executed independently and managed by the user. HipG programs are in general short and elegant; they achieve good portability, memory utilization and performance.

1 Introduction

We live in a world of graphs. Some graphs exist physically, for example transportation networks or power grids. Many exist solely in electronic form, for instance a state space of a computer program, the network of Wikipedia entries, or social networks. Graphs such as protein interaction networks in bioinformatics or airplane triangulations in engineering are created by scientists to represent real-world objects and phenomena. With the increasing abundance of large graphs, there is a need for a parallel graph processing language that is easy to use, high-level, and memory- and computation-efficient. Real-world graphs reach billions of nodes and keep growing: the World Wide Web expands, new proteins are being discovered, and ever more complex programs need to be verified. Consequently, graphs need to be partitioned between the memories of multiple machines and processed in parallel in such a distributed environment. Real-world graphs tend to be sparse: for instance, the number of links in a web page is small compared to the size of the network. This allows for efficient storage of edges with their source nodes. Because of their size, partitioning graphs into balanced fragments with a small number of edges spanning different fragments is hard [1, 2]. Parallelizing graph algorithms is challenging. The computation is typically driven by a node-edge relation in an unstructured graph. Although the degree of parallelism is often considerable, the amount of computation per graph node is generally very small, and the communication overhead immense, especially when many edges span different graph chunks. Given the lack of structure of the computation, the computation is hard to partition and locality suffers [3]. In addition, on a distributed-memory machine good load balancing is hard to obtain, because in general work cannot be migrated (part of the graph would have to be migrated and all workers informed).
M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 155–166, 2011. © Springer-Verlag Berlin Heidelberg 2011

While for sequential graph algorithms a few graph libraries exist, notably the Boost Graph Library [4], for parallel graph algorithms no standards have been established. The current state of the art amongst users wanting to implement parallel graph algorithms


is to either use the generic C++ Parallel Boost Graph Library (PBGL) [5, 6] or, most often, to create ad-hoc implementations, which are usually structured around their communication scheme. Not only does the ad-hoc coding effort have to be repeated for each new algorithm, but it also obscures the original elegant concept. The programmer spends considerable time tuning the communication, which is prone to errors. While this may result in a highly optimized, problem-tailored implementation, the code can only be maintained or modified with substantial effort. In this paper we propose HipG, a distributed framework aimed at facilitating implementations of HIerarchical Parallel Graph algorithms that operate on large-scale graphs. HipG offers an interface to perform structure-driven distributed graph computations. Distributed computations are organized into a hierarchy and coordinated by logical objects called synchronizers. The HipG model supports, but is not limited to, creating divide-and-conquer graph algorithms. A HipG parallel program is composed automatically from the sequential-like components provided by the user. The computational model of HipG, and how it can be used to program graph algorithms, is explained in Section 2, where we present three graph algorithms in increasing order of complexity: reachability search, finding single-source shortest paths, and strongly connected components decomposition. These are well-known algorithms, explained for example in [7]. Although the user must be aware that a HipG program runs in a distributed environment, the code is high-level: explicit communication is not exposed by the API. Parallel composition is done in a way that does not allow race conditions, so that no locks or thread synchronization code are necessary from the user's point of view. These facts, coupled with the use of an object-oriented language, make for an easy-to-use but expressive language to code hierarchical parallel graph algorithms. We have implemented HipG in the Java language. We discuss this choice as well as details of the implementation in Section 3. Using HipG we have implemented the algorithms presented in Section 2 and we evaluate their performance in Section 4. We processed graphs on the order of 10^9 nodes on our cluster and obtained good performance. The HipG code of the most complex example discussed in this paper, the strongly connected components decomposition, is an order of magnitude shorter than the hand-written C/MPI version of this program and three times shorter than the corresponding implementation in PBGL. See Section 5 for a discussion of related work in the field of distributed graph processing. HipG's current limitations and future work are discussed in the concluding Section 6.

2 The HipG Model and Programming Interface

The input to a HipG program is a directed graph. HipG partitions the graph into a number of equal-size chunks and divides the chunks between workers, which are made responsible for processing the nodes they own. A chunk consists of a number of nodes uniquely identified by pairs (chunk, index). HipG uses the object-oriented programming paradigm: nodes are objects. Each node has arbitrary data and a number of outgoing edges associated and co-located with it. The target node of an edge is called a neighbor. In the current setup, the graph cannot be modified at runtime, but new graphs can be created.

interface MyNode extends Node {
  public void visit();
}
class MyLocalNode extends LocalNode<MyNode> implements MyNode {
  boolean visited = false;
  public void visit() {
    if (!visited) {
      visited = true;
      for (MyNode n : neighbors())
        n.visit();
    }
  }
}

Fig. 1. Reachability search in HipG


Fig. 2. Illustration of the reachability search: node s receives visit() and sends it to its neighbors; the neighbors forward the message to their neighbors.

Graphs are commonly processed starting at a certain graph node and following the structure of the graph, i.e., the node-edge relationship, until all reached nodes are processed. HipG allows graphs to be processed this way by offering a seamless interface for executing methods on local and remote nodes. When necessary, these method calls are automatically translated by HipG into messages. In Section 2.1 we show how such methods can be used to create a distributed graph computation in HipG. More complex algorithms require managing more than one such distributed computation. In particular, the objective of a divide-and-conquer graph algorithm is to divide a computation on a graph into several sub-computations on sub-graphs. HipG enables the creation of sub-algorithms by introducing synchronizers: logical objects that manage distributed computations. The concept and API of a synchronizer are explained further in this section: in Section 2.2 we show how to use a single synchronizer, and in Section 2.3 an entire hierarchy of synchronizers is created to solve a divide-and-conquer graph problem.

2.1 Distributed Computation

HipG allows graph computations to be implemented with nothing but regular methods executed on graph nodes. Typically, the user initiates the first method call, which in turn executes methods on its neighbor nodes. In general, a node can execute methods on any node whose unique identifier is known. To implement a graph computation, the user extends the provided LocalNode class with custom fields and methods. In a local node, neighbor nodes can be accessed with neighbors(), or with inNeighbors() for incoming edges. Under the hood, methods executing on remote nodes are automatically translated by HipG into asynchronous messages. On reception of such a message, the appropriate method is executed, which thus acts as a message handler. The order of received messages cannot be predicted. Method parameters are automatically serialized, and we strive to make the serialization efficient. A distributed computation terminates when there are no more messages present in the system, which is detected automatically. Since


interface MyNode extends Node {
  public void found(SSSP sp, int d);
}
class MyLocalNode extends LocalNode<MyNode> implements MyNode {
  int dist = -1;
  public void found(SSSP sp, int d) {
    if (dist < 0) {
      dist = d;
      sp.Q.add(this);
    }
  }
  public void found0(SSSP sp, int d) {
    for (MyNode n : neighbors())
      n.found(sp, d);
  }
}

class SSSP extends Synchronizer {
  Queue<MyLocalNode> Q = new Queue();
  int localQsize;
  public SSSP(MyLocalNode pivot) {
    if (pivot != null) Q.add(pivot);
    localQsize = Q.size();
  }
  @Reduce public int GlobalQSize(int s) {
    return s + Q.size();
  }
  public void run() {
    int depth = 0;
    do {
      for (int i = 0; i < localQsize; i++)
        Q.pop().found0(this, depth);
      barrier();
      depth++;
      localQsize = Q.size();
    } while (GlobalQSize(0) > 0);
  }
}

Fig. 3. Single-source shortest paths (breadth-first search) implemented in HipG

messages are asynchronous, returning a value from a method can be realized by sending a message back to the source. Typically, however, a dedicated mechanism, discussed later in this section, is used to compute the result of a distributed computation.

Example: Reachability search. In a directed graph, a node s is reachable from a node t if a path from t to s exists. Reachability search computes the set of nodes reachable from a given pivot. A reachability search implemented in HipG (Fig. 1) consists of an interface MyNode that represents any node and a local node implementation MyLocalNode. The visit() method visits a node and its neighbors (Fig. 2). The algorithm is initiated by pivot.visit(). We note that, were it not for the unpredictable order of method executions, the code for visit() could be understood sequentially. In particular, no locks or synchronization code are needed.

2.2 Coordination of Distributed Computations

A dedicated layer of a HipG algorithm coordinates the distributed computations. Its main building block is a synchronizer, a logical object that manages distributed computations. A synchronizer can initiate a distributed computation and wait for its termination. After a distributed computation has terminated, the synchronizer typically computes global results of the computation by invoking a global reduction operation. For example, the synchronizer may compute the global number of nodes reached by the computation, or a globally elected pivot. Synchronizers can execute distributed computations in parallel or one after another.
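The message-driven reading of the reachability example can be mimicked in plain Java (this is not the HipG API): each pending visit() call is modelled as a message in a queue, and, as in the termination-detection rule above, the computation finishes when no messages remain. The graph and node numbering below are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class VisitQueue {
    public static void main(String[] args) {
        // Adjacency lists of a small directed graph (node 3 is unreachable from 0).
        List<List<Integer>> adj = List.of(
            List.of(1, 2), List.of(2), List.of(0), List.of(1));
        boolean[] visited = new boolean[adj.size()];

        // Each pending visit() call is modelled as a message in a queue;
        // the computation terminates when no messages remain.
        Deque<Integer> messages = new ArrayDeque<>();
        messages.add(0);                        // pivot.visit()
        while (!messages.isEmpty()) {
            int node = messages.poll();         // deliver one message
            if (!visited[node]) {
                visited[node] = true;
                messages.addAll(adj.get(node)); // visit() sent to each neighbor
            }
        }
        int reached = 0;
        for (boolean v : visited) if (v) reached++;
        System.out.println("reached " + reached + " of " + adj.size() + " nodes");
    }
}
```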

FB(V):
  p = pick a pivot from V
  F = FWD(p)
  B = BWD(p)
  Report (F ∩ B) as SCC
  In parallel:
    FB(F \ B)
    FB(B \ F)
    FB(V \ (F ∪ B))

Fig. 4. FB: a divide-and-conquer algorithm to search for SCCs

To implement a synchronizer, the user subclasses Synchronizer and defines a run() method that, conceptually, executes sequentially on all processors. Termination detection is provided by barrier(). The reduce methods, annotated @Reduce, must be commutative, as the order in which they are executed cannot be predicted.

Example: Single-source shortest paths. Fig. 3 shows an implementation of a parallel single-source shortest-paths algorithm. For simplicity, each edge has equal weight, so that the algorithm is in fact a breadth-first search [7]. We define an SSSP synchronizer, which owns a queue Q that represents the current layer of graph nodes. The run() method loops over all nodes in the current layer to create the next layer. The barrier blocks until the current layer is entirely processed. GlobalQSize computes the global size of Q by summing the sizes of the queues Q on all processors. The algorithm terminates when all layers have been processed.

2.3 Hierarchical Coordination of Distributed Computations

The key idea of the HipG coordination layer is that synchronizers can spawn any number of sub-synchronizers to solve graph sub-problems. The coordination layer is therefore a tree of executing synchronizers, and thus a hierarchy of distributed algorithms. All synchronizers execute independently and in parallel. The order in which synchronizers progress cannot be predicted, unless they are causally related or explicitly synchronized. The user starts a graph algorithm by spawning root synchronizers. The system terminates when all synchronizers terminate. The HipG parallel program is composed automatically from the two components provided by the user, namely node methods (message handlers) and the synchronizer code (coordination layer). Parallel composition is done in a way that does not allow race conditions. No explicit communication or thread synchronization is needed.

Example: Strongly connected components.
A strongly connected component (SCC) of a directed graph is a maximal set of nodes S such that there exists a path in S between any pair of nodes in S. In Fig. 4 we describe FB [8], a divide-and-conquer graph algorithm for computing SCCs. FB partitions the problem of finding SCCs of a set of nodes V into three sub-problems on three disjoint subsets of V. First an arbitrary pivot node is selected from V. Two sets F and B are computed as the sets of nodes that are, respectively, forward and backward reachable from the pivot. The set F ∩ B is an SCC.

160

E. Krepska et al.

interface MyNode extends Node {
  public void fwd(FB fb, int f, int b);
  public void bwd(FB fb, int f, int b);
}

class MyLocalNode extends LocalNode<MyNode> implements MyNode {
  int labelF = -1, labelB = -1;

  public void fwd(FB fb, int f, int b) {
    if (labelF == fb.ff && (labelB == b || labelB == fb.bb)) {
      labelF = f; fb.F.add(this);
      for (MyNode n : neighbors()) n.fwd(fb, f, b);
    }
  }

  public void bwd(FB fb, int f, int b) {
    if (labelB == fb.bb && (labelF == f || labelF == fb.ff)) {
      labelB = b; fb.B.add(this);
      for (MyNode n : inNeighbors()) n.bwd(fb, f, b);
    }
  }
}

class FB extends Synchronizer {
  Queue<MyLocalNode> V, F, B;
  int ff, bb;

  FB(int f, int b, Queue<MyLocalNode> V0) {
    V = V0; F = new Queue(); B = new Queue();
    ff = f; bb = b;
  }

  @Reduce MyNode SelectPivot(MyNode p) {
    return (p == null && !V.isEmpty()) ? V.pop() : null;
  }

  public void run() {
    MyNode pivot = SelectPivot(null);
    if (pivot == null) return;
    int f = 2 * getId(), b = f + 1;
    if (pivot.isLocal()) {
      pivot.fwd(this, f, b);
      pivot.bwd(this, f, b);
    }
    barrier();
    spawn(new FB(f, bb, F.filterB(b)));
    spawn(new FB(ff, b, B.filterF(f)));
    spawn(new FB(f, b, V.filterFuB(f, b)));
  }
}

Fig. 5. Implementation of the FB algorithm in HipG

All SCCs remaining in V must be entirely contained either within F \ B, within B \ F, or within the complement set V \ (F ∪ B). The HipG implementation of the FB algorithm is displayed in Fig. 5. FB creates the subsets F and B of V by executing forward and backward reachability searches from a global pivot. Each set is labeled with a unique pair of integers (f, b). FB spawns three sub-synchronizers to solve the sub-problems on F \ B, B \ F and V \ (F ∪ B). We note that the algorithm in Fig. 5 reflects the original elegant algorithm in Fig. 4. The entire HipG program is 113 lines of code, while a corresponding C/MPI application (see Section 4) has over 1700 lines, and the PBGL implementation has 341 lines.

3 Implementation

HipG is designed to execute in a distributed-memory environment. We chose to implement it in Java because of the portability, the performance (due to just-in-time compilation) and the excellent software support of the language, although Java required us to carefully ensure that memory is utilized efficiently. We used the Ibis [9] message-passing communication library and the Java 6 virtual machine implemented by Sun [10].

An input graph is partitioned into equal-size chunks, meaning that each chunk contains a similar number of nodes and edges (currently, minimization of the number of edges spanning different chunks is not taken into account). Each worker stores one chunk in the form of an array of nodes. Outgoing edges are not stored within the node object, as that would be impractical due to memory overhead (in 64-bit HotSpot this overhead is 16 B per object). As a compromise, nodes are objects but edges are not—rather, all edges are stored in a single large integer array. We note that, although this structure is not elegant, it is transparent to the user, unless explicitly requested, e.g. when the program needs to be highly optimized. In addition, as most of the worker's memory is used to store the graph, we tuned the garbage collector to use a relatively small young generation size (5–10% of the heap size).
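The "objects for nodes, one flat array for edges" layout described above can be sketched in a few lines. The class and method names below are our own illustration, not HipG's actual API:

```java
// Sketch (illustrative, not HipG's API): per-chunk adjacency stored in one
// shared int array, so edges carry no per-object header overhead.
final class GraphChunk {
    // Targets of node i occupy edges[edgeOffset[i] .. edgeOffset[i+1]).
    final int[] edgeOffset; // length = number of nodes + 1
    final int[] edges;      // all outgoing edges of the chunk, concatenated

    GraphChunk(int[] edgeOffset, int[] edges) {
        this.edgeOffset = edgeOffset;
        this.edges = edges;
    }

    int degree(int node) {
        return edgeOffset[node + 1] - edgeOffset[node];
    }

    int neighbor(int node, int j) {
        return edges[edgeOffset[node] + j];
    }
}
```

With 16 B of header per object on 64-bit HotSpot, storing each edge as its own object would cost several times the 4 B per edge that this flat layout needs.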

A High-Level Framework for Distributed Processing of Large-Scale Graphs

161

After reading the graph, a HipG program typically initiates root synchronizers, waits for completion, and handles the computed results. The part of the system that executes synchronizers is referred to as a worker. A worker consists of one main thread that emulates the abstraction of independently executing synchronizers by looping over an array of active synchronizers and making progress with each of them in turn. When all synchronizers have terminated, the worker returns control to the user's main program.

We describe the implementation from the synchronizer's point of view. A synchronizer is given a unique identifier, determined on spawn. At any moment a synchronizer takes one of three actions: it communicates while waiting for a distributed routine to finish; it proceeds when the distributed routine is finished; or it terminates. The bulk of a synchronizer's communication consists of messages that correspond to methods executed on graph nodes. Such messages contain the identifiers of the synchronizer, the graph, the node and the executed method, followed by the serialized method parameters. The messages are combined in non-blocking buffers and flushed repeatedly. Besides communicating, synchronizers perform distributed routines. Barriers are implemented with the distributed termination-detection algorithm by Safra [11]. When a barrier returns, no messages that belong to the synchronizer are present in the system. The reduce operation is also implemented by token traversal [12], and the result is announced to all workers.

Before a HipG program can be executed, its Java bytecode has to be instrumented. Besides optimizing object serialization with Ibis [9], the graph program is modified: methods are translated into messages, neighbor access is optimized, and synchronizers are rewritten so that no separate thread is needed per synchronizer instance. The latter is done by translating the blocking routines into a checkpoint followed by a return.
This way a worker can execute a synchronizer's run() method step by step. The instrumentation is part of the provided HipG library and must be invoked before execution; no special Java compiler is necessary. Release. More implementation details and a GPL release of HipG can be found at http://www.cs.vu.nl/~ekr/hipg.
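The effect of the "blocking call becomes a checkpoint followed by a return" rewrite can be sketched by hand as a small state machine. In HipG this transformation is applied automatically by the bytecode instrumentation; all names below are our own illustration:

```java
// Hand-written sketch of the checkpoint-and-return rewrite: instead of
// blocking inside barrier(), the synchronizer records where it stopped (pc)
// and returns, so one worker thread can step many synchronizers in turn.
abstract class SteppedSynchronizer {
    protected int pc = 0;                  // checkpoint: next step to execute
    protected boolean barrierDone = false; // set when termination detection fires
    abstract boolean step();               // returns true once terminated
}

final class TwoPhase extends SteppedSynchronizer {
    final StringBuilder trace = new StringBuilder();

    @Override
    boolean step() {
        switch (pc) {
            case 0:                        // code before the (conceptual) barrier()
                trace.append("phase1;");
                pc = 1;                    // checkpoint, then return to the worker
                return false;
            case 1:
                if (!barrierDone) return false; // barrier not yet reached: yield
                trace.append("phase2;");   // code after the barrier
                pc = 2;
                return true;               // terminated
            default:
                return true;
        }
    }
}
```

A worker's main loop then simply calls step() on every active synchronizer until all of them return true, which is exactly the "progress with them in turn" behavior described above.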

4 Memory Utilization and Performance Evaluation

In this section we report on the results of experiments conducted with HipG. The evaluation was carried out on the VU cluster of the DAS-3 system [13]. The cluster consists of 85 dual-core, dual-CPU 2.4 GHz Opteron compute nodes, each equipped with 4 GB of memory. The processors are interconnected with Myri-10G (MX) and 1G Ethernet links. The time to initialize workers and input graphs was not included in the measurements. All graphs were partitioned randomly, meaning that if a graph is partitioned into p chunks, a graph node is assigned to a chunk with probability 1/p. The portion of remote edges is thus (p−1)/p, which is very high (87–99% for the graphs used) and realistic when modeling an unfavorable partitioning (many edges spanning different chunks).

We start with the evaluation of the performance of applications that almost solely communicate (only one synchronizer spawned). Visitor, the reachability search (see Section 2.1), was started at the root node of a large binary tree directed towards the leaves. SSSP, the single-source shortest paths (breadth-first search, see Section 2.2), was started at the root node of the binary tree, and at a random node of a synthetic social


Table 1. Performance of Visitor and SSSP

Appl.     Workers  Input    Time[s]  Mem[GB]
Visitor      8     Bin-27     19.1     2.8
Visitor     16     Bin-28     21.4     2.9
Visitor     32     Bin-29     24.5     3.1
Visitor     64     Bin-29     16.9     2.1
SSSP         8     Bin-27     31.5     2.8
SSSP        16     Bin-28     38.0     3.0
SSSP        32     Bin-29     42.5     3.2
SSSP        64     Bin-29     29.8     2.4
SSSP         8     LN-80      30.8     1.3
SSSP        16     LN-160     33.7     1.5
SSSP        32     LN-320     34.6     1.7
SSSP        64     LN-640     38.5     2.0

Fig. 6. Speedup of Visitor and SSSP (speedup vs. number of processors, up to 64; curves: perfect speedup, Visitor, SSSP, SSSP-LN)

network. The results are presented in Tab. 1 and Fig. 6. We tested both applications on 8–64 processors on Myrinet. To obtain fairer results, rather than keeping the problem size constant and dividing the input into more chunks, we doubled the problem size when doubling the number of processors (Tab. 1, with the exception of Bin-30, which should have been run on 64 processors but did not fit in memory). This avoids spurious improvements due to better cache behavior and keeps the heap filled, but also avoids the many small messages that occur when the stored portion of a graph is small. We normalized the results for the speedup computation (Fig. 6).

We used binary trees, Bin-n, of height n = 27..29, which have 0.27–1.0·10^9 nodes and edges. The LN-n graphs are random directed graphs with node degrees sampled from the log-normal distribution ln N(4, 1.3), aimed to resemble real-world social networks [14,15]. An LN-n graph has n·10^5 nodes and about n·1.27·10^7 edges. We used LN-n graphs of size n = 80..640, and thus up to 64·10^6 nodes and 8·10^9 edges. In each experiment, all edges of the input graphs were visited.

Both applications achieved about 60% efficiency on a binary-tree graph on 64 processors, which is satisfactory for an application with little computation, O(n), compared to O(n) communication. The efficiency achieved by SSSP on LN-n graphs reaches almost 80%, as the input is more randomized and has a small diameter compared to a binary tree, which reduces the number of barriers performed.

To evaluate the performance of hierarchical graph algorithms written in HipG, we ran the OBFR-MP algorithm, which decomposes a graph into SCCs [16]. OBFR-MP is a divide-and-conquer algorithm like FB [8] (see Section 2.3), but processes the graph in layers. We compared the performance of OBFR-MP implemented in HipG against a highly optimized C/MPI version of this program, used for performance evaluation in [16] and kindly provided to us by the authors.
The HipG version was implemented to resemble the C/MPI version as closely as possible: the data structures used and the messages sent are the same. Here, we are not interested in the speedup of the MPI implementation of OBFR-MP, over which we have no influence. Rather, we want to see the difference in performance between an optimized C/MPI version and the HipG version of the same application. In general, we found that the HipG version was substantially faster than MPI implementations that use sockets. The detailed results are presented in Tab. 2. We used two different implementations of MPI over Myrinet: the MPICH-MX implementation provided by Myricom, which directly accesses


Table 2. Performance comparison of the OBFR-MP SCC-decomposition algorithm tested on three LmLmTn graphs. OM (OpenMPI) and P4 are socket-based MPI implementations, while the MX MPI implementation directly uses the Myrinet interface. Time is given in seconds.

L487L487T5:
 p     MX    OM(Myri)  HipG(Myri)  P4(Eth)  HipG(Eth)
 4    36.6    141.4       41.1       94.8      45.7
 8    26.6     81.6       22.1       82.5      30.0
16    96.5     60.5       48.4      179.0      37.0
32    40.0     57.3       39.1      163.4      41.0
64    24.1     46.7       24.4      234.6      41.8

L10L10T16:
 p     MX    OM(Myri)  HipG(Myri)  P4(Eth)  HipG(Eth)
 4     69      255        148        302       225
 8     73      280        226        462       330
16     89      376        315        804       506
32    136      661        485       1794       851
64    128      646        277       1659       461

L60L60T11:
 p     MX    OM(Myri)  HipG(Myri)  P4(Eth)  HipG(Eth)
 4    45.1    152.9       47.3      110.8      98.8
 8    34.5     99.8       46.8      111.5     116.0
16    37.1    128.6       60.4      216.2     125.9
32    30.1     82.0       57.4      214.7     171.8
64    32.0    108.8       66.1      311.4     141.2

the interface, and OpenMPI, which goes through TCP sockets. On Ethernet we used the standard MPI implementation (P4). We tested OBFR-MP on synthetic graphs called LmLmTn, which are in essence trees of height n of SCCs, such that each SCC is a lattice of (m+1) × (m+1) nodes. An LmLmTn graph thus has (2^(n+1) − 1) SCCs, each of size (m+1)^2. The performance of the OBFR-MP algorithm strongly depends on the SCC structure of the input graph. We used three graphs: one with a small number of large SCCs, L487L487T5; one with a large number of small SCCs, L10L10T16; and one that balances the number of SCCs and their size, L60L60T11. Each graph contains a little over 15·10^6 nodes and 45·10^6 edges.

The performance of the C/MPI application running over MX is the fastest, as it has the smallest software stack. The OpenMPI and P4 MPI implementations offer a more realistic comparison, as they use a deeper software stack (sockets), like HipG: HipG ran on average 2.2 times faster than the C/MPI version in this case. Most importantly, the speedup or slowdown of HipG follows the speedup or slowdown of the C/MPI application run over MX, which suggests that the overhead of HipG will not explode with further scaling of the application.

The communication pattern of many graph algorithms is intensive all-to-all communication. Generally, message sizes decrease as the number of processors increases. Good performance results from balancing the size of flushed messages against the frequency of flushing: too many flushes decrease performance, while too few flushes cause other processors to stall. Throughput on 32 processors over MX for the Visitor application on Bin-29 is constant (not shown): the application sends 16 GB in 24 s.

A worker's memory is divided between the graph, the communication buffers and the memory allocated by the user's code in synchronizers. On a 64-bit machine, a graph node uses 80 B in Visitor and on average 1 KB in SSSP, including the edges and all overhead. Tab. 1 presents the maximum heap size used by a Visitor/SSSP worker. As expected, it remains almost constant. SSSP uses more memory than Visitor because it stores a queue of nodes (see Section 2.2).

The results in this section do not aim to prove that we obtained the most efficient implementations of the Visitor, SSSP or OBFR-MP algorithms. When processing large-scale graphs, speedup is of secondary importance; what matters most is being able to store the graph in memory and process it in acceptable time. We aimed to show that large-scale graphs can be handled by HipG and that satisfactory performance can be obtained with little coding effort, even for complex hierarchical graph algorithms.
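Two of the figures quoted in this section can be reproduced with a few lines of arithmetic (our own sanity check, not code from HipG): the remote-edge fraction (p−1)/p under random partitioning, and the LmLmTn node counts.

```java
// Sanity checks on figures quoted in Section 4 (our own arithmetic).
public class SanityChecks {
    // Under random partitioning into p chunks, an edge's endpoints fall into
    // different chunks with probability (p-1)/p.
    static double remoteEdgeFraction(int p) {
        return (p - 1) / (double) p;
    }

    // An LmLmTn graph is a tree of height n of SCCs, each SCC being an
    // (m+1) x (m+1) lattice: (2^(n+1) - 1) SCCs of (m+1)^2 nodes each.
    static long nodes(int m, int n) {
        long sccCount = (1L << (n + 1)) - 1;
        long sccSize = (long) (m + 1) * (m + 1);
        return sccCount * sccSize;
    }

    public static void main(String[] args) {
        System.out.println(remoteEdgeFraction(8)); // 0.875 (87.5% remote edges)
        System.out.println(nodes(487, 5));         // 15003072
        System.out.println(nodes(10, 16));         // 15859591
        System.out.println(nodes(60, 11));         // 15237495
    }
}
```

All three graphs indeed come out a little over 15·10^6 nodes, and p = 8..64 gives remote-edge fractions of 87.5% to 98.4%, matching the 87–99% range quoted earlier.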


5 Related Work

HipG is a distributed framework aimed at providing users with a way to code, with little effort, parallel algorithms that operate on partitioned graphs. An analysis of platforms suitable for the execution of graph algorithms is provided in an inspiring paper by Lumsdaine et al. [3], which in fact advocates using massively multithreaded shared-memory machines for this purpose. However, such machines are very expensive and software support is lacking [3]. The library in [17] realizes this concept on a Cray machine. Another interesting alternative would be to use partitioned global address space languages like UPC [18], X10 [19] or ZPL [20], but we are not aware of support for graph algorithms in these languages, except for a shared-memory solution [21] based on X10 and Cilk.

The Bulk Synchronous Parallel (BSP) model of computation [22] alternates work and communication phases. We know of two BSP-based libraries that support the development of distributed graph algorithms: CGMgraph and Pregel. CGMgraph [23] uses the unified communication API and parallel routines offered by CGMlib, which is conceptually close to MPI [24]. In Google's Pregel [15] the graph program is a series of supersteps. In each superstep the Compute(messages) method, implemented by the user, is executed in parallel on all vertices. The system supports fault tolerance consisting of heartbeats and checkpointing. Impressively, Pregel is reported to be able to handle billions of nodes and use hundreds of workers. Unfortunately, it is not available for download. Pregel is similar to HipG in two aspects: the vertex-centered programming, and the automatic composition of the parallel program from simple, sequential-like, user-provided components. However, the repeated global synchronization phase in the Bulk Synchronous Parallel model, although suitable for many applications, is not always desirable.
HipG is fundamentally different from BSP in this respect, as it uses asynchronous messages with computation synchronized on the user's request. Notably, HipG can simulate the BSP model, as we did in the SSSP application (Section 2.2).

The prominent sequential Boost Graph Library (BGL) [4] gave rise to a parallelization that adopts a different approach to graph algorithms. Parallel BGL [5,6] is a generic C++ library that implements distributed graph data structures and graph algorithms. The main focus is to reuse existing sequential algorithms, only applying them to distributed data structures, to obtain parallel algorithms. PBGL supports a rich set of parallel graph implementations and property maps. The system keeps information about ghost (remote) vertices, although that works well only if the number of edges spanning different processors is small. Parallel BGL offers a very general model, while both Pregel and HipG trade expressiveness (for example, neither offers any form of remote read) for more predictable performance. ParGraph [25] is another parallelization of BGL, similar to PBGL, but less developed; it does not seem to be maintained. We are not aware of any work directly supporting the development of divide-and-conquer graph algorithms.

To store graphs we used the SVC-II distributed graph format advocated in [26]. Graph formats are standardized only within selected communities. For large graphs, binary formats are typically preferable to text-based formats, as compression is then not needed. See [26] for a comparison of a number of formats used in the formal-methods community. A popular text format is XML, which is used, for example, to store Wikipedia [27]. RDF [28] is used to represent semantic graphs in the form of


triples (source, edge, target). In contrast, in bioinformatics, graphs are stored in many databases, and integrating them is ongoing research [29].

6 Conclusions and Future Work

In this paper we described HipG, a model and a distributed framework that allows users to code, with little effort, hierarchical parallel graph algorithms. The parallel program is automatically composed of sequential-like components provided by the user: node methods and synchronizers, which coordinate distributed computations. We realized the model in Java and obtained short and elegant implementations of several published graph algorithms, good memory utilization and performance, as well as out-of-the-box portability.

Fault tolerance is not provided in the current implementation of HipG, as the programs that we have executed so far ran on a cluster and were not mission-critical. A checkpointing solution could be implemented, in which, when a machine fails, a new machine is requested and the entire computation is restarted from the last checkpoint. Such a solution is standard and similar to the one used in [15]. Creating a checkpoint takes somewhat more effort because of the lack of global synchronization phases in HipG. Creating a consistent image of the state space could be done either by freezing the entire computation or with a distributed snapshot algorithm running in the background, such as the one by Lai and Yang [12]. A distributed snapshot adds overhead to messages; this overhead can, however, be minimized by message combining, which HipG already performs.

HipG is work in progress. We would like to improve speedup by using better graph partitioning methods, e.g. [1]. If needed, we could implement graph modification at runtime, although in all cases that we looked at this could be solved by creating new graphs during execution, which is possible in HipG. We are currently working on providing tailored support for multicore processors and on extending the framework to execute on a grid. Currently the size of the graph that can be handled is limited by the amount of memory available.
Therefore, we are interested in whether a portion of a graph could be temporarily stored on disk without completely sacrificing efficiency [30].

Acknowledgments. We thank Jaco van de Pol, who initiated this work and provided C code, and Ceriel Jacobs for helping with the implementation.

References

1. Karypis, G., Kumar, V.: A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. J. of Par. and Distr. Computing 48(1), 71–95 (1998)
2. Feige, U., Krauthgamer, R.: A polylog approximation of the minimum bisection. SIAM Review 48(1), 99–130 (2006)
3. Lumsdaine, A., Gregor, D., Hendrickson, B., Berry, J.: Challenges in parallel graph processing. PPL 17(1), 5–20 (2007)
4. Siek, J., Lee, L.-Q., Lumsdaine, A.: The Boost Graph Library. Addison-Wesley, Reading (2002)
5. Gregor, D., Lumsdaine, A.: The parallel BGL: A generic library for distributed graph computations. In: Parallel Object-Oriented Scientific Computing (2005)


6. Gregor, D., Lumsdaine, A.: Lifting sequential graph algorithms for distributed-memory parallel computation. OOPSLA 40(10), 423–437 (2005)
7. Cormen, T., Leiserson, C., Rivest, R.: Introduction to Algorithms. MIT Press, Cambridge (1990)
8. Fleischer, L., Hendrickson, B., Pinar, A.: On identifying strongly connected components in parallel. In: Rolim, J.D.P. (ed.) IPDPS-WS 2000. LNCS, vol. 1800, pp. 505–511. Springer, Heidelberg (2000)
9. Bal, H.E., Maassen, J., van Nieuwpoort, R., Drost, N., Kemp, R., van Kessel, T., Palmer, N., Wrzesińska, G., Kielmann, T., van Reeuwijk, K., Seinstra, F., Jacobs, C., Verstoep, K.: Real-world distributed computing with Ibis. IEEE Computer 43(8), 54–62 (2010)
10. The Java SE HotSpot virtual machine. java.sun.com/products/hotspot
11. Dijkstra, E.: Shmuel Safra's version of termination detection. Circulated privately (January 1987)
12. Tel, G.: Introduction to Distributed Algorithms. Cambridge University Press, Cambridge (2000)
13. Distributed ASCI Supercomputer DAS-3, www.cs.vu.nl/das3
14. Pennock, D.M., Flake, G.W., Lawrence, S., Glover, E.J., Giles, C.L.: Winners don't take all: Characterizing the competition for links on the web. PNAS 99(8), 5207–5211 (2002)
15. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: SIGMOD, pp. 135–146 (2010)
16. Barnat, J., Chaloupka, J., van de Pol, J.: Improved distributed algorithms for SCC decomposition. In: PDMC 2007. ENTCS, vol. 198(1), pp. 63–77 (2008)
17. Berry, J., Hendrickson, B., Kahan, S., Konecny, P.: Graph software development and performance on the MTA-2 and Eldorado. At the 48th Cray Users Group meeting (2006)
18. Coarfa, C., et al.: An evaluation of global address space languages: Co-array Fortran and Unified Parallel C. In: PPoPP 2005, pp. 36–47. ACM, New York (2005)
19. Charles, P., et al.: X10: An object-oriented approach to non-uniform cluster computing. In: OOPSLA, pp. 519–538. ACM, New York (2005)
20. Chamberlain, B.L., Choi, S.-E., Lewis, E.C., Snyder, L., Weathersby, W.D., Lin, C.: The case for high-level parallel programming in ZPL. IEEE Comput. Sci. Eng. 5(3), 76–86 (1998)
21. Cong, G., Kodali, S., Krishnamoorthy, S., Lea, D., Saraswat, V., Wen, T.: Solving large, irregular graph problems using adaptive work-stealing. In: ICPP, pp. 536–545. IEEE, Los Alamitos (2008)
22. Valiant, L.: A bridging model for parallel computation. Comm. ACM 33(8), 103–111 (1990)
23. Chan, A., Dehne, F., Taylor, R.: CGMgraph/CGMlib: Implementing and testing CGM graph algorithms on PC clusters and shared memory machines. J. of HPC App. 19(1), 81–97 (2005)
24. MPI Forum: MPI: A message passing interface. J. of Supercomp. Appl. 8(3/4), 169–416 (1994)
25. Hielscher, F., Gottschling, P.: ParGraph library. pargraph.sourceforge.net (2004)
26. Blom, S., van Langevelde, I., Lisser, B.: Compressed and distributed file formats for labeled transition systems. In: PDMC 2003. ENTCS, vol. 89, pp. 68–83 (2003)
27. Denoyer, L., Gallinari, P.: The Wikipedia XML corpus. SIGIR Forum 40(1), 64–69 (2006)
28. Resource description framework, http://www.w3.org/RDF
29. Joyce, A.R., Palsson, B.O.: The model organism as a system: Integrating 'omics' data sets. Nat. Rev. Mol. Cell. Biol. 7(3), 198–210 (2006)
30. Hammer, M., Weber, M.: To store or not to store reloaded: Reclaiming memory on demand. In: Brim, L., Haverkort, B.R., Leucker, M., van de Pol, J. (eds.) FMICS 2006 and PDMC 2006. LNCS, vol. 4346, pp. 51–66. Springer, Heidelberg (2007)

Affinity Driven Distributed Scheduling Algorithm for Parallel Computations

Ankur Narang1, Abhinav Srivastava1, Naga Praveen Kumar1, and Rudrapatna K. Shyamasundar2

1 IBM Research - India, New Delhi
2 Tata Institute of Fundamental Research, Mumbai

Abstract. With the advent of many-core architectures, efficient scheduling of parallel computations for higher productivity and performance has become very important. Distributed scheduling of parallel computations on multiple places needs to follow affinity and deliver efficient space, time and message complexity. Simultaneous consideration of these factors makes affinity driven distributed scheduling particularly challenging. In this paper, we address this challenge by using a low time- and message-complexity mechanism for ensuring affinity and a randomized work-stealing mechanism within places for load balancing. This paper presents an online algorithm for affinity driven distributed scheduling of multi-place parallel computations. Theoretical analysis of the expected and probabilistic lower and upper bounds on the time and message complexity of this algorithm is provided. On well-known benchmarks, our algorithm demonstrates 16% to 30% performance gain as compared to Cilk [6] on a multi-core Intel Xeon 5570 architecture. Further, detailed experimental analysis shows the scalability of our algorithm along with efficient space utilization. To the best of our knowledge, this is the first time an affinity driven distributed scheduling algorithm has been designed and theoretically analyzed in a multi-place setup for many-core architectures.

1 Introduction

The exascale computing roadmap has highlighted efficient locality-oriented scheduling in runtime systems as one of the most important challenges (the "Concurrency and Locality" challenge [10]). Massively parallel many-core architectures have NUMA characteristics in memory behavior, with a large gap between local and remote memory latency. Unless efficiently exploited, this is detrimental to scalable performance. Languages such as X10 [9], Chapel [8] and Fortress [4] are based on the partitioned global address space (PGAS [11]) paradigm. They have been designed and implemented as part of the DARPA HPCS program for higher productivity and performance on many-core massively parallel platforms. These languages have built-in support for initial placement of threads (also referred to as activities) and data structures in the parallel program.


Footnotes: (1) A place is a group of processors with shared memory. (2) Multi-place refers to a group of places; for example, with each place an SMP (Symmetric MultiProcessor), multi-place refers to a cluster of SMPs. (3) DARPA HPCS program: www.highproductivity.org/

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 167–178, 2011. © Springer-Verlag Berlin Heidelberg 2011

168

A. Narang et al.

Therefore, locality comes implicitly with the program. The run-time systems of these languages need to provide efficient algorithmic scheduling of parallel computations with medium- to fine-grained parallelism. For handling large parallel computations, the scheduling algorithm (in the run-time system) should be designed to work in a distributed fashion. This is also imperative for scalable performance on many-core architectures. Further, the execution of the parallel computation happens in the form of a dynamically unfolding execution graph. It is difficult for the compiler to always correctly predict the structure of this graph and hence to perform correct scheduling and optimizations, especially for data-dependent computations. Therefore, in order to schedule generic parallel computations and to exploit runtime execution and data-access patterns, the scheduling should happen in an online fashion. Moreover, in order to mitigate the communication overheads in scheduling and in the parallel computation, it is essential to follow the affinity inherent in the computation. Simultaneous consideration of these factors, along with low time and message complexity, makes distributed scheduling a very challenging problem.

In this paper, we address the following affinity driven distributed scheduling problem. Given: (a) an input computation DAG (Fig. 1) that represents a parallel multithreaded computation with fine- to medium-grained parallelism. Each node in the DAG is a basic operation such as and/or/add etc., and is annotated with a place identifier that denotes where that node should be executed. Each edge in the DAG represents one of the following: (i) the spawn of a new thread, (ii) sequential flow of execution, or (iii) a synchronization dependency between two nodes. The DAG is a strict parallel computation DAG (a synchronization dependency edge represents an activity waiting for the completion of a descendant activity; details in Section 3); (b) a cluster of n SMPs (refer Fig.
2) as the target architecture on which to schedule the computation DAG. Each SMP, also referred to as a place, has a fixed number (m) of processors and memory. The cluster of SMPs is referred to as the multi-place setup. Determine: an online schedule for the nodes of the computation DAG in a distributed fashion that ensures: (a) exact mapping of nodes onto places as specified in the input DAG; (b) low space, time and message complexity of the execution.

In this paper, we present the design of a novel affinity driven, online, distributed scheduling algorithm with low time and message complexity. The algorithm assumes initial placement annotations on the given parallel computation with consideration of the load balance across places. The algorithm controls the online expansion of the computation DAG. It employs an efficient remote-spawn mechanism across places for ensuring affinity. Randomized work stealing within a place helps in load balancing. Our main contributions are:

– We present a novel affinity driven, online, distributed scheduling algorithm. This algorithm is designed for strict multi-place parallel computations.
– Using theoretical analysis, we prove that the lower bound of the expected execution time is O(max_k T_1^k/m + T_∞,n) and the upper bound is O(Σ_k (T_1^k/m + T_∞,n)), where k is a variable that denotes places from 1 to n, m denotes the number of processors per place, T_1^k denotes the execution time on a single processor for place


k, and T_∞,n denotes the execution time of the computation on n places with infinite processors on each place. Expected and probabilistic lower and upper bounds for the message complexity are also provided.
– On well-known parallel benchmarks (Heat, Molecular Dynamics and Conjugate Gradient), we demonstrate performance gains of around 16% to 30% over Cilk on multi-core architectures. Detailed analysis shows the scalability of our algorithm as well as efficient space utilization.

2 Related Work

Scheduling of dynamically created tasks for shared-memory multiprocessors is a well-studied problem. The work on Cilk [6] promoted the strategy of randomized work stealing. Here, a processor that has no work (the thief) randomly steals work from another processor (the victim) in the system. [6] proved efficient bounds on space (O(P · S_1)) and time (O(T_1/P + T_∞)) for scheduling of fully strict computations (synchronization dependency edges go from a thread only to its immediate parent thread; Section 3) on an SMP platform, where P is the number of processors, T_1 and S_1 are the time and space for sequential execution, respectively, and T_∞ is the execution time on infinitely many processors. We consider locality-oriented scheduling in distributed environments and hence are more general than Cilk.

The importance of data locality for scheduling threads motivated work stealing with data locality [1], wherein the data locality was discovered on the fly and maintained as the computation progressed. This work also explored initial placement for scheduling and provided experimental results to show the usefulness of the approach; however, affinity was not always followed, the scope of the algorithm was limited to SMP environments, and its time complexity was not analyzed. [5] analyzed the time complexity (O(T_1/P + T_∞)) for scheduling general parallel computations on SMP platforms, but does not consider locality-oriented scheduling. We consider the distributed scheduling problem across multiple places (a cluster of SMPs) while ensuring affinity, and also provide time and message complexity bounds. [7] considers work-stealing algorithms in a distributed-memory environment, with adaptive parallelism and fault tolerance. Here task migration was entirely pull-based (via a randomized work-stealing algorithm); hence it ignored affinity and also did not provide any formal proof of the resource utilization properties.
The work in [2] described a multi-place (distributed) deployment for parallel computations for which an initial-placement-based scheduling strategy is appropriate. A multi-place deployment has multiple places connected by an interconnection network, where each place has multiple processors connected as in an SMP platform. It showed that online greedy scheduling of multi-threaded computations may lead to physical deadlock in the presence of bounded space and communication resources per place. However, the computation did not always respect affinity, and no time or communication bounds were provided. Also, the aspect of load balancing was not addressed even within a place. We ensure affinity along with intra-place load balancing in a multi-place setup. We show empirically that our algorithm has efficient space utilization as well.

170

A. Narang et al.

3 System and Computation Model

The system on which the computation DAG is scheduled is assumed to be a cluster of SMPs connected by an Active Message Network (Fig. 2). Each SMP is a group of processors with shared memory; each SMP is also referred to as a place in this paper. Active Messages (AM, see footnote 5) is a low-level, lightweight RPC (remote procedure call) mechanism that supports unordered, reliable delivery of matched request/reply messages. We assume that there are n places and that each place has m processors (also referred to as workers). The parallel computation to be dynamically scheduled on the system is assumed to be specified by the programmer in languages such as X10 and Chapel. To describe our distributed scheduling algorithm, we assume that the parallel computation has a DAG (directed acyclic graph) structure and consists of nodes that represent basic operations such as and, or, not, and add. The edges between the nodes of the computation DAG (Fig. 1) represent creation of new activities (spawn edges), sequential execution flow between nodes within a thread/activity (continue edges), and synchronization dependencies between nodes (dependence edges). In this paper we refer to the parallel computation to be scheduled as the computation DAG. At a higher level, the parallel computation can also be viewed as a computation tree of activities. Each activity is a thread of execution (as in multi-threaded programs) and consists of a set of nodes (basic operations). Each activity is assigned to a specific place (the affinity specified by the programmer). Hence, such a computation is called a multi-place computation, and its DAG is referred to as a place-annotated computation DAG (Fig. 1: v1..v20 denote nodes, T1..T6 denote activities, and P1..P3 denote places).
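As a concrete illustration of this model, the computation tree of activities can be encoded as follows. This is a minimal sketch in Python, not taken from the paper; the class names and the particular node-to-activity assignment (loosely following Fig. 1) are our own.

```python
from dataclasses import dataclass, field

@dataclass
class Activity:
    """A thread of basic operations, pinned to a place by the programmer."""
    name: str                                     # e.g. "T1"
    place: int                                    # affinity, e.g. P1 -> 1
    nodes: list                                   # basic operations, e.g. ["v1", "v2"]
    children: list = field(default_factory=list)  # spawn edges to child activities

def depth(a: Activity) -> int:
    """Depth of the computation tree in activities; its maximum over the tree is Dmax."""
    return 1 + max((depth(c) for c in a.children), default=0)

# A fragment loosely following Fig. 1: T1 at P1 spawns T2 at P2 and T6 at P1.
t2 = Activity("T2", place=2, nodes=["v3", "v6", "v9", "v13"])
t6 = Activity("T6", place=1, nodes=["v15", "v16", "v17"])
t1 = Activity("T1", place=1, nodes=["v1", "v2", "v14", "v18", "v19", "v20"],
              children=[t2, t6])
```

A scheduler consuming this structure must execute every node of t2 at place 2; how that work is balanced among the workers of place 2 is the scheduler's business, but the place itself is fixed.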
Based on the structure of dependencies between the nodes in the computation DAG, the following types of parallel computations arise: (a) Fully-strict computation: dependencies exist only between the nodes of a thread and the nodes of its immediate parent thread; (b) Strict computation: dependencies exist only between the nodes of a thread and the nodes of any of its ancestor threads; (c) Terminally strict computation (Fig. 1): dependencies arise from an activity waiting for the completion of its descendants. Every dependence edge therefore goes from the last instruction of an activity to one of its ancestor activities, with the following restriction: in a subtree rooted at an activity Γr, if there exists a dependence edge from any activity in the subtree to the root activity Γr, then there cannot exist any dependence edge from the activities in the subtree to the ancestors of Γr.

The following notation is used in the paper. P = {P1, ..., Pn} denotes the set of places. {Wi1, Wi2, ..., Wim} denotes the set of workers at place Pi. S1 denotes the space required by a single-processor execution schedule. Smax denotes the size in bytes of the largest activation frame in the computation. Dmax denotes the maximum depth of the computation tree in terms of number of activities. T∞,n denotes the execution time of the computation DAG over n places with infinite processors at each place. T^k_∞ denotes the execution time for the activities assigned to place Pk using infinite processors; note that T∞,n ≤ Σ_{1≤k≤n} T^k_∞. T^k_1 denotes the time taken by a single processor for the activities assigned to place k.

5. Active Messages as defined by AM-2: http://now.cs.berkeley.edu/AM/active_messages.html

Affinity Driven Distributed Scheduling Algorithm for Parallel Computations

171

[Figures omitted. Fig. 1 shows a place-annotated computation DAG, contrasting a single place with multiple processors against multiple places with multiple processors per place: activities T1..T6 mapped to places P1..P3 (T1 @ P1, T2 @ P2, T3 @ P3, T4 @ P3, T5 @ P2, T6 @ P1), nodes v1..v20, and spawn, continue, and dependence edges. Fig. 2 shows a cluster of SMPs: each SMP node contains processing elements (PEs) with L2 caches connected over a system bus to memory, and the SMP nodes are joined by an interconnect (Active Message Network).]

Fig. 1. Place-annotated Computation DAG

Fig. 2. Multiple Places: Cluster of SMPs

4 Distributed Scheduling Algorithm

Consider a strict place-annotated computation DAG. The distributed scheduling algorithm described below schedules activities, respecting affinity, at only their respective places. Within a place, work stealing is enabled to allow load-balanced execution of the computation sub-graph associated with that place. The computation DAG unfolds in an online fashion in a breadth-first manner across places when the affinity driven activities are pushed onto their respective remote places. For space efficiency, before a place-annotated activity is pushed onto a place, the remote place buffer (FAB, see below) is checked for space utilization. If the space utilization of the remote buffer (FAB) is high, then the push is delayed for a limited amount of time. This supports an appropriate space-time tradeoff for the execution of the parallel computation. Within a place, the online unfolding of the computation DAG happens in a depth-first manner to enable space- and time-efficient execution. Sufficient space is assumed to exist at each place, so physical deadlocks due to lack of space cannot happen in this algorithm. Each place maintains a Fresh Activity Buffer (FAB), which is managed by a dedicated processor (distinct from the workers) at that place. An activity that has affinity for a remote place is pushed into the FAB at that place. Each worker at a place has a Ready Deque and a Stall Buffer (refer to Fig. 3). The Ready Deque of a worker contains the activities of the parallel computation that are ready to execute. The Stall Buffer contains the activities that have been stalled due to dependencies on other activities in the parallel computation. The FAB at each place as well as the Ready Deque at each worker use a concurrent deque implementation. An idle worker at a place attempts to randomly steal work from other workers at the same place (randomized work stealing).
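The per-place structures just described can be sketched as follows. This is an illustrative single-process Python sketch under our own naming (Worker, Place, find_work); a single coarse lock stands in for the concurrent-deque implementations the paper assumes, and real workers would of course run on separate processors.

```python
import collections
import random
import threading

class Worker:
    """Per-worker state: a Ready Deque of runnable activities
    and a Stall Buffer of activities blocked on dependencies."""
    def __init__(self):
        self.ready_deque = collections.deque()  # owner works at the bottom (right end)
        self.stall_buffer = set()

class Place:
    """Per-place state: m workers plus a Fresh Activity Buffer (FAB)
    holding activities pushed here by remote spawns."""
    def __init__(self, m, fab_capacity):
        self.workers = [Worker() for _ in range(m)]
        self.fab = collections.deque()
        self.fab_capacity = fab_capacity
        self.lock = threading.Lock()  # stand-in for concurrent deques

    def fab_has_room(self):
        # Remote spawns are delayed while FAB utilization exceeds a threshold.
        return len(self.fab) < self.fab_capacity

def find_work(place, me):
    """Steal policy for an idle worker: each failed steal from the top of a
    random victim's Ready Deque is followed by an attempt on the local FAB."""
    victims = [w for w in place.workers if w is not me]
    with place.lock:
        random.shuffle(victims)
        for victim in victims:
            if victim.ready_deque:
                return victim.ready_deque.popleft()   # steal from the top (left end)
            if place.fab:
                return place.fab.popleft()            # failed steal -> try the FAB
    return None  # stay idle; retry in the next round
```

Note the asymmetry that makes the deque discipline work: the owner pushes and pops at the bottom (depth-first unfolding), while thieves take from the top, so the two rarely contend for the same end.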
Note that an activity pushed onto a place can move between workers at that place (due to work stealing) but cannot move to another place, and thus obeys its affinity at all times. The distributed scheduling algorithm is given below. At any step, an activity A at the rth worker Wir at place Pi may perform the following actions:
1. Spawn: (a) A spawns activity B at place Pj, j ≠ i: A sends AM(B) (the active message for B) to the remote place. If the space utilization of FAB(j) is below a given threshold, then AM(B) is successfully inserted in FAB(j) (at Pj) and A continues

[Figure omitted: a remote spawn from place Pi to place Pj; a worker at Pi sends the request AM(B) to the FAB at Pj, which is managed by a dedicated processor, and receives a spawn accept; each worker at a place has its own Ready Deque and Stall Buffer.]

Fig. 3. Affinity Driven Distributed Scheduling Algorithm

execution. Else, the worker waits for a limited time, δt, before retrying the spawn of activity B at place Pj (Fig. 3). (b) A spawns B locally: B is successfully created and starts execution, whereas A is pushed onto the bottom of the Ready Deque.
2. Terminates (A terminates): The worker Wir at place Pi where A terminated picks an activity from the bottom of its Ready Deque for execution. If none is available in its Ready Deque, it steals from the top of other workers' Ready Deques. Each failed attempt to steal from another worker's Ready Deque is followed by an attempt to get the topmost activity from the FAB at that place. If there is no activity in the FAB, then another victim worker is chosen from the same place.
3. Stalls (A stalls): An activity may stall due to dependencies, in which case it is put in the Stall Buffer in a stalled state. Then proceed as in Terminates (case 2) above.
4. Enables (A enables B): An activity A (after termination or otherwise) may enable a stalled activity B, in which case the state of B changes to enabled and B is pushed onto the top of the Ready Deque.

4.1 Time Complexity Analysis

The time complexity of this affinity driven distributed scheduling algorithm, in terms of the number of throws during execution, is presented below. Each throw represents an attempt by a worker (thief) to steal an activity from either another worker (victim) or the FAB at the same place.

Lemma 1. Consider a strict place-annotated computation DAG with work per place, T^k_1, being executed by the distributed scheduling algorithm presented in Section 4. Then the execution (finish) time for place k is O(T^k_1/m + Q^k_r/m + Q^k_e/m), where Q^k_r denotes the number of throws when there is at least one ready node at place k and Q^k_e denotes the number of throws when there are no ready nodes at place k. The lower bound on the execution time of the full computation is O(max_k (T^k_1/m + Q^k_r/m)) and the upper bound is O(Σ_k (T^k_1/m + Q^k_r/m)).
Proof Sketch: (Token-based counting argument) Consider three buckets at each place in which tokens are placed: a work bucket, where a token is placed when a worker at that place executes a node of the computation DAG; a ready-node-throw bucket, where a token is placed when a worker attempts to steal and there is at least one ready node at that place; and a null-node-throw bucket, where a token is placed when a worker attempts to steal and there are no ready nodes at that place (this models the wait time when there is no work
at that place). The total finish time of a place can be computed by counting the tokens in these three buckets and by considering load-balanced execution within a place (using randomized work stealing). The upper and lower bounds on the execution time arise from the structure of the computation DAG and the structure of the online schedule generated. The detailed proof is presented in [3].

Next, we compute the bound on the number of tokens in the ready-node-throw bucket using a potential-function-based analysis [5]. Our unique contribution is in proving the lower and upper bounds on the time complexity and message complexity of the multi-place affinity driven distributed scheduling algorithm presented in Section 4, which involves both intra-place work stealing and remote-place affinity driven work pushing. Let there be a non-negative potential associated with each ready node in the computation DAG. If the execution of node u enables node v, then edge (u, v) is called the enabling edge and u is called the designated parent of v. The subgraph of the computation DAG consisting of enabling edges forms a tree, called the enabling tree. During the execution of the distributed scheduling algorithm (Section 4), the weight of a node u in the enabling tree is defined as w(u) = T∞,n − depth(u). For a ready node u, we define φi(u), the potential of u at timestep i, as:

φi(u) = 3^(2w(u)−1), if u is assigned;   (4.1a)
φi(u) = 3^(2w(u)),   otherwise.          (4.1b)

All non-ready nodes have zero potential. The potential at step i, φi, is the sum of the potentials of all ready nodes at step i. When the execution begins, the only ready node is the root node, with potential φ0 = 3^(2T∞,n − 1). At the end, the potential is 0, since there are no ready nodes. Let Ei denote the set of workers whose Ready Deque is empty at the beginning of step i, and let Di denote the set of all other workers, with non-empty Ready Deques. Let Fi denote the set of all ready nodes present across the FABs of all places. The total potential can be partitioned into three parts as follows:

φi = φi(Ei) + φi(Di) + φi(Fi)   (4.2)
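The potential function (4.1) can be checked numerically. The following is a small sketch of our own (the helper name and the toy value of T∞,n, written t here, are not from the paper):

```python
def potential(t_inf_n: int, depth_u: int, assigned: bool) -> int:
    """phi_i(u) per (4.1): with weight w(u) = T_inf_n - depth(u), the
    potential is 3^(2w-1) for an assigned ready node, 3^(2w) otherwise."""
    w = t_inf_n - depth_u
    return 3 ** (2 * w - 1) if assigned else 3 ** (2 * w)

t = 4  # a toy critical-path length T_inf_n

# Initially only the root (depth 0, assigned) is ready: phi_0 = 3^(2*T_inf_n - 1).
assert potential(t, 0, True) == 3 ** (2 * t - 1)

# Assigning (e.g. stealing) a ready node divides its potential by 3 ...
assert potential(t, 1, False) == 3 * potential(t, 1, True)

# ... and a newly enabled child (one level deeper, unassigned) carries less
# potential than its assigned parent: 3^(2(w-1)) < 3^(2w-1).
assert potential(t, 1, False) < potential(t, 0, True)
```

These per-action constant-factor drops are what drive the phase argument used below: every action the scheduler takes can only shift potential downward, and enough throws force a multiplicative decrease.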

Actions such as assignment of a node from a Ready Deque to a worker for execution, stealing nodes from the top of a victim's Ready Deque, and execution of a node all decrease the potential. The idle workers at a place alternate between stealing from Ready Deques and stealing from the FAB; thus, 2m throws in a round consist of m throws to other workers' Ready Deques and m throws to the FAB. For randomized work stealing, one can use the balls-and-bins game [3] to compute expected and probabilistic bounds on the number of throws. Using this, one can show that whenever m or more throws occur for getting nodes from the top of the Ready Deques of other workers at the same place, the potential decreases by a constant fraction of φi(Di) with constant probability. The component of potential associated with the FAB at place Pk, φ^k_i(Fi), can be shown to decrease deterministically over m throws in a round. Furthermore, at each place the potential also drops by a constant factor of φ^k_i(Ei). The detailed analysis of the decrease of potential for each component is given in [3]. Analyzing the rate of decrease of potential and using Lemma 1 leads to the following theorem.

Theorem 1. Consider a strict place-annotated computation DAG with work per place k, denoted T^k_1, being executed by the affinity driven multi-place distributed scheduling
algorithm (Section 4). Let the critical-path length for the computation be T∞,n. The lower bound on the expected execution time is O(max_k T^k_1/m + T∞,n) and the upper bound is O(Σ_k (T^k_1/m + T^k_∞)). Moreover, for any ε > 0, the lower bound on the execution time is O(max_k T^k_1/m + T∞,n + log(1/ε)) with probability at least 1 − ε. A similar probabilistic upper bound exists.

Proof Sketch: For the lower bound, we analyze the number of throws (to the ready-node-throw bucket) by breaking the execution into phases of Θ(P) = Θ(mn) throws (O(m) throws per place). It can be shown that, with constant probability, a phase causes the potential to drop by a constant factor. More precisely, between phases i and i+1, Pr{(φi − φi+1) ≥ (1/4)φi} > 1/4 (details in [3]). Since the potential starts at φ0 = 3^(2T∞,n − 1), ends at zero, and takes integral values, the number of successful phases is at most (2T∞,n − 1) log_{4/3} 3 < 8T∞,n. Thus, the expected number of throws per place is bounded by O(T∞,n · m), and the number of throws is O(T∞,n · m + log(1/ε)) with probability at least 1 − ε (using the Chernoff inequality). Using Lemma 1, we get the lower bound on the expected execution time as O(max_k T^k_1/m + T∞,n). The detailed proof and probabilistic bounds are presented in [3].

For the upper bound, consider the execution of the subgraph of the computation at each place. The number of throws to the ready-node-throw bucket per place can similarly be bounded by O(T^k_∞ · m). Further, the place that finishes the execution last can end up with a number of tokens in its null-node-throw bucket equal to the tokens in the work buckets and the ready-node-throw buckets of all other places. Hence, the finish time for this place, which is also the execution time of the full computation DAG, is O(Σ_k (T^k_1/m + T^k_∞)). The probabilistic upper bound can be established similarly using the Chernoff inequality.

The following theorem bounds the message complexity of the affinity driven distributed scheduling algorithm (Section 4).
Theorem 2. Consider the execution of a strict place-annotated computation DAG with critical-path length T∞,n by the Affinity Driven Distributed Scheduling Algorithm (Section 4). Then the total number of bytes communicated across places is O(I · (Smax + nd)), and the lower bound on the number of bytes communicated within a place has expectation O(m · T∞,n · Smax · nd), where nd is the maximum number of dependence edges from the descendants to a parent and I is the number of remote spawns from one place to a remote place. Moreover, for any ε > 0, the probability is at least 1 − ε that the lower bound on the communication overhead per place is O(m · (T∞,n + log(1/ε)) · nd · Smax). Similar upper bounds on messages exist.

Proof. First consider inter-place messages. The number of affinity driven pushes to remote places is O(I), each of at most O(Smax) bytes. Further, there can be at most nd dependencies from remote descendants to a parent, each of which involves communication of a constant, O(1), number of bytes. So the total inter-place communication is O(I · (Smax + nd)). Since the randomized work stealing is within a place, the lower bound on the expected number of steal attempts per place is O(m · T∞,n), with each steal attempt requiring Smax bytes of communication within a place. Further, there can be communication when a child thread enables its parent and puts the parent into the child processor's Ready Deque. Since this can happen nd times for each time
the parent is stolen, the communication involved is at most nd · Smax bytes. So the expected total intra-place communication across all places is O(n · m · T∞,n · Smax · nd). The probabilistic bound can be derived using the Chernoff inequality and is omitted for brevity. Similarly, expected and probabilistic upper bounds can be established for the communication complexity within the places.

5 Results and Analysis

We implemented our distributed scheduling algorithm (ADS) and a pure Cilk-style work-stealing scheduler (CWS) using the pthreads (NPTL) API. The code was compiled using gcc version 4.1.2 with options -O2 and -m64. Using well-known benchmarks, the performance of ADS was compared with CWS and also with the original Cilk scheduler (referred to as CORG in this section; see footnote 6). These benchmarks are the following. Heat: Jacobi over-relaxation that simulates heat propagation on a two-dimensional grid for a number of steps [1]. For our scheduling algorithm (ADS), the 2D grid is partitioned uniformly across the available cores (see footnote 7). Molecular Dynamics (MD): a classical molecular dynamics simulation using the Velocity Verlet time-integration scheme; the simulation was carried out on 16K particles for 100 iterations. Conjugate Gradient (NPB benchmark; see footnote 8): Conjugate Gradient (CG) approximates the largest eigenvalue of a sparse, symmetric, positive-definite matrix using inverse iteration. The matrix is generated by summing outer products of sparse vectors, with a fixed number of nonzero elements in each generating vector. The benchmark computes a given number of eigenvalue estimates, referred to as outer iterations, using 25 iterations of the CG method to solve the linear system in each outer iteration. The performance comparison between ADS and CORG was done on an Intel multi-core platform. This platform has 16 cores (2.93 GHz, Intel Xeon 5570, Nehalem architecture) with 8 MB L3 cache per chip and around 64 GB memory. The Intel Xeon 5570 has NUMA characteristics even though it exposes SMP-style programming. Fig. 4 compares performance on the Heat benchmark (matrix: 32K ∗ 4K, number of iterations = 100, leafmaxcol = 32). Both ADS and CORG demonstrate strong scalability. Initially, ADS is around 1.9× better than CORG, but later this gap stabilizes at around 1.20×.
5.1 Detailed Performance Analysis

In this section, we analyze the performance gains obtained by our ADS algorithm vs. the Cilk-style scheduling (CWS) algorithm, and also investigate the behavior of our algorithm on the Power6 multi-core architecture. Fig. 5 demonstrates the gain in performance of ADS vs. CWS with 16 cores. For CG, a Class B matrix is chosen with parameters: NA = 75K, Non-Zero = 13M, outer iterations = 75, SHIFT = 60. For Heat, the parameter values chosen are: matrix size

6. http://supertech.csail.mit.edu/cilk/
7. The Dmax for this benchmark is log(numCols/leafmaxcol), where numCols represents the number of columns in the input two-dimensional grid and leafmaxcol represents the number of columns to be processed by a single thread.
8. http://www.nas.nasa.gov/NPB/Software

[Figures omitted. Fig. 4 ("Strong Scalability Comparison: ADS vs CORG") reports total time in seconds for the Heat benchmark against the number of cores:

Cores   2     4    8    16
CORG    1623  812  415  244
ADS     859   683  348  195

Fig. 5 ("Performance Comparison: ADS vs CWS") reports total time in seconds on 16 cores:

Benchmark   CG    Heat  MD
CWS         45.7  12.2  9.8
ADS         31.9  10.6  8.9

Fig. 6 ("WS & FAB Overheads: ADS vs CWS") plots CWS_WS_time, ADS_WS_time, and ADS_Fab_Overhead (in seconds) for CG, Heat, and MD.]

Fig. 4. CORG vs ADS

Fig. 5. ADS vs CWS

Fig. 6. ADS vs CWS

= 32 ∗ 4K, number of iterations = 100, and leafmaxcol = 32. While CG shows a maximum gain of 30%, MD shows a gain of 16%. Fig. 6 reports the overheads due to work stealing and FAB stealing in ADS and CWS. ADS has lower work-stealing overhead because work stealing happens only within a place. For CG, the work steal time for ADS (5s) is 3.74× better than for CWS (18.7s). For Heat and MD, the ADS work steal time is 4.1× and 2.8× better, respectively, compared to CWS. ADS incurs FAB overheads, but this time is very small, around 13% to 22% of the corresponding work steal time. CWS has higher work-stealing overhead because work stealing happens from any place to any other place; hence, the NUMA delays add up to give a larger work steal time. This demonstrates the superior execution efficiency of our algorithm over CWS.

We also measured the detailed characteristics of our scheduling algorithm on a multi-core Power6 platform. This has 16 Power6 cores and 128 GB total memory. Each core has a 64 KB instruction L1 cache and a 64 KB L1 data cache along with a 4 MB semi-private unified L2 cache; two cores on a Power6 chip share an external 32 MB L3 cache. Fig. 7 plots the variation of the work-stealing time, the FAB-stealing time, and the total time with changing configurations of the multi-place setup, for the MD benchmark. With the total number of cores held constant at 16, the configurations, in the format (number of places * number of processors per place), are: (a) (16 ∗ 1), (b) (8 ∗ 2), (c) (4 ∗ 4), and (d) (2 ∗ 8). As the number of places increases from 2 to 8, the work steal time increases from 3.5s to 80s, as the average number of work steal attempts increases from 140K to 4M. For 16 places, the work steal time falls to 0, since there is only a single processor per place, so work stealing does not happen. The FAB steal time, however, increases monotonically, from 0.3s for 2 places to 110s for 16 places.
In the (16 ∗ 1) configuration, the processor at a place gets activities to execute only through remote pushes onto its place. Hence, the FAB steal time at the place becomes high, as the number of FAB attempts (300M on average) is very large, while the number of successful FAB attempts is very low (1400 on average). As the number of places increases from 2 to 16, the total time increases from 189s to 425s, due to the increase in work-stealing and/or FAB-steal overheads. Fig. 8 plots the work-stealing time and FAB-stealing time variation with changing multi-place configurations for the CG benchmark (using a Class C matrix with parameter values: NA = 150K, Non-Zero = 13M, outer iterations = 75, and SHIFT = 60). In this case, the work steal time increases from 12.1s (for (2 ∗ 8)) to 13.1s (for (8 ∗ 2)) and then falls to 0 for the (16 ∗ 1) configuration. The FAB time initially increases slowly from 3.6s to 4.1s but then jumps to 81s for the (16 ∗ 1) configuration. This behavior can be explained as in the case of the MD benchmark (above).


Fig. 9 plots the work-stealing time and FAB-stealing time variation with changing multi-place configurations for the Heat benchmark (using parameter values: matrix size = 64K ∗ 8K, iterations = 100, and leafmaxcol = 32). The variation of work-stealing time, FAB-stealing time, and total time follows the same pattern as in the case of MD.

[Figures omitted. Figs. 7, 8, and 9 ("WS & FAB Overheads Variation" for MD, CG, and Heat, respectively) plot ADS_WS_time, ADS_FAB_time, and ADS_Total_Time (in seconds) against the multi-place configuration (Num Places * Num Procs Per Place) ∈ {(2 * 8), (4 * 4), (8 * 2), (16 * 1)}.]

Fig. 7. Overheads - MD

Fig. 8. Overheads - CG

Fig. 9. Overheads - HEAT

Fig. 10 gives the variation of the Ready Deque average and maximum space consumption across all processors, and of the FAB average and maximum space consumption across places, with changing configurations of the multi-place setup. As the number of places increases from 2 to 16, the FAB average space first increases from 4 to 7 stack frames and then decreases to 6.4 stack frames. The maximum FAB space usage increases from 7 to 9 stack frames but then returns to 7 stack frames. The maximum Ready Deque space consumption increases from 11 to 12 stack frames but returns to 9 stack frames for 16 places, while the average Ready Deque space monotonically decreases from 9.69 to 8 stack frames. The Dmax for this benchmark setup is 11 stack frames, which yields 81% maximum FAB utilization and roughly 109% maximum Ready Deque utilization. Fig. 12 gives the variation of FAB space and Ready Deque space with changing configurations for the CG benchmark (Dmax = 13). Here, the FAB utilization is very low and remains so with varying configurations, while the Ready Deque utilization stays close to 100%. Fig. 11 gives the variation of FAB space and Ready Deque space with changing configurations for the Heat benchmark (Dmax = 12). Here, the FAB utilization is high (close to 100%) and remains so with varying configurations; the Ready Deque utilization also stays close to 100%. This empirically demonstrates that our distributed scheduling algorithm has efficient space utilization as well.

[Figures omitted. Figs. 10, 11, and 12 ("Ready Deque & FAB Space Variation" for MD, Heat, and CG, respectively) plot Ready_Deque_Avg, Ready_Deque_Max, FAB_Avg, and FAB_Max (in number of stack frames) against the multi-place configuration (Num Places * Num Procs Per Place) ∈ {(2 * 8), (4 * 4), (8 * 2), (16 * 1)}.]

Fig. 10. Space Util - MD

Fig. 11. Space Util - HEAT

Fig. 12. Space Util - CG


6 Conclusions and Future Work

We have addressed the challenging problem of affinity driven online distributed scheduling of parallel computations, and have provided theoretical analysis of the time and message complexity bounds of our algorithm. On well-known benchmarks, our algorithm demonstrates around 16% to 30% performance gain over typical Cilk-style scheduling. Detailed experimental analysis shows the scalability of our algorithm along with efficient space utilization. This is the first such work on affinity driven distributed scheduling of parallel computations in a multi-place setup. In the future, we plan to look into space-time tradeoffs and Markov-chain based modeling of the distributed scheduling algorithm.

References

1. Acar, U.A., Blelloch, G.E., Blumofe, R.D.: The data locality of work stealing. In: SPAA, New York, NY, USA, pp. 1–12 (December 2000)
2. Agarwal, S., Barik, R., Bonachea, D., Sarkar, V., Shyamasundar, R.K., Yelick, K.: Deadlock-free scheduling of X10 computations with bounded resources. In: SPAA, San Diego, CA, USA, pp. 229–240 (December 2007)
3. Agarwal, S., Narang, A., Shyamasundar, R.K.: Affinity driven distributed scheduling algorithms for parallel computations. Tech. Rep. RI09010, IBM India Research Labs, New Delhi (July 2009)
4. Allan, E., Chase, D., Luchangco, V., Maessen, J.-W., Ryu, S., Steele Jr., G.L., Tobin-Hochstadt, S.: The Fortress language specification, version 0.618. Tech. rep., Sun Microsystems (April 2005)
5. Arora, N.S., Blumofe, R.D., Plaxton, C.G.: Thread scheduling for multiprogrammed multiprocessors. In: SPAA, Puerto Vallarta, Mexico, pp. 119–129 (1998)
6. Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work stealing. J. ACM 46(5), 720–748 (1999)
7. Blumofe, R.D., Lisiecki, P.A.: Adaptive and reliable parallel computing on networks of workstations. In: USENIX Annual Technical Conference, Anaheim, California (1997)
8. Chamberlain, B.L., Callahan, D., Zima, H.P.: Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications 21(3), 291–312 (2007)
9. Charles, P., Donawa, C., Ebcioglu, K., Grothoff, C., Kielstra, A., von Praun, C., Saraswat, V., Sarkar, V.: X10: An object-oriented approach to non-uniform cluster computing. In: OOPSLA 2005, Onward! Track (2005)
10. Exascale Study Group, Kogge, P. (ed.), Harrod, W. (program manager): Exascale computing study: Technology challenges in achieving exascale systems. Tech. rep. (September 2008)
11. Yelick, K., et al.: Productivity and performance using partitioned global address space languages. In: PASCO 2007: Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, pp. 24–32. ACM, New York (2007)

Temporal Specifications for Services with Unboundedly Many Passive Clients

Shamimuddin Sheerazuddin
The Institute of Mathematical Sciences, C.I.T. Campus, Chennai 600 113, India
[email protected]

Abstract. We consider a client-server system in which an unbounded (finite but unknown) number of clients request service from the server. The system is passive in that there is no further interaction between send-request and receive-response. We give an automata-based model for such systems and a temporal logic to frame specifications. We show that the satisfiability and model-checking problems for the logic are decidable.

Keywords: temporal logic, web services, client-server systems, decidability, model checking.

1 Introduction

In [DSVZ06], the authors consider a Loan Approval Service [TC03], which consists of Web Services, called peers, that interact with each other via asynchronous message exchanges. One of the peers is designated as the Loan Officer, the loan disbursal authority. It receives a loan request, say for 5000, from a customer, checks her credit rating with a third party, and approves or rejects the request according to some lending policy. The loan approval problem becomes doubly interesting when the disbursal officer is confronted with a number of customers asking for loans of different sizes, say ranging from 5000 to 500,000. In such a scenario, with a bounded pool of money to loan out, it may be possible that a high loan request is accepted when there is no other pending request, or rejected when accompanied by pending requests of lower loan sizes. This is an example of a system composed of unboundedly many agents: how many processes are active at any system state is not known at design time but is determined only at run time. Thus, though at any point of time only finitely many agents may be participating, there is no uniform bound on the number of agents. Design and verification of such systems are becoming increasingly important in distributed computing, especially in the context of Web Services. Since services handling unknown clients need to make decisions based upon request patterns that are not pre-decided, they need to conform to specific service policies that are articulated at design time. Due to concurrency and unbounded state information, the design and implementation of such services becomes complex and hence subject to logical flaws. Thus, there is a need for formal methods for specifying service policies and verifying that systems implement them correctly.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 179–190, 2011.
© Springer-Verlag Berlin Heidelberg 2011

180

S. Sheerazuddin

Formal methods come in many flavours. One special technique is that of model checking [CGP00]. The system to be checked is modelled as a finite state system, and properties to be verified are expressed as constraints on the possible computations of the model. This allows algorithmic tools to be employed in verifying that the model is indeed correct with respect to those properties. When we find violations, we re-examine the finite-state abstraction, leading to a finer model, perhaps also refine the specifications, and repeat the process. Finding good finite state abstractions forms an important part of the approach. Modelling systems with unboundedly many clients is fraught with difficulties. Since we have no bound on the number of active processes, the state space is infinite. A finite state abstraction may kill the very feature we wish to check, since the service policies we are interested in involve unbounded numbers of clients. On the other hand, finite presentation of such systems with infinite state spaces and/or their computations comes with a considerable amount of difficulty. Propositional temporal logics have been extensively used for specifying safety and liveness requirements of reactive systems. Backed by a set of tools with theorem proving and model checking capabilities, temporal logic is a natural candidate for specifying service policies. In the context of Web Services, these logics have been extended with mechanisms for specifying message exchange between agents. There are several candidate temporal logics for message passing systems, but these work with an a priori fixed number of agents, and for any message, the identity of the sender and the receiver is fixed at design time. We need to extend such logics with means for referring to agents in some more abstract manner (than by name). On the other hand, the client-server interaction needs a far simpler communication facility than what is typically considered in any peer-to-peer communication model.

A natural and direct approach to refer to unknown clients is to use logical variables: rather than work with atomic propositions p, we use monadic predicates p(x) to refer to property p being true of client x. We can quantify over such x existentially and universally to specify policies relating to clients. We are thus naturally led to the realm of Monadic First Order Temporal Logics (MFOTL) [GHR94]. In fact, it is easily seen that MFOTL is expressive enough to frame almost every requirement specification of client-server systems of the kind discussed above. Unfortunately, MFOTL is undecidable [HWZ00], and we need to limit its expressiveness so that we have a decidable verification problem. We propose a fragment of MFOTL for which satisfiability and model checking are decidable. Admittedly, this language is weak in expressive power, but our claim is that reasoning in such a logic is already sufficient to express a broad range of service policies in systems with unboundedly many clients. Thus, we need to come up with a formal model that finitely presents infinite state spaces and a specification language that involves quantification and temporal modalities, while ensuring that model checking can be done. This is the issue we address in this paper, by simplifying the setting a little. We consider the case where there is only one server dealing with unboundedly many clients that do not

Temporal Speciﬁcations for Services with Unboundedly Many Passive Clients

181

communicate with each other, and propose a class of automaton models for passive clientele. The client is passive in that it simply sends a request to the server and waits for the service (or an answer that the service cannot be provided). The client has no further interaction with the server. We suggest that it suffices to specify such clients by boolean formulas over unary predicates. State formulas of the server are then monadic first order sentences over such predicates, and server properties are specified by temporal formulas built from such sentences. In the literature [CHK+01], these services with passive clientele are called discrete services. We call our model Services for Passive Clients (SPS). Before we proceed to technicalities, we wish to emphasize that what is proposed in this paper is in the spirit of a framework rather than a definitive temporal logic for services with unbounded clientele. The combination of modalities as well as the structure of models should finally be decided only on the basis of applications. Even though essentially abstract, our paradigm continues the research on Web Service composition [BHL+02], [NM02], work on Web Service programming languages [FGK02] and the AZTEC prototype [CHK+01]. There have been many automaton-based models for Web Services but, as far as we know, none of them incorporates unboundedly many clients. [BFHS03] models Web Services as Mealy machines; [FBS04] models Web Services as Büchi automata and focuses on message passing between them. The Roman model [BCG+03] focuses on an abstract notion of activities, and in essence models Web Services as finite state machines with transitions labelled by these activities. The Colombo model [BCG+03] combines the elements of [FBS04] and [BCG+03] along with the OWL-S model [NM02] of Web Services and accounts for data in messages too. Decidable fragments of MFOTL are few and far between.

As far as we know, the monodic fragment [HWZ00], [HWZ01] is the only decidable one found in the literature. The decidability crucially hinges on the fact that there is at most one free variable in the scope of temporal modalities. Later, it was found that the packed monodic fragment with equality is decidable too [Hod02]. In the realm of branching time logics with first-order extensions, it is shown in [HWZ02] that, by restricting applications of first-order quantifiers to state (path-independent) formulas, and applications of temporal operators and path quantifiers to formulas with at most one free variable, we can obtain decidable fragments.

2

The Client Server Model

Fix CN , a countable set of client names. In general, this set would be recursively generated using a naming scheme, for instance using sequence numbers and timestamps generated by processes. We choose to ignore this structure for the sake of technical simplicity. We will use a, b etc. with or without subscripts to denote elements of CN .


2.1

Passive Clients

Fix Γ0, a finite service alphabet. We use u, v etc. to denote elements of Γ0; they are thought of as types of services provided by a server. The extended alphabet is the set Γ = {req_u, ans_u | u ∈ Γ0} ∪ {τ}. These refer to requests for such services and answers to such requests, as well as the "silent" internal action τ. Elements of Γ0 represent logical types of services that the server is willing to provide. This means that when two clients ask for a service of the same type, given by an element of Γ0, the server can tell them apart only by their names. We could in fact then insist that the server's behaviour be identical towards both, but we do not make such an assumption, to allow for generality. We define below systems of services that handle passive clientele. Servers are modelled as state transition systems which identify clients only by the type of service they are associated with. Thus, transitions are associated with client types rather than client names.

Definition 2.1. A Service for Passive Clients (SPS) is a tuple A = (S, δ, I, F) where S is a finite set of states, δ ⊆ (S × Γ × S) is a server transition relation, I ⊆ S is the set of initial states, and F ⊆ S is the set of final states of A.

Without loss of generality we assume that in every SPS, the transition relation δ is such that for every s ∈ S, there exist r ∈ Γ and s′ ∈ S such that (s, r, s′) ∈ δ. The use of the silent action τ makes this an easy assumption. Note that an SPS is a finite state description. A transition of the form (s, req_u, s′) refers implicitly to a new client of type u rather than to any specific client name. The meaning of this is provided in the run generation mechanism described below. A configuration of an SPS A is a triple (s, C, χ) where s ∈ S, C is a finite subset of CN and χ : C → Γ0. Thus a configuration specifies the control state of the server, as well as the finite set of active clients at that configuration and their types.

We use the convention that when C = ∅, the graph of χ is the empty set as well. Let Ω_A denote the set of all configurations of A; note that it is this infinite configuration space that is navigated by the behaviours of A. We can extend the transition relation δ to configurations, ⇒ ⊆ (Ω_A × Γ × Ω_A), as follows: (s, C, χ) ⇒_r (s′, C′, χ′) iff (s, r, s′) ∈ δ and the following conditions hold:

– when r = τ, C′ = C and χ′ = χ;
– when r = req_u, C′ = C ∪ {a}, χ′(a) = u and χ′ restricted to C equals χ, where a is the least element of CN − C;
– when r = ans_u, X = {a ∈ C | χ(a) = u} ≠ ∅, C′ = C − {a} where a is the least element in the enumeration of X, and χ′ is χ restricted to C′.

A configuration (s, C, χ) is said to be initial if s ∈ I and C = ∅. A run of an SPS A is an infinite sequence of configurations ρ = c0 r1 c1 · · · rn cn · · ·, where c0 is initial, and for all j > 0, (c_{j−1}, r_j, c_j) ∈ ⇒. Let R_A denote the set of runs of A. Note that runs have considerable structure. For instance, A may have an infinite path generated by a self-loop of the form (s, req_u, s) in δ, which corresponds to an infinite sequence of service requests of a particular type. Thus, these systems
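The configuration semantics above can be prototyped directly. The following is a minimal Python sketch, using natural numbers as client names so that "the least element of CN − C" is computable; the tiny one-type SPS at the end is our own illustration, not one of the paper's examples:

```python
# Sketch of the SPS configuration semantics (Definition 2.1 and the
# extended transition relation). Client names CN are the naturals,
# so "least element of CN - C" is just a minimum.

def step(delta, config, action):
    """Fire `action` from `config` = (s, C, chi); return the successor
    configuration, or None if no matching transition is enabled."""
    s, C, chi = config
    for (src, r, dst) in delta:
        if src != s or r != action:
            continue
        if r == "tau":                            # silent move
            return (dst, C, chi)
        if r.startswith("req_"):                  # a fresh client enters
            u = r[len("req_"):]
            a = min(set(range(len(C) + 1)) - C)   # least unused name
            return (dst, C | {a}, {**chi, a: u})
        if r.startswith("ans_"):                  # least pending client leaves
            u = r[len("ans_"):]
            X = {a for a in C if chi[a] == u}
            if not X:
                continue                          # rule not enabled
            a = min(X)
            chi2 = {b: t for b, t in chi.items() if b != a}
            return (dst, C - {a}, chi2)
    return None

# A hypothetical two-state SPS over a single service type u:
delta = {("s0", "req_u", "s1"), ("s1", "ans_u", "s0")}
c0 = ("s0", set(), {})            # initial configuration: no clients
c1 = step(delta, c0, "req_u")     # -> ("s1", {0}, {0: "u"})
c2 = step(delta, c1, "ans_u")     # -> ("s0", set(), {})
```

Note how the run mechanism, not the finite automaton, supplies client identities: the transition labels mention only service types, while `step` picks the concrete least name.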


have interesting reachability properties. But, as we shall see, our main use of these systems is as models of a temporal logic, and since the logic is rather weak, information present in the runs will be under-utilized.

Language of an SPS: Given a run ρ = c0 r1 c1 r2 · · · rn cn · · ·, we define inf(ρ) as the set of states which occur infinitely many times on the run. That is, inf(ρ) = {q ∈ S | ∃^∞ i ∈ ω, c_i = (q, ∅, ∅)}. A run ρ is good if inf(ρ) ∩ F ≠ ∅. The language of A, Lang(A) ⊆ Γ^ω, is then defined as follows:

Lang(A) = {r1 r2 · · · rn · · · | there is a good run ρ = c0 r1 c1 r2 · · · rn cn · · · }.

Once we have fixed goodness properties for the runs R_A of a given system A, it is easily seen that SPS are closed under union and intersection. Also, it can be observed that once an external bound on CN is assumed, the size of the configuration set Ω_A becomes bounded and all the decision problems for A become decidable.

[Figure: a state diagram over states q0–q6 with transitions labelled req_l, req_h, ans_l, ans_h]

Fig. 1. An SPS for Loan Approval Web Service System: A1

3

Loan Approval Service

We give an example SPS for an automated Loan Approval Web Service System, which is a kind of discrete service. In this composite system, there is a designated Web Service acting as Loan Officer which admits loan requests of various sizes, say h depicting high (large) amounts and l depicting low (small) amounts. Depending on the number of loan requests (high and low) and according to an a priori fixed loan disbursal policy, the loan officer accepts or rejects the pending requests. The behaviour of the Loan Officer is captured as an SPS as follows. Let Γ0 = {h, l}, where h denotes a high-end loan and l a low-size loan, and let the corresponding alphabet be Γ = {req_h, req_l, ans_h, ans_l}. The Loan Approval System can then be modelled as an SPS A1 = (S1, δ1, I1, F1) as shown in Figure 1. Here, we briefly describe the working of the automaton. A1, starting from q0, keeps


track of at most two low-amount requests: q1 is the state with one pending request whereas q4 is the state with two pending requests. Whenever the system gets a high-amount request, it seeks to dispose of it at the earliest, and tries to avoid taking up a low request as long as a high one is pending. But it may not succeed all the time; i.e., when the automaton reaches q6, it is possible that it loops back to the initial state q0 with one or more high requests pending, and then takes up low requests. It is not difficult to see that there are runs of A1 that satisfy the following property, ψ1, and there are those which don't. ψ1 is expressed in English as "whenever there is a request of type low there is an answer of type low in the next instant". Now, suppose there is another property ψ2 described as "there is no request of type low taken up as long as there is a high request pending". If we want to avoid ψ2 in the Loan Approval System then we need to modify A1 and define A2 = (S2, δ2, I2, F2) as in Figure 2.

[Figure: a state diagram over states q0–q7 with transitions labelled req_l, req_h, ans_l, ans_h]

Fig. 2. A modified SPS for Loan Approval Web Service System: A2

Furthermore, if we want to make sure that there are no two low requests pending at any time, i.e., that our model satisfies ψ1, we modify A2 and describe A3 in Figure 3. We shall see later that these properties can be described easily in a decidable logic which we call L_SPS. Notice that, in an SPS, the customer (user or client) simply sends a request of some particular type and waits for an answer. Things become interesting when the customer seeks to interact with the loan officer (server) between the send-request and the receive-response. In this case, the complex patterns of interaction between the client and the server have to be captured by a stronger automaton model. We shall tackle this issue in a separate paper.


[Figure: a state diagram over states q0–q7 with transitions labelled req_l, req_h, ans_l, ans_h]

Fig. 3. Another modified SPS for Loan Approval Web Service System: A3

4

L_SPS

In this section we describe a logical language to specify and verify SPS-like systems. Such a language has two mutually exclusive dimensions: one, captured by an MFO fragment, talks about the plurality of clients asking for a variety of services; the other, captured by an LTL fragment, talks about the temporal variation of the services being rendered. Furthermore, the MFO fragment has to be multi-sorted to cover the multiplicity of service types. Keeping these issues in mind, we frame a logical language which we call L_SPS. L_SPS is a cross between LTL and multi-sorted MFO. In LTL, atomic formulas are propositional constants which have no further structure. In L_SPS, there are two kinds of atomic formulas: basic server properties from Ps, and MFO sentences over client properties Pc. Consequently, these formulas are interpreted over sequences of MFO structures juxtaposed with LTL models.

4.1

Syntax and Semantics

At the outset, we fix Γ0, a finite set of client types. The set of client formulas is defined over a countable set of atomic client predicates Pc, which is composed of disjoint sets of type-u predicates Pc^u, one for each u ∈ Γ0. Also, let Var be a countable supply of variable symbols and CN a countable set of client names. CN is divided into disjoint sets of types from Γ0 via λ : CN → Γ0. Similarly, Var is divided using Π : Var → Γ0. We use x, y to denote elements of Var and a, b for elements of CN. Formally, the set of client formulas Φ is:

α, β ∈ Φ ::= p(x : u), p ∈ Pc^u | x = y, x, y ∈ Var^u | ¬α | α ∨ β | (∃x : u)α.


Let SΦ be the set of all sentences in Φ. Then the server formulas are defined as follows:

ψ ∈ Ψ ::= q ∈ Ps | ϕ ∈ SΦ | ¬ψ | ψ1 ∨ ψ2 | ◯ψ | ψ1 U ψ2.

This logic is interpreted over sequences of MFO models composed with LTL models. Formally, a model is a triple M = (ν, D, I) where

– ν = ν0 ν1 · · ·, where for all i ∈ ω, νi ⊆fin Ps, gives the local properties of the server at instant i,
– D = D0 D1 D2 · · ·, where for all i ∈ ω, Di = (Di^u)_{u∈Γ0} with Di^u ⊆fin CN_u, gives the identities of the clients of each type being served at instant i, and
– I = I0 I1 I2 · · ·, where for all i ∈ ω, Ii = (Ii^u)_{u∈Γ0} and Ii^u : Di^u → 2^{Pc^u} gives the properties satisfied by each live agent at the i-th instant, in other words, the corresponding states of the live agents. Alternatively, Ii^u can be given in the equivalent form Ii^u : Di^u × Pc^u → {⊤, ⊥}.

Satisfaction relations |=, |=Φ: Let M = (ν, D, I) be a model and π : Var → CN a partial map consistent with λ and Π. Then the relations |= and |=Φ can be defined, by induction over the structure of ψ and α respectively, as follows.

– M, i |= q iff q ∈ νi.
– M, i |= ϕ iff M, ∅, i |=Φ ϕ.
– M, i |= ¬ψ iff not M, i |= ψ.
– M, i |= ψ1 ∨ ψ2 iff M, i |= ψ1 or M, i |= ψ2.
– M, i |= ◯ψ iff M, i + 1 |= ψ.
– M, i |= ψ1 U ψ2 iff there exists j ≥ i such that M, j |= ψ2 and for all i ≤ i′ < j, M, i′ |= ψ1.

– M, π, i |=Φ p(x : u) iff π(x) ∈ Di^u and Ii(π(x), p) = ⊤.
– M, π, i |=Φ x = y iff π(x) = π(y).
– M, π, i |=Φ ¬α iff not M, π, i |=Φ α.
– M, π, i |=Φ α ∨ β iff M, π, i |=Φ α or M, π, i |=Φ β.
– M, π, i |=Φ (∃x : u)α iff there exists a ∈ Di^u such that M, π[x → a], i |=Φ α.

5

Specification Examples Using L_SPS

As claimed in the previous section, we would like to show that our logic L_SPS adequately captures many facets of SPS-like systems. We consider the Loan Approval Web Service, which has already been explained with a number of examples, and frame a number of specifications to demonstrate the use of L_SPS. For the Loan Approval System we have client types Γ0 = {h, l} and client properties Pc = {req_h, req_l, ans_h, ans_l}, where h means a loan request of type high and l means a loan request of type low. For this system, we can write a few simple specifications, viz.: initially there are no pending requests; whenever there is a request of type low there is an approval of type low in the next instant; there is no request of type low taken up as long as there is a high request pending; and there is at least one request of each type pending all the time. In L_SPS these can be framed as follows:

– ψ0 = ¬((∃x : h)req_h(x) ∨ (∃x : l)req_l(x)),
– ψ1 = □((∃x : l)req_l(x) ⊃ ◯(∃y : l)ans_l(y)),
– ψ2 = □((∃x : h)req_h(x) ⊃ ¬(∃y : l)req_l(y)),
– ψ3 = □((∃x : l)req_l(x) ∨ (∃y : h)req_h(y)).

Note that none of these formulas makes use of the equality (=) predicate. Using =, we can make stronger statements, like "at all times there is exactly one pending request of type high" and "at all times there is at most one pending request of type high". These can be expressed in L_SPS as follows:

– ψ4 = □((∃x : h)(req_h(x) ∧ (∀y : h)(req_h(y) ⊃ x = y))),
– ψ5 = □(¬(∃x : h)req_h(x) ∨ (∃x : h)(req_h(x) ∧ (∀y : h)(req_h(y) ⊃ x = y))).

In the same vein, using =, we can count the requests of each type and say more interesting things. For example, if ϕ_2h asserted at a point means that there are at most 2 requests of type h pending, then we can frame the following formula:

ψ6 = □(ϕ_2h ⊃ (◯ϕ_2h ⊃ ◯□ϕ_2h))

which means: if there are at most two pending requests of type high at successive instants, then thereafter the number stabilizes. Unfortunately, owing to the lack of provision for free variables in the scope of temporal modalities, we cannot write specifications which seek to match requests and approvals. Here is a sample:

□(∀x)(req_u(x) ⊃ ◇ans_u(x))

which means: if there is a request of type u at some point of time then the same is approved some time in the future. Allowing indiscriminate application of quantification over temporal modalities leads to undecidable logics. As we are aware, even two free variables in the scope of temporal modalities allow us to encode undecidable tiling problems. The challenge is to come up with appropriate constraints on specifications which allow us to express interesting properties while remaining decidable to verify.
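The per-instant content of such specifications can be checked mechanically against one MFO structure (Di, Ii). A minimal Python sketch, where the instant data and helper names are invented for illustration:

```python
# One instant of an L_SPS model: for each type u, the live clients D[u]
# and the predicates I[u][a] that each live client a satisfies.
D = {"h": {"a1"}, "l": {"b1", "b2"}}
I = {"h": {"a1": {"req_h"}},
     "l": {"b1": {"req_l"}, "b2": {"ans_l"}}}

def exists(u, pred):
    """(∃x:u) pred(x) at this instant."""
    return any(pred in I[u][a] for a in D[u])

def forall(u, pred):
    """(∀x:u) pred(x) at this instant."""
    return all(pred in I[u][a] for a in D[u])

def at_most_one(u, pred):
    """The per-instant content of the at-most-one pattern: either no
    witness, or a witness x with (∀y:u)(pred(y) ⊃ x = y)."""
    return sum(pred in I[u][a] for a in D[u]) <= 1

# State checks corresponding to the atomic parts of the examples:
assert exists("h", "req_h")        # a high request is pending
assert not forall("l", "req_l")    # client b2 has already been answered
assert at_most_one("h", "req_h")   # at most one pending high request
```

The temporal layer (□, ◯, U) would then iterate such checks over the sequence of instants, exactly as the LTL clauses of Section 4 prescribe.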

6

Satisfiability and Model Checking for L_SPS

We settle the satisfiability issue for L_SPS using automata-theoretic techniques, first proposed by Vardi and Wolper [VW86]. Let ψ0 be an L_SPS formula. We compute a formula automaton Aψ0 such that the following holds.

Lemma 6.1. ψ0 is satisfiable iff Lang(Aψ0) is non-empty.

From the given L_SPS formula ψ0, we can obtain Pψ0, the set of all MFO predicates occurring in ψ0, and Varψ0, the set of all variable symbols occurring in ψ0. Using these two sets we can generate all possible MFO models at the atom level. These complex atoms, which incorporate MFO valuations as well as LTL valuations, are then used to construct a Büchi automaton in the standard manner to generate all possible models of ψ0. Then, the following is immediate.

Lemma 6.2. Given an L_SPS formula ψ0 with |ψ0| = n, the satisfiability of ψ0 can be checked in time 2^{O(n·r·2^k)}, where r is the number of variable symbols occurring in ψ0 and k is the number of predicate symbols occurring in ψ0.


In order to specify SPS, in which clients do nothing but send a request of type u and wait for an answer, the most we can say about a client x is whether a request from x is pending or not. So the set of client properties is Pc = {req_u, ans_u | u ∈ Γ0}. When req_u(x) holds at some instant i, it means there is a pending request of type u from x at i. When ans_u(x) holds at i, it means either there was no request from x or the request from x has already been answered. That is, ans_u(x) =def ¬req_u(x). For the above sublogic, that is, L_SPS with Pc = {req_u | u ∈ Γ0}, we assert the following theorem, which can be inferred directly from Lemma 6.2.

Theorem 6.3. Let ψ0 be an L_SPS formula with |Vψ0| = r, |Γ0| = k and |ψ0| = n. Then, satisfiability of ψ0 can be checked in time O(2^{n·r·2^k}).

6.1

Model Checking Problem for L_SPS

The goal of this section is to formulate the model checking problem for L_SPS and show that it is decidable. We again solve this problem using the so-called automata-theoretic approach. In this setting, the client-server system is modelled as an SPS A, and the specification is given by a formula ψ0 in L_SPS. The model checking problem is to check whether the system A satisfies the specification ψ0, denoted A |= ψ0. In order to do this we bound the SPS using ψ0 and define an interpreted version.

Bounded Interpreted SPS: Let A = (S, δ, I, F) be an SPS and ψ0 a specification in L_SPS. From ψ0 we get Vu(ψ0), the type-u variables, for each u ∈ Γ0. Now, let n = (Σ_u r_u) · k, where |Γ0| = k and |Vu(ψ0)| = r_u; n is the bound for the SPS A. Now, for each u ∈ Γ0, CN_u = {(i, u) | 1 ≤ i ≤ r_u} and CN = ∪_u CN_u. For each u, define C_u = {{(j, u) | 1 ≤ j ≤ i} | 1 ≤ i ≤ r_u} ∪ {∅}. Thereafter, define C = Π_{u∈Γ0} C_u. Now, we have CN = ∪_{C∈C} C. We are now in a position to define an interpreted form of bounded SPS. The interpreted SPS is a tuple A = (Ω, ⇒, I, F, Val), where Ω = S × C, I = {(s, C) | s ∈ I, C = ∅}, F = {(s, C) | s ∈ F, C = ∅}, Val : Ω → 2^{Ps} × C, and ⇒ ⊆ Ω × Γ × Ω is given as follows: (s, C) ⇒_r (s′, C′) iff (s, r, s′) ∈ δ and the following conditions hold:

– when r = τ, C′ = C;
– when r = req_u, CN_u − C ≠ ∅, and if a ∈ CN_u − C is the least in the enumeration then C′ = C ∪ {a};
– when r = ans_u, X = C ∩ CN_u ≠ ∅, and C′ = C − {a} where a ∈ X is the least in the enumeration.

Note that |C| = Π_{u∈Γ0}(r_u + 1) and, if |S| = l, then |Ω| = O(l · r^k), where r = max_u r_u. Now, we can define the language of the interpreted SPS A as Lang(A) = {Val(c0)Val(c1) · · · | c0 r1 c1 r2 c2 · · · is a good run in A}. We say that A satisfies ψ0 if Lang(A) ⊆ Lang(Aψ0), where Aψ0 is the formula automaton of ψ0. This holds exactly when Lang(A) ∩ Lang(A¬ψ0) = ∅. Therefore, the

complexity to check emptiness of the product automaton is linear in the product of the sizes of A and Aψ0.

Theorem 6.4. A |= ψ0 can be checked in time O(l · r^k · 2^{n·r·2^k}).

7

Discussion

To conclude, we gave an automaton model for unbounded-agent server-client systems providing discrete services [CHK+01] and a temporal logic to specify such services, and presented an automata-based decidability argument for satisfiability and model checking of the logic. We shall extend the SPS to model session-oriented client-server systems in a subsequent paper. We shall also take up the task of framing appropriate temporal logics to specify such services. This forces us into the realm of MFOTL with free variables in the scope of temporal modalities [Hod02]. We know that too many of those are fatal [HWZ00]. The challenge is to define suitable fragments of MFOTL which are sufficiently expressive as well as decidable. As this paper lacks an automata theory of SPS, we need to explore whether infinite-state reachability techniques such as [BEM97] could be used. An extension of the work in this paper would be to define models and logics for systems with multiple servers, say n, together serving unbounded clients. An orthogonal exercise could be the development of tools to efficiently implement model checking of SPS systems against L_SPS specifications, à la MONA [HJJ+95], [KMS00] or SPIN [Hol97], [RH04].

References

[BCG+03] Berardi, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Mecella, M.: Automatic composition of E-services that export their behavior. In: Orlowska, M.E., Weerawarana, S., Papazoglou, M.P., Yang, J. (eds.) ICSOC 2003. LNCS, vol. 2910, pp. 43–58. Springer, Heidelberg (2003)
[BEM97] Bouajjani, A., Esparza, J., Maler, O.: Reachability analysis of pushdown automata: Application to model-checking. In: Mazurkiewicz, A., Winkowski, J. (eds.) CONCUR 1997. LNCS, vol. 1243, pp. 135–150. Springer, Heidelberg (1997)
[BFHS03] Bultan, T., Fu, X., Hull, R., Su, J.: Conversation specification: a new approach to design and analysis of e-service composition. In: WWW, pp. 403–410 (2003)
[BHL+02] Burstein, M.H., Hobbs, J.R., Lassila, O., Martin, D.L., McDermott, D.V., McIlraith, S.A., Narayanan, S., Paolucci, M., Payne, T.R., Sycara, K.P.: DAML-S: Web service description for the semantic web. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS, vol. 2342, pp. 348–363. Springer, Heidelberg (2002)
[CGP00] Clarke, E.M., Grumberg, O., Peled, D.: Model Checking. MIT Press, Cambridge (2000)
[CHK+01] Christophides, V., Hull, R., Karvounarakis, G., Kumar, A., Tong, G., Xiong, M.: Beyond discrete E-services: Composing session-oriented services in telecommunications. In: Casati, F., Georgakopoulos, D., Shan, M.-C. (eds.) TES 2001. LNCS, vol. 2193, pp. 58–73. Springer, Heidelberg (2001)
[DSVZ06] Deutsch, A., Sui, L., Vianu, V., Zhou, D.: Verification of communicating data-driven web services. In: PODS, pp. 90–99 (2006)
[FBS04] Fu, X., Bultan, T., Su, J.: Conversation protocols: a formalism for specification and verification of reactive electronic services. Theor. Comput. Sci. 328(1-2), 19–37 (2004)
[FGK02] Florescu, D., Grünhagen, A., Kossmann, D.: XL: an XML programming language for web service specification and composition. In: WWW, pp. 65–76 (2002)
[GHR94] Gabbay, D.M., Hodkinson, I.M., Reynolds, M.A.: Temporal Logic. Part 1. Clarendon Press (1994)
[HJJ+95] Henriksen, J.G., Jensen, J.L., Jørgensen, M.E., Klarlund, N., Paige, R., Rauhe, T., Sandholm, A.: Mona: Monadic second-order logic in practice. In: Brinksma, E., Steffen, B., Cleaveland, W.R., Larsen, K.G., Margaria, T. (eds.) TACAS 1995. LNCS, vol. 1019, pp. 89–110. Springer, Heidelberg (1995)
[Hod02] Hodkinson, I.M.: Monodic packed fragment with equality is decidable. Studia Logica 72(2), 185–197 (2002)
[Hol97] Holzmann, G.J.: The model checker SPIN. IEEE Trans. Software Eng. 23(5), 279–295 (1997)
[HWZ00] Hodkinson, I.M., Wolter, F., Zakharyaschev, M.: Decidable fragments of first-order temporal logics. Ann. Pure Appl. Logic 106(1-3), 85–134 (2000)
[HWZ01] Hodkinson, I., Wolter, F., Zakharyaschev, M.: Monodic fragments of first-order temporal logics: 2000–2001 A.D. In: Nieuwenhuis, R., Voronkov, A. (eds.) LPAR 2001. LNCS (LNAI), vol. 2250, pp. 1–23. Springer, Heidelberg (2001)
[HWZ02] Hodkinson, I.M., Wolter, F., Zakharyaschev, M.: Decidable and undecidable fragments of first-order branching temporal logics. In: LICS, pp. 393–402 (2002)
[KMS00] Klarlund, N., Møller, A., Schwartzbach, M.I.: Mona implementation secrets. In: Yu, S., Păun, A. (eds.) CIAA 2000. LNCS, vol. 2088, pp. 182–194. Springer, Heidelberg (2001)
[NM02] Narayanan, S., McIlraith, S.A.: Simulation, verification and automated composition of web services. In: WWW, pp. 77–88 (2002)
[RH04] Ruys, T.C., Holzmann, G.J.: Advanced SPIN tutorial. In: Graf, S., Mounier, L. (eds.) SPIN 2004. LNCS, vol. 2989, pp. 304–305. Springer, Heidelberg (2004)
[TC03] IBM Web Services Business Process Execution Language (WSBPEL) TC: Web services business process execution language version 1.1. Technical report (2003), http://www.ibm.com/developerworks/library/ws-bpel
[VW86] Vardi, M.Y., Wolper, P.: An automata-theoretic approach to automatic program verification (preliminary report). In: LICS, pp. 332–344 (1986)

Relating L-Resilience and Wait-Freedom via Hitting Sets

Eli Gafni¹ and Petr Kuznetsov²

¹ Computer Science Department, UCLA
² Deutsche Telekom Laboratories/TU Berlin

Abstract. The condition of t-resilience stipulates that an n-process program is only obliged to make progress when at least n − t processes are correct. Put another way, the live sets, the collection of process sets such that progress is required if all the processes in one of these sets are correct, are all sets with at least n − t processes. We show that the ability of an arbitrary collection of live sets L to solve distributed tasks is tightly related to the minimum hitting set of L, a minimum-cardinality subset of processes that has a non-empty intersection with every live set. Thus, finding the computing power of L is NP-complete. For the special case of colorless tasks, which allow participating processes to adopt input or output values of each other, we use a simple simulation to show that a task can be solved L-resiliently if and only if it can be solved (h − 1)-resiliently, where h is the size of the minimum hitting set of L. For general tasks, we characterize L-resilient solvability of tasks with respect to a limited notion of weak solvability: in every execution where all processes in some set in L are correct, outputs must be produced for every process in some (possibly different) participating set in L. Given a task T, we construct another task TL such that T is solvable weakly L-resiliently if and only if TL is solvable weakly wait-free.

1

Introduction

One of the most intriguing questions in distributed computing is how to distinguish the solvable from the unsolvable. Consider, for instance, the question of wait-free solvability of distributed tasks. Wait-freedom does not impose any restrictions on the scope of considered executions, i.e., a wait-free solution to a task requires every correct process to output in every execution. However, most interesting distributed tasks cannot be solved in a wait-free manner [6,19]. Therefore, much research is devoted to understanding how the power of solving a task increases as the scope of considered executions decreases. For example, t-resilience considers only executions where at least n − t processes are correct (take infinitely many steps), where n is the number of processes in the system. This provides for solving a larger set of tasks than wait-freedom, since in executions in which fewer than n − t processes are correct, no correct process is required to output.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 191–202, 2011.
© Springer-Verlag Berlin Heidelberg 2011

192

E. Gafni and P. Kuznetsov

What tasks are solvable t-resiliently? It is known that this question is undecidable even with respect to wait-free solvability, let alone t-resilient solvability [9,14]. But is the question about t-resilient solvability in any sense different from the question about wait-free solvability? If we agree that we "understand" wait-freedom [16], do we understand t-resilience to a lesser degree? The answer should be a resounding no if, in the sense of solving tasks, the models can be reduced to each other. That is, if for every task T we can find a task Tt which is solvable wait-free if and only if T is solvable t-resiliently. Indeed, [2,4,8] established that t-resilience can be reduced to wait-freedom. Consequently, the two models are unified with respect to task solvability. In this paper, we consider a generalization of t-resilience, called L-resilience. Here L stands for a collection of subsets of processes. A set in L is referred to as a live set. In the model of L-resilience, a correct process is only obliged to produce outputs if all the processes in some live set are correct. Therefore, the notion of L-resilience represents a restricted class of the adversaries introduced by Delporte et al. [5], described as collections of exact correct sets. L-resilience describes adversaries that are closed under the superset operation: if a correct set is in an adversary, then every superset of it is also in the adversary. We show that the key to understanding L-resilience is the notion of a minimum hitting set of L (called simply a hitting set in the rest of the paper). Given a set system (Π, L), where Π is a set of processes and L is a set of subsets of Π, H is a hitting set of (Π, L) if it is a minimum-cardinality subset of Π that meets every set in L. Intuitively, in every L-resilient execution, i.e., in every execution in which at least one set in L is correct, not all processes in a hitting set of L can fail.
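The hitting-set notion is easy to experiment with. Below is a brute-force Python sketch of our own (the general problem is NP-complete, as the abstract notes, so this is only feasible for small systems); the t-resilience example at the end illustrates the special case where the live sets are all sets of size at least n − t:

```python
from itertools import combinations

def min_hitting_set(processes, live_sets):
    """Smallest H ⊆ processes with a non-empty intersection
    with every set in live_sets (brute force over subset sizes)."""
    for size in range(len(processes) + 1):
        for H in combinations(sorted(processes), size):
            if all(set(H) & L for L in live_sets):
                return set(H)
    return None  # only reachable if some live set is disjoint from processes

# t-resilience over n processes: live sets are all (n - t)-subsets.
# For n = 4, t = 1, every 3-subset is live; any 3-subset excludes only
# one process, so any 2 processes hit all of them, and the minimum
# hitting set has size t + 1 = 2.
procs = {1, 2, 3, 4}
live = [set(c) for c in combinations(procs, 3)]
H = min_hitting_set(procs, live)
assert len(H) == 2
```

Consistent with the discussion above, a hitting set of size k yields k-set agreement under L-resilience: in the t-resilient case this gives (t + 1)-set agreement, matching the classical bound.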
Thus, under L-resilience, we can solve the k-set agreement task among the processes in Π, where k is the hitting set size of (Π, L). In k-set agreement, the processes start with private inputs and the set of outputs is a subset of the inputs of size at most k. Indeed, fix a hitting set H of (Π, L) of size k. Every process in H simply posts its input value in the shared memory, and every other process returns the first value it witnesses to be posted by a process in H. Moreover, using a simple simulation based on [2,4], we derive that L does not allow solving (k − 1)-set agreement or any other colorless task that cannot be solved (k − 1)-resiliently. Thus, we can decompose superset-closed adversaries into equivalence classes, one for each hitting set size, where each class agrees on the set of colorless tasks it allows for solving.

Informally, colorless tasks allow a process to adopt an input or output value of any other participating process. This restriction gives rise to simulation techniques in which dedicated simulators independently “install” inputs for other, possibly non-participating processes, and then take steps on their behalf so that the resulting outputs are still correct and can be adopted by any participant [2,4]. The ability to do this is a strong simplifying assumption when solvability is analyzed. For the case of general tasks, where inputs cannot be installed independently, the situation is less trivial. We address general tasks by considering a restricted notion of weak solvability, which requires every execution where all the processes in

Relating L-Resilience and Wait-Freedom via Hitting Sets


some set in L are correct to produce outputs for every process in some (possibly different) participating set in L. Note that for colorless tasks, weak solvability is equivalent to regular solvability, which requires every correct process to output.

We relate wait-free solvability and L-resilient solvability. Given a task T and a collection of live sets L, we define a task TL such that T is weakly solvable L-resiliently if and only if TL is weakly solvable wait-free. This characterizes L-resilient weak solvability, since wait-free solvability has already been characterized in [16]. Not surprisingly, the notion of a hitting set is crucial in determining TL.

The simulations that relate T and TL are interesting in their own right. We describe an agreement protocol, called the Resolver Agreement Protocol (or RAP), by which agreement is immediately achieved if all processes propose the same value, and otherwise it is achieved if eventually a single correct process considers itself a dedicated resolver. This agreement protocol allows for a novel execution model of wait-free read-write protocols. The model guarantees that an arbitrary number of simulators starting with j distinct initial views appear as j independent simulators, and thus a (j − 1)-resilient execution can be simulated.

The rest of the paper is organized as follows. Section 2 briefly describes our system model. Section 3 presents a simple categorization of colorless tasks. Section 4 formally defines the wait-free counterpart TL to every task T. Section 5 describes RAP, the technical core of our main result. Sections 6 and 7 present the two directions of our equivalence result: from wait-freedom to L-resilience and back. Section 8 overviews related work, and Section 9 concludes the paper by discussing implications of our results and open questions. Most proofs are delegated to the technical report [10].
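The hitting-set-based k-set agreement protocol sketched in the introduction can be illustrated as follows (a sequential replay of one interleaving, written for this exposition; the function name and the explicit step `schedule` are our encoding, not the paper's):

```python
def k_set_agreement(inputs, H, schedule):
    """inputs: process -> private input; H: a hitting set of (Pi, L);
    schedule: the order in which processes take steps."""
    mem = {}   # one single-writer register per process in H
    out = {}
    for p in schedule:
        if p in out:
            continue
        if p in H:
            mem[p] = inputs[p]          # H-members post and decide their input
            out[p] = inputs[p]
        else:
            posted = [mem[q] for q in sorted(H) if q in mem]
            if posted:                  # adopt the first witnessed H-value
                out[p] = posted[0]
    return out
```

Every output is a value posted by a member of H, so at most |H| = k distinct values are ever decided, in any schedule.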

2 Model

We adopt the conventional shared memory model [12], and only describe the necessary details.

Processes and objects. We consider a distributed system composed of a set Π of n processes {p1, . . . , pn} (n ≥ 2). Processes communicate by applying atomic operations on a collection of shared objects. In this paper, we assume that the shared objects are registers that export only atomic read-write operations. The shared memory can be accessed using atomic snapshot operations [1]. An execution is a pair (I, σ) where I is an initial state and σ is a sequence of process ids. A process that takes at least one step in an execution is called participating. A process that takes infinitely many steps in an execution is said to be correct; otherwise, the process is faulty.

Distributed tasks. A task is defined through a set I of input n-vectors (one input value for each process, where the value is ⊥ for a non-participating process), a set O of output n-vectors (one output value for each process, ⊥ for non-terminated processes), and a total relation Δ that associates each input vector with a set of possible output vectors. A protocol wait-free solves a task T


if in every execution, every correct process eventually outputs, and all outputs respect the specification of T.

Live sets. The correct set of an execution e, denoted correct(e), is the set of processes that appear infinitely often in e. For a given collection of live sets L, we say that an execution e is L-resilient if for some L ∈ L, L ⊆ correct(e). We consider protocols which allow each process to produce output values for every other participating process in the system by posting the values in the shared memory. We say that a process terminates when its output value is posted (possibly by a different process).

Hitting sets. Given a set system (Π, L) where L is a set of subsets of Π, a set H ⊆ Π is a hitting set of (Π, L) if it is a minimum-cardinality subset of Π that meets every set in L. We denote the set of hitting sets of (Π, L) by HS(Π, L), and the size of a hitting set of (Π, L) by h(Π, L). By (Π′, L), Π′ ⊆ Π, we denote the set system that consists of the elements S ∈ L such that S ⊆ Π′.

The BG-simulation technique. In a colorless task (also called a convergence task [4]), processes are free to use each others' input and output values, so the task can be defined in terms of input and output sets instead of vectors. BG-simulation is a technique by which k + 1 processes q1, . . ., qk+1, called simulators, can wait-free simulate a k-resilient execution of any asynchronous n-process protocol [2,4] solving a colorless task. The simulation guarantees that each simulated step of every process pj is either eventually agreed on by all simulators, or the step is blocked forever and one less simulator participates further in the simulation. Thus, as long as there is a live simulator, at least n − k simulated processes take infinitely many simulated steps. The technique has later been extended to tasks beyond colorless ones [8].

Weak L-resilience. An execution is L-resilient if some set in L contains only correct processes.
We say that a protocol solves a task T weakly L-resiliently if in every L-resilient execution, every process in some participating set L ∈ L eventually terminates, and all posted outputs respect the specification of T. In the wait-free case, when L consists of all n singletons, weak L-resilient solvability stipulates that at least one participating process must be given an output value in every execution. Weak solvability is sufficient to (strongly) solve every colorless task. For general tasks, however, weak solvability does not automatically imply strong solvability, since it only allows processes to adopt the output value of any terminated process, and does not impose any conditions on the inputs.

3 Colorless Tasks

Theorem 1. A colorless task T is L-resiliently solvable if and only if T is (h(Π, L) − 1)-resiliently solvable.

Theorem 1 implies that L-resilient adversaries can be categorized into n equivalence classes, class h corresponding to hitting sets of size h. Note that two


adversaries that belong to the same class h agree on the set of colorless tasks they are able to solve, and the set includes h-set agreement.

4 Relating L-Resilience and Wait-Freedom: Definitions

Consider a set system (Π, L) and a task T = (I, O, Δ), where I is a set of input vectors, O is a set of output vectors, and Δ is a total binary relation between them. In this section, we define the “wait-free” task TL = (I′, O′, Δ′) that characterizes L-resilient solvability of T. The task TL is also defined for n processes. We call the processes solving TL simulators and denote them by s1, . . . , sn.

Let X and X′ be two n-vectors, and Z1, . . . , Zn be subsets of Π. We say that X′ is an image of X with respect to Z1, . . . , Zn if, for all i such that X′[i] ≠ ⊥, we have X′[i] = {(j, X[j])}_{j∈Zi}.

Now TL = (I′, O′, Δ′) guarantees that for all (I′, O′) ∈ Δ′, there exists (I, O) ∈ Δ such that:

(1) ∃ S1, . . . , Sn ⊆ Π, each containing a set in L:
(1a) I′ is an image of I with respect to S1, . . . , Sn.
(1b) |{I′[i]}_i − {⊥}| = m ⇒ h(∪_{i: I′[i] ≠ ⊥} Si, L) ≥ m.

In other words, every process participating in TL obtains, as an input, a set of inputs of T for some live set, and all these inputs are consistent with some input vector I of T. Also, if the number of distinct non-⊥ inputs to TL is m, then the hitting set size of the set of processes that are given inputs of T is at least m.

(2) ∃ U1, . . . , Un, each containing a set in L: O′ is an image of O with respect to U1, . . . , Un.

In other words, the outputs of TL produced for input vector I′ should be consistent with some O ∈ O such that (I, O) ∈ Δ.

Intuitively, every group of simulators that share the same input value will act as a single process. By the assumptions on the inputs to TL, the existence of m distinct inputs implies a hitting set of size at least m. The asynchrony among the m groups will be manifested as at most m − 1 failures. The failures of at most m − 1 processes cannot prevent all live sets from terminating, since otherwise the hitting set in (1b) would be of size at most m − 1.
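The “image” relation above is a purely mechanical check. The helper below (plain Python written for this exposition; `BOT` and `is_image` are our names) verifies that an n-vector X′ is an image of X with respect to Z1, . . . , Zn:

```python
BOT = None   # stands for the undefined value ⊥

def is_image(Xp, X, Z):
    """Xp is an image of X w.r.t. Z if, wherever Xp[i] is defined,
    it equals the set of (index, value) pairs of X restricted to Z[i]."""
    return all(
        Xp[i] == {(j, X[j]) for j in Z[i]}
        for i in range(len(Xp))
        if Xp[i] is not BOT
    )

# Process 0 sees the inputs of {0, 1}; process 2 sees those of {1, 2};
# process 1 has no entry (⊥), so it is unconstrained.
X = ['a', 'b', 'c']
Xp = [{(0, 'a'), (1, 'b')}, BOT, {(1, 'b'), (2, 'c')}]
Z = [{0, 1}, set(), {1, 2}]
```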

5 Resolver Agreement Protocol

We describe the principal building block of our constructions: the resolver agreement protocol (RAP). RAP is similar to consensus, though it is neither always safe nor always live. To improve liveness, some process may at some point become a resolver, i.e., take the responsibility of making sure that every correct process outputs. Moreover, if there is at most one resolver, then all outputs are the same.


Shared variables: D, initially ⊥
Local variables: resolver, initially false

propose(v)
1   (flag, est) := CA.propose(v)
2   if flag = commit then
3       D := est; return(est)
4   repeat
5       if resolver then D := est
6   until D ≠ ⊥
7   return(D)

resolve()
8   resolver := true

Fig. 1. Resolver agreement protocol: code for each process

Formally, the protocol accepts values in some set V as inputs and exports operations propose(v), v ∈ V, and resolve(), which, once called by a process, indicates that the process becomes a resolver for this RAP. The propose operation returns some value in V, and the following guarantees are provided: (i) every returned value is a proposed value; (ii) if all processes start with the same input value or some process returns, then every correct process returns; (iii) if a correct process becomes a resolver, then every correct process returns; (iv) if at most one process becomes a resolver, then at most one value is returned.

A protocol that solves RAP is presented in Figure 1. The protocol uses the commit-adopt abstraction (CA) [7], which exports one operation propose(v) that returns (commit, v′) or (adopt, v′), for v, v′ ∈ V, and guarantees that (a) every returned value is a proposed value, (b) if only one value is proposed then this value must be committed, (c) if a process commits a value v, then every process that returns adopts v or commits v, and (d) every correct process returns. The commit-adopt abstraction can be implemented wait-free [7].

In the protocol, a process that is not a resolver takes a finite number of steps and then either returns with a value, or waits until a value is posted in register D by another process or by a resolver. A process that waits for an output (lines 4–6) considers the agreement protocol stuck. An agreement protocol for which a value was posted in D is called resolved.

Lemma 1. The algorithm in Figure 1 implements RAP.
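The commit-adopt abstraction is only described in prose above. A lock-step sketch (our simplification for this exposition: all processes move in synchronized rounds, so this illustrates properties (a)–(d) but is not the asynchronous wait-free implementation of [7]) looks as follows:

```python
def commit_adopt(proposals):
    """Lock-step sketch of commit-adopt: process i proposes proposals[i]
    and receives ('commit', v) or ('adopt', v)."""
    # Round 1: everyone posts its proposal, then reads all posts.
    seen = set(proposals)
    # flag_i is true iff process i saw a single proposed value.
    round1 = [(len(seen) == 1, v) for v in proposals]
    # Round 2: post (flag, value); commit iff every posted pair is flagged,
    # adopt a flagged value if one exists, otherwise keep your own.
    decisions = []
    for flag, v in round1:
        flagged = [val for f, val in round1 if f]
        if flagged and all(f for f, _ in round1):
            decisions.append(("commit", flagged[0]))
        elif flagged:
            decisions.append(("adopt", flagged[0]))
        else:
            decisions.append(("adopt", v))
    return decisions
```

Under lock-step scheduling all processes see the same posts, so the middle branch never fires; it is the asynchronous interleavings that make property (c), "a commit forces everyone to adopt that value," non-trivial.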

6 From Wait-Freedom to L-Resilience

Suppose that TL is weakly wait-free solvable and let AL be the corresponding wait-free protocol. We show that weak wait-free solvability of TL implies weak L-resilient solvability of T by presenting an algorithm A that uses AL to solve T in every L-resilient execution.


Shared variables: R_j, j = 1, . . . , n, initially ⊥
Local variables: S_j, j = 1, . . . , h(Π, L), initially ∅
                 ℓ_j, j = 1, . . . , h(Π, L), initially 0

9   R_i := input value of T
10  wait until snapshot(R_1, . . . , R_n) contains inputs for some set in L
11  while true do
12      S := {p_k ∈ Π : R_k ≠ ⊥}   {the current participating set}
13      if p_i ∈ H_S then   {H_S is deterministically chosen in HS(S, L)}
14          m := the index of p_i in H_S
15          RAP_m^{ℓ_m}.resolve()
16      for each j = 1, . . . , |H_S| do
17          if ℓ_j = 0 then
18              S_j := S
19          take one more step of RAP_j^{ℓ_j}.propose(S_j)
20          if RAP_j^{ℓ_j}.propose(S_j) returns v then
21              (flag, S_j) := CA_j^{ℓ_j}.propose(v)
22              if flag = commit then
23                  return({(s, R_s)}_{p_s ∈ S_j})   {return the set of inputs of processes in S_j}
24              ℓ_j := ℓ_j + 1

Fig. 2. The doorway protocol: the code for each process p_i

First we describe the doorway protocol (DW), the only L-dependent part of our transformation. The responsibility of DW is to collect at each process a subset of the inputs of T so that all the collected subsets constitute a legitimate input vector for task TL (property (1) in Section 4). The doorway protocol does not require knowledge of T or TL and depends only on L. In contrast, the second part of the transformation, described in Section 6.2, does not depend on L and is implemented by simply invoking the wait-free task TL with the inputs provided by DW.

6.1 The Doorway Protocol

Formally, a DW protocol ensures that in every L-resilient execution with an input vector I ∈ I, every correct participant eventually obtains a set of inputs of T so that the resulting input vector I′ of TL complies with property (1) in Section 4 with respect to I. The algorithm implementing DW is presented in Figure 2.

Initially, each process p_i waits until it collects inputs for a set of participating processes that includes at least one live set. Note that different processes may observe different participating sets. Every participating set S is associated with H_S ∈ HS(S, L), some deterministically chosen hitting set of (S, L). We say that H_S is a resolver set: if S is the participating set, then we initiate |H_S| parallel sequences of agreement protocols with resolvers. Each sequence of agreement protocols can


return at most one value, and we guarantee that, eventually, every sequence is associated with a distinct resolver in H_S. In every such sequence j, each process p_i sequentially goes through an alternation of RAPs and CAs (see Section 5): RAP_j^1, CA_j^1, RAP_j^2, CA_j^2, . . .. The first RAP is invoked with the initially observed set of participants, and each next CA (resp., RAP) takes the output of the previous RAP (resp., CA) as an input. If some CA in sequence j returns (commit, v), then p_i returns v as an output of the doorway protocol.

Lemma 2. In every L-resilient execution of the algorithm in Figure 2 starting with an input vector I, every correct process p_i terminates with an output value I′[i], and the resulting vector I′ complies with property (1) in Section 4 with respect to I.

6.2 Solving T through the Doorway

Given the DW protocol described above, it is straightforward to solve T by simply invoking AL with the inputs provided by DW. Thus:

Theorem 2. Task T is weakly L-resiliently solvable if TL is weakly wait-free solvable.

7 From L-Resilience to Wait-Freedom

Suppose T is weakly L-resiliently solvable, and let A be the corresponding protocol. We describe a protocol AL that solves TL by wait-free simulating an L-resilient execution of A. For pedagogical reasons, we first present a simple abstract simulation (AS) technique. AS captures the intuition that a group of simulators sharing the initial view of the set of participating simulated codes should appear as a single simulator. Therefore, an arbitrary number of simulators starting with j distinct initial views should be able to simulate a (j − 1)-resilient execution. Then we describe our specific simulation and show that it is an instance of AS, and thus it indeed generates a (j − 1)-resilient execution of A, where j is the number of distinct inputs of TL. By the properties of TL, we immediately obtain a desired L-resilient execution of A.

7.1 Abstract Simulation

Suppose that we want to simulate a given n-process protocol, with the set of codes {code1, . . . , coden}. Every instruction of the simulated codes (read or write) is associated with a unique position in N. E.g., we can enumerate the instructions as follows: the first instructions of each simulated code, then the second instructions of each simulated code, etc.¹

¹ In fact, only read instructions of a read-write protocol need to be simulated, since these are the only steps that may trigger more than one state transition of the invoking process [2,4].


A state of the simulation is a map from the set of positions to colors {U, IP, V}: every position has one of three colors, U (unvisited), IP (in progress), or V (visited). Initially, every position is unvisited. The simulators share a function next that maps every state to the next unvisited position to simulate. Accessing an unvisited position by a simulator results in changing its color to IP or V.

Fig. 3. State transitions of a position in AS: U turns into IP when different states are concurrently proposed; U turns into V when identical states are concurrently proposed, or by the adversary; IP turns into V by the adversary

The state transitions of a position are summarized in Figure 3, and the rules the simulation follows are described below:

(AS1) Each process takes an atomic snapshot of the current state s and goes to position next(s), proposing state s. For each state s, the color of next(s) in state s is U.
- If an unvisited position is concurrently accessed by two processes proposing different states, then it is assigned color IP.
- If an unvisited position is accessed by every process proposing the same state, it may only change its color to V.
- If the accessed position is already V (a faster process accessed it before), then the process leaves the position unchanged, takes a new snapshot, and proceeds to the next position.

(AS2) At any point in the simulation, the adversary may take an in-progress (IP) position and atomically turn it into V, or take a set of unvisited (U) positions and atomically turn them into V.

(AS3) Initially, every position is assigned color U. The simulation starts when the adversary changes the colors of some positions to V.

We measure the progress of the simulation by the number of positions turning from U to V. Note that by changing U or IP positions to V, the adversary can potentially hamper the simulation by causing some U positions to be accessed with different states, thus changing their colors to IP. However, the following invariant is preserved:

Lemma 3. If the adversary is allowed at any state to change the colors of arbitrarily many IP positions to V, and throughout the simulation has j chances to atomically change any set of U positions to V, then at any time there are at most j − 1 IP positions.
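The color transitions of Figure 3 fit in a few lines (a toy Python model written for this exposition; the class and method names are ours):

```python
class Position:
    """One simulated instruction, colored U, IP, or V."""
    def __init__(self):
        self.color = "U"

    def access(self, proposed_states):
        # (AS1) Simulators access an unvisited position, each proposing the
        # state it snapshotted: identical proposals let the position become V,
        # diverging proposals leave it in progress (IP).
        if self.color != "U":
            return   # already IP or V: the access changes nothing
        self.color = "V" if len(set(proposed_states)) == 1 else "IP"

    def adversary_to_V(self):
        # (AS2) The adversary may atomically turn a U or IP position into V.
        self.color = "V"
```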


7.2 Solving TL through AS

Now we show how to solve TL by simulating a protocol A that weakly L-resiliently solves T. First, we describe our simulation and show that it instantiates AS, which allows us to apply Lemma 3.

Every simulator si ∈ {s1, . . . , sn} posts its input in the shared memory and then continuously simulates the participating codes in {code1, . . . , coden} of algorithm A in a breadth-first manner: the first command of every participating code, the second command of every participating code, etc. (A code is considered participating if its input value has been posted by at least one simulator.) The procedure is similar to BG-simulation, except that the result of every read command in the code is agreed upon through a distinct RAP instance. Simulator si is statically assigned to be the only resolver of every read command in codei.

The simulated read commands (and the associated RAPs) are treated as positions of AS. Initially, all positions are U (unvisited). The outcome of accessing a RAP instance of a position determines its color. If the RAP is resolved (a value was posted in D in line 3 or 5), then it is given color V (visited). If the RAP is found stuck (waiting for an output in lines 4–6) by some process, then it is given color IP (in progress). Note that no RAP accessed with identical proposals can get stuck (property (ii) in Section 5). After accessing a position, the simulator chooses the first not-yet-executed command of the next participating code in a round-robin manner (function next). For the next simulated command, the simulator proposes its current view of the simulated state, i.e., the snapshot of the results of all commands simulated so far (AS1).

Further, if a RAP of codei is observed stuck by a simulator (and thus is assigned color IP), but later gets resolved by si, we model it as the adversary spontaneously changing the position's color from IP to V.
Finally, by the properties of RAP, a position can get color IP only if it is concurrently accessed with diverging states (AS2). We also have n positions corresponding to the input values of the codes, initially unvisited. If an input for a simulated process pi is posted by a simulator, the initial position of codei turns into V. This is modeled as the intrusion of the adversary, and if the simulators start with j distinct inputs, then the adversary is given j chances to atomically change sets of U positions to V. The simulation starts when the first set of simulators post their inputs and concurrently take identical snapshots (AS3).

Therefore, our simulation is an instance of AS, and thus we can apply Lemma 3 to prove the following result:

Lemma 4. If the number of distinct values in the input vector of TL is j, then the simulation above blocks at most j − 1 simulated codes.

The simulated execution terminates when some simulator observes outputs of T for at least one participating live set. Finally, using the properties of the inputs to task TL (Section 4), we derive that eventually some participating live set of simulated processes obtains outputs. Thus, using Theorem 2, we obtain:


Theorem 3. T is weakly L-resiliently solvable if and only if TL is weakly wait-free solvable.

8 Related Work

The equivalence between t-resilient task solvability and wait-free task solvability was initially established for colorless tasks in [2,4], and then extended to all tasks in [8]. In this paper, we consider a wider class of assumptions than simply t-resilience, so our result can be seen as a strict generalization of [8].

Generalizing t-resilience, Junqueira and Marzullo [18] considered the case of dependent failures and proposed describing the allowed executions through cores and survivor sets, which roughly translate to our hitting sets and live sets. Note that the set of survivor sets (or, equivalently, cores) exhaustively describes only superset-closed adversaries. The more general adversaries introduced by Delporte et al. [5] are defined as a set of exact correct sets. It is shown in [5] that the power of an adversary A to solve colorless tasks is characterized by A's disagreement power, the highest k such that k-set agreement cannot be solved assuming A: a colorless task T is solvable with an adversary A of disagreement power k if and only if it is solvable k-resiliently.

Herlihy and Rajsbaum [15] (concurrently and independently of this paper) derived this result for a restricted set of superset-closed adversaries with a given core size, using elements of modern combinatorial topology. Theorem 1 in this paper derives this result directly, using very simple algorithmic arguments.

Considering only colorless tasks is a strong restriction, since such tasks allow for definitions that only depend on sets of inputs and sets of outputs, regardless of which processes actually participate. (Recall that for colorless tasks, solvability and our weak solvability are equivalent.) The results of this paper hold for all tasks. On the other hand, like [15], we only consider the class of superset-closed adversaries. This filters out some popular liveness properties, such as obstruction-freedom [13]. Thus, our contributions complement but do not contain the results in [5].
A protocol similar to our RAP was earlier proposed in [17].

9 Side Remarks and Open Questions

Doorways and iterated phases. Our characterization shows an interesting property of weak L-resilient solvability: to solve a task T weakly L-resiliently, we can proceed in two logically synchronous phases. In the first phase, processes wait to collect “enough” input values, as prescribed by L, without knowing anything about T. Logically, they all finish the waiting phase simultaneously. In the second phase, they all proceed wait-free to produce a solution. As a result, no process is waiting on another process that has already proceeded to the wait-free phase. Such phases are usually referred to as iterated phases [3]. In [8], some processes wait on others to produce an output, and consequently the characterization in [8] does not have the iterated structure.

L-resilience and general adversaries. The power of a general adversary of [5] is not exhaustively captured by its hitting set. In a companion paper [11], we propose a simple characterization of the set consensus power of a general adversary


A based on the hitting set sizes of its recursively proper subsets. Extending our equivalence result to general adversaries and getting rid of the weak solvability assumption are two challenging open questions.

References

1. Afek, Y., Attiya, H., Dolev, D., Gafni, E., Merritt, M., Shavit, N.: Atomic snapshots of shared memory. J. ACM 40(4), 873–890 (1993)
2. Borowsky, E., Gafni, E.: Generalized FLP impossibility result for t-resilient asynchronous computations. In: STOC, pp. 91–100. ACM Press, New York (May 1993)
3. Borowsky, E., Gafni, E.: A simple algorithmically reasoned characterization of wait-free computation (extended abstract). In: PODC 1997: Proceedings of the Sixteenth Annual ACM Symposium on Principles of Distributed Computing, pp. 189–198. ACM Press, New York (1997)
4. Borowsky, E., Gafni, E., Lynch, N.A., Rajsbaum, S.: The BG distributed simulation algorithm. Distributed Computing 14(3), 127–146 (2001)
5. Delporte-Gallet, C., Fauconnier, H., Guerraoui, R., Tielmann, A.: The disagreement power of an adversary. In: Keidar, I. (ed.) DISC 2009. LNCS, vol. 5805, pp. 8–21. Springer, Heidelberg (2009)
6. Fischer, M.J., Lynch, N.A., Paterson, M.S.: Impossibility of distributed consensus with one faulty process. J. ACM 32(2), 374–382 (1985)
7. Gafni, E.: Round-by-round fault detectors (extended abstract): Unifying synchrony and asynchrony. In: Proceedings of the 17th Symposium on Principles of Distributed Computing (1998)
8. Gafni, E.: The extended BG-simulation and the characterization of t-resiliency. In: STOC, pp. 85–92 (2009)
9. Gafni, E., Koutsoupias, E.: Three-processor tasks are undecidable. SIAM J. Comput. 28(3), 970–983 (1999)
10. Gafni, E., Kuznetsov, P.: L-resilient adversaries and hitting sets. CoRR, abs/1004.4701 (2010), http://arxiv.org/abs/1004.4701
11. Gafni, E., Kuznetsov, P.: Turning adversaries into friends: Simplified, made constructive, and extended. In: OPODIS (2011)
12. Herlihy, M.: Wait-free synchronization. ACM Trans. Prog. Lang. Syst. 13(1), 123–149 (1991)
13. Herlihy, M., Luchangco, V., Moir, M.: Obstruction-free synchronization: Double-ended queues as an example. In: ICDCS, pp. 522–529 (2003)
14. Herlihy, M., Rajsbaum, S.: The decidability of distributed decision tasks (extended abstract). In: STOC, pp. 589–598 (1997)
15. Herlihy, M., Rajsbaum, S.: The topology of shared-memory adversaries. In: PODC (2010)
16. Herlihy, M., Shavit, N.: The topological structure of asynchronous computability. J. ACM 46(2), 858–923 (1999)
17. Imbs, D., Raynal, M.: Visiting Gafni's reduction land: From the BG simulation to the extended BG simulation. In: SSS, pp. 369–383 (2009)
18. Junqueira, F., Marzullo, K.: A framework for the design of dependent-failure algorithms. Concurrency and Computation: Practice and Experience 19(17), 2255–2269 (2007)
19. Loui, M., Abu-Amara, H.: Memory requirements for agreement among unreliable asynchronous processes. Advances in Computing Research 4, 163–183 (1987)

Load Balanced Scalable Byzantine Agreement through Quorum Building, with Full Information

Valerie King¹, Steven Lonergan¹, Jared Saia², and Amitabh Trehan¹

¹ Department of Computer Science, University of Victoria, P.O. Box 3055, Victoria, BC, Canada V8W 3P6
[email protected], {sdlonergan,amitabh.trehaan}@gmail.com
² Department of Computer Science, University of New Mexico, Albuquerque, NM 87131-1386
[email protected]

Abstract. We address the problem of designing distributed algorithms for large-scale networks that are robust to Byzantine faults. We consider a message-passing, full-information model: the adversary is malicious, controls a constant fraction of processors, and can view all messages in a round before sending out its own messages for that round. Furthermore, each bad processor may send an unlimited number of messages. The only constraint on the adversary is that it must choose its corrupt processors at the start, without knowledge of the processors' private random bits.

A good quorum is a set of O(log n) processors which contains a majority of good processors. In this paper, we give a synchronous algorithm which uses polylogarithmic time and Õ(√n) bits of communication per processor to bring all processors to agreement on a collection of n good quorums, solving Byzantine agreement as well. The collection is balanced in that no processor is in more than O(log n) quorums. This yields the first solution to Byzantine agreement which is both scalable and load-balanced in the full information model.

The technique, which goes from a situation where slightly more than a 1/2 fraction of processors are good and agree on a short string with a constant fraction of random bits to a situation where all good processors agree on n good quorums, can be carried out in a fully asynchronous model as well, providing an approach for extending the Byzantine agreement result to this model.

1 Introduction

The last ﬁfteen years have seen computer scientists slowly come to terms with the following alarming fact: not all users of the Internet can be trusted. While this fact is hardly surprising, it is alarming. If the size of the Internet is unprecedented in the history of engineered systems, then how can we hope to address the challenging problem of scalability and also the challenging problem of resistance to malicious users?

This research was partially supported by NSF CAREER Award 0644058, NSF CCR0313160, AFOSR MURI grant FA9550-07-1-0532, and NSERC.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 203–214, 2011. c Springer-Verlag Berlin Heidelberg 2011

204

V. King et al.

Recent work attempts to address both of these problems concurrently. In the last few years, almost-everywhere Byzantine agreement, i.e., coordination between all but a o(1) fraction of processors, was shown to be possible with no more than polylogarithmic bits of communication per processor and polylogarithmic time [13]. More recently, scalable everywhere agreement was shown to be possible if a small set of processors takes on the brunt of the work, each communicating Ω(n^{3/2}) bits to the remaining processors [11], or if private channels are assumed [12].

In this paper, we give the first load-balanced, scalable method for agreeing on a bit in the synchronous, full-information model. In particular, our algorithm requires each processor to send only Õ(√n) bits. Our technique also yields agreement on a collection of n good quorum gateways (referred to as quorums from now on), that is, sets of processors of size O(log n), each of which contains a majority of good processors, and a 1-1 mapping of processors to quorums. The collection is balanced in that no processor is in more than O(log n) quorums.

Our usage of the quorum terminology is similar to that in the peer-to-peer literature [17,6,1,3,5,8], where quorums are of O(log n) size, each having a majority of good processors, and allow for containment of adversarial behavior via majority filtering. Quorums are useful in an environment with malicious processors as they can act as a gateway to filter messages sent by bad processors. For example, a bad processor x can be limited in the number of messages it sends if other processors only accept messages sent by a majority of processors in x's quorum, and the majority only agree to forward a limited number of messages from x.

The number of bits of communication required per processor is polylogarithmic to bring all but o(1) processors to agreement, and Õ(√n) per processor for everywhere agreement on the composition of the n quorums.
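The majority-filtering rule described above is straightforward to state in code (a minimal Python sketch for this exposition; the function name and the `forwards` map, recording which quorum members endorsed which messages, are our own encoding):

```python
def accept_message(msg, quorum, forwards):
    """Accept msg from a processor x only if a majority of x's quorum
    members have forwarded (endorsed) it."""
    endorsers = {q for q in quorum if msg in forwards.get(q, set())}
    return len(endorsers) > len(quorum) // 2

# A quorum of 3 gateways; two of them endorse the message, one stays silent.
forwards = {"g1": {"hello"}, "g2": {"hello"}, "g3": set()}
```

Because a good quorum has a good majority, a message endorsed this way was vetted by at least one good processor, which is what lets the quorum throttle a flooding bad processor.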
Our result holds with an adversary that controls up to a 1/3 − ε fraction of processors, for any fixed ε > 0, and that has full information, i.e., it knows the content of all messages passed between good processors. However, the adversary is non-adaptive, that is, it cannot choose dynamically which processors to corrupt based on its observations of the protocol's execution. Bad processors are allowed to send an unlimited number of bits and messages, and defense against such a denial-of-service attack is one of the features of our protocol. As an additional result, we present an asynchronous algorithm that can go from a situation where, for any positive constant γ, a 1/2 + γ fraction of processors are good and agree on a single string of length O(log n) with a constant fraction of random bits, to a situation where all good processors agree on n good quorums. This algorithm is load-balanced in that it requires each processor to send only Õ(√n) bits, and the resulting quorums are balanced in that no processor is in more than O(log n) quorums.

1.1 Methodology

Load Balanced Scalable Byzantine Agreement

Our synchronous protocol builds on a previous protocol which brings all but o(1) processors to agreement on a set of s = O(log n) processors of which no more than a 1/3 − ε fraction are bad, using a sparse overlay network [14]. Being few in number, these processors can run a heavyweight protocol requiring all-to-all communication to also agree on a string globalstr which contains a bit (or multiple bits) from each processor, such that a 2/3 + ε fraction of the bits are randomly set. This string can be communicated scalably to almost all processors using a communication tree formed as a byproduct of the protocol (see [13,14]).

When a clear majority of good processors agree on a value, a processor should be able to learn that value, with high probability, by polling O(log n) processors. However, the bad processors can thwart this approach by flooding all processors with requests. Even if there are few bad processors, in the full information model the bad processors can target the processors on specific good processors' poll lists in order to isolate those processors. To address this problem, we use globalstr to build quorums that limit the number of effective requests. We also restrict the design of poll lists, preserving enough randomness that they are reliable while limiting the adversary's ability to target.

Key to our work here is that we show the existence of an averaging-sampler-type function H, known at the start by all processors, which with high probability, when given an O(log n)-length string with a constant fraction of random bits and a processor ID, produces a good quorum for every ID. Our protocol then uses the fact that almost all processors agree on a collection of good quorums to bring all processors to agreement on the string in a load balanced manner, and hence on the collection of quorums. Similarly, to solve Byzantine agreement, a single bit agreed to by the initial small set can be agreed to by all the processors. We also show the existence of a function J which uses individually generated random strings and a processor's ID to output an O(log n) poll list, so that the distribution of poll lists has the desired properties. These techniques can be extended to the asynchronous model assuming a scalable implementation of [10].
That work shows that a set of O(log log n) processors with a 2/3 + ε fraction of good processors can be agreed to almost everywhere with probability 1 − o(1). Bringing these processors to agreement on a string with some random bits is trickier in the asynchronous full information model, where the adversary can prevent a fraction of the good processors from being heard based on their random bits. However, [10] shows that it is possible to bring such a set to agreement on a string with some randomness, which we show is enough to provide a good input to H.

1.2 Related Work

Several papers with related results are mentioned above. Most closely related is the algorithm in [11], which similarly starts with almost-everywhere agreement on a bit and a small representative set of processors from [13,14] and produces everywhere agreement. However, it is not load balanced, and it does not create quorums or require the use of the specially designed functions H and J. With private channels, load balancing in the presence of an adaptive adversary is achievable with Õ(√n) bits of communication per processor [12].

Awerbuch and Scheideler have done important work in the area of maintaining quorums [3,4,5,6]. They show how to scalably support a distributed hash table (DHT) using quorums of size O(log n) where processors are joining and leaving, a functionality our method does not support. The adversary they consider is nonadaptive in the sense that processors cannot spontaneously be corrupted; the adversary can only decide to cause a good processor to drop out and decide whether an entering processor is good or bad. A critical difference between their results and ours is that while they can maintain a system that starts in a good configuration, they cannot initialize such a system unless the starting processors are all good. This is because an entering processor must start by contacting a good processor in a good quorum. The quorum uses secret sharing to produce a random number to assign or reassign new positions in a sparse overlay network (using the cuckoo rule [15]). These numbers and positions are created using a method for secret sharing involving private channels and cryptographic hardness assumptions.

In older work, Upfal, and Dwork, Peleg and Pippenger addressed the problem of solving almost-everywhere agreement on a bounded degree network [16,7]. However, the algorithms described in these papers are not scalable. In particular, both algorithms require each processor to send at least a linear number of bits (and sometimes an exponential number).

1.3 Model

We assume a fully connected network of n processors, whose IDs are common knowledge. Each processor has a private coin. Communication channels are authenticated, in the sense that whenever a processor sends a message directly to another, the identity of the sender is known to the recipient, but we otherwise make no cryptographic assumptions. We assume a nonadaptive (sometimes called static) adversary. That is, the adversary chooses the set of tn bad processors at the start of the protocol, where t is a constant fraction, namely t = 1/3 − ε for any positive constant ε. The adversary is malicious: bad processors can engage in any kind of deviation from the protocol, including false messages, collusion, and crash failures, and bad processors can send any number of messages. Moreover, the adversary chooses the input bit of every processor. The good processors are those that follow the protocol.

We consider both synchronous and asynchronous models of communication. In the synchronous model, communication proceeds in rounds: messages are all sent out at the same time at the start of the round and then received at the same time at the end of the same round; all processors have synchronized clocks. The time complexity is given by the number of rounds. In the asynchronous model, each communication can take an arbitrary and unknown amount of time, and there is no assumption of a joint clock as in the synchronous model. The adversary can determine the delay of each message and the order in which messages are received. We follow [2] in defining the running time of an asynchronous protocol as the time of execution, where the maximum delay of a message between the time it is sent and the time it is processed is assumed to be one unit.

We assume full information: in the synchronous model, the adversary is rushing, that is, it can view all messages sent by the good processors in a round before the bad processors send their messages in the same round.
In the case of the asynchronous model, the adversary can view any sent message before its delay is determined.

1.4 Results

We use the phrase with high probability (w.h.p.) to mean that an event happens with probability at least 1 − 1/n^c, for any constant c and sufficiently large n. We show:

Theorem 1 (Synchronous Byzantine Agreement). Let n be the number of processors in a synchronous full information message passing model with a nonadaptive, rushing adversary that controls less than a 1/3 − ε fraction of processors. For any positive constant ε, there exists a protocol which w.h.p. computes Byzantine agreement, runs in polylogarithmic time, and uses Õ(√n) bits of communication per processor.

This result follows from the application of the load balanced protocol in [14], followed by the synchronous protocol introduced in Section 3 of this paper.

Theorem 2 (Almost everywhere to everywhere, asynchronous). Let n be the number of processors in a fully asynchronous full information message passing model with a nonadaptive adversary. Assume that (1/2 + γ)n good processors agree on a string of length O(log n) which has a constant fraction of random bits, and where the remaining bits are fixed by a malicious adversary after seeing the random bits. Then for any positive constant γ, there exists a protocol which w.h.p. brings all good processors to agreement on n good quorums; runs in polylogarithmic time; and uses Õ(√n) bits of communication per processor. Furthermore, if we assume that the same set of good processors have agreed on an input bit (to the Byzantine agreement problem), then this same protocol can bring all good processors to agreement on that bit.

A scalable implementation of the protocol in [10], following the lines of [14], would create the conditions in the assumptions of this theorem with probability 1 − O(1/ log n), in polylogarithmic time and bits per processor, with an adversary that controls less than a 1/3 − ε fraction of processors. This theorem would then yield an algorithm to solve asynchronous Byzantine agreement with probability 1 − O(1/ log n).
The protocol is introduced in Section 4 of this paper.

2 Combinatorial Lemmas

Before presenting our protocol, we discuss the properties of some combinatorial objects we shall use in it. Let [r] denote the set of integers {1, . . . , r}, and [s]^d the multisets of size d consisting of elements of [s]. Let H : [r] → [s]^d be a function assigning multisets of size d to integers. We define the intersection of a multiset A and a set B to be the number of elements of A which are in B. H is a (θ, δ) sampler if, for every set S ⊆ [s], at most a δ fraction of all inputs x have |H(x) ∩ S|/d > |S|/s + θ.

Let r = n^{c+1}. Let i ∈ [n^c] and j ∈ [n]. Then we define H(i, j) to be H(in + j), and H(i, ∗) to be the collection of multisets H(in + 1), H(in + 2), . . . , H(in + n).
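The sampler H used later exists by the probabilistic method and is not given explicitly in the paper; the seeded-PRNG stand-in below merely illustrates the interface H : [r] → [s]^d and the two-argument indexing H(i, j) = H(in + j). All names and parameters here are ours:

```python
import random

def make_H(s, d, seed=0):
    """Illustrative stand-in for a sampler H : [r] -> [s]^d: each input x is
    mapped to a multiset of d elements of [s] chosen by a PRNG keyed on x.
    (The paper's H is a fixed function shown to exist, not constructed.)"""
    def H(x):
        rng = random.Random(seed * 10**9 + x)
        return [rng.randrange(s) for _ in range(d)]  # with replacement: multiset
    return H

n, d = 100, 15
H = make_H(s=n, d=d)

# Two-argument indexing from the text: H(i, j) = H(i*n + j), and H(i, *) is
# the collection H(i*n + 1), ..., H(i*n + n) -- one quorum per processor ID j.
H2 = lambda i, j: H(i * n + j)
quorums_for_i = [H2(7, j) for j in range(1, n + 1)]
print(len(quorums_for_i), len(quorums_for_i[0]))   # 100 quorums of size 15
```

Keying the PRNG on the input makes the stand-in a genuine function: repeated evaluations of H(x) return the same multiset.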

208

V. King et al.

Lemma 1 ([9, Lemma 4.7], [18, Proposition 2.20]). For every s, θ, δ > 0 and r ≥ s/δ, there is a (θ, δ) sampler H : [r] → [s]^d with d = O(log(1/δ)/θ²).

A corollary of the proof of this lemma shows that if one increases the constant in the expression for d by a factor of c, we get the following:

Corollary 1. Let H : [r] → [s]^d be constructed by randomly selecting, for each input, d elements of [s] with replacement. For every s, θ, δ, c > 0 and r ≥ s/δ, for d = O(log(1/δ)/θ²), H is a (θ, δ) sampler with probability 1 − 1/n^c.

Lemma 2. Let r = n^{c+1} and s = n. Let H : [r] → [s]^d be constructed by randomly selecting with replacement d elements of [s]. Call an element y ∈ [s] overloaded by H(i, ∗) if its inverse image under H(i, ∗) contains more than a·d elements, for some fixed constant a ≥ 6. The probability that any y ∈ [s] is overloaded by any H(i, ∗) is less than 1/2, for d = O(log n) and a = O(1).

Proof. Fix i. The probability that the size of the inverse image of y ∈ [s] under H(i, ∗) is more than a times its expected size d is less than 2^{−ad}, for a ≥ 6, by a standard Chernoff bound. The probability that for any i any y ∈ [s] is overloaded is less than n · n^c · 2^{−ad} < 1/2, by a union bound over all y ∈ [s] and all i, for d = O(log n).

Let S be any subset of [n]. A quorum or poll list is a subset of [n] of size O(log n), and a good quorum (resp., poll list) with respect to S ⊆ [n] is a quorum (resp., poll list) with more than half of its elements in S. Taking the union bound over the probabilities of the events given in the preceding corollary and lemma, and applying the probabilistic method, yields the existence of a mapping with the desired properties:

Lemma 3. For any constant c, there exists a mapping H : [n^{c+1}] → [n]^d such that for every i the inverse image of every element under H(i, ∗) has size O(log n), and for any choice of subset S ⊆ [n] of size at least (1/2 + ε)n, with probability 1 − 1/n^c over the choice of a random i ∈ [n^c], H(i, ∗) contains all good quorums.
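Lemma 3's guarantee can be probed empirically by building H at random, as in Corollary 1, with a seeded PRNG: fix a good set S of size (1/2 + ε)n and check, for random i, whether every quorum in H(i, ∗) has a majority in S. The parameters below are ours, chosen only to make the margin visible at small n:

```python
import random

def quorum(i, j, n, d, seed=0):
    """PRNG stand-in for H(i, j): the quorum of processor j in H(i, *)."""
    rng = random.Random(seed * 10**9 + i * n + j)
    return [rng.randrange(n) for _ in range(d)]

def all_quorums_good(i, good_set, n, d):
    """True iff every quorum in H(i, *) has a strict majority inside good_set."""
    return all(
        2 * sum(m in good_set for m in quorum(i, j, n, d)) > d
        for j in range(1, n + 1)
    )

n, d = 200, 61                  # d = O(log n); exaggerated for a clear margin
good = set(range(150))          # a 3/4 fraction of good processors (eps = 1/4)
hits = sum(all_quorums_good(i, good, n, d) for i in range(50))
print(hits, "of 50 random i give an all-good collection")
```

With these toy parameters the vast majority of choices of i yield a collection consisting entirely of good quorums, matching the 1 − 1/n^c flavor of the lemma.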
The following lemma is needed to show results about the poll lists, which are subsets of size O(log n) just like quorums, but are used for a different purpose in the protocol.

Lemma 4. There exists a mapping J : [n^{c+1}] → [n]^d such that for any set of a 1/2 + ε fraction of good processors in [1 . . . n]:
1. At least n^{c+1} − n² elements of [n^{c+1}] are mapped to a good poll list.
2. For any L' ⊆ [n^{c+1}] with |L'| ≤ n, let R' be any subset of [n] with |R'| ≤ ε|L'|/e². Then Σ_{x∈L'} |J(x) ∩ R'| < εd|L'|/2. Hence at least |L'|/2 of the poll lists {J(x) : x ∈ L'} contain fewer than εd elements of R'.

Proof. Part 1: That a randomly constructed J has this property with probability greater than 1/2 follows from Lemma 3. Part 2: Let J be constructed randomly as in the previous proofs. Fix L' and fix R'.

Load Balanced Scalable Byzantine Agreement

209

Pr[ Σ_{x∈L'} |J(x) ∩ R'| ≥ εd|L'| ] ≤ C(d|L'|, εd|L'|) · (|R'|/n)^{εd|L'|} ≤ [(e/ε)(|R'|/n)]^{εd|L'|} ≤ e^{−εd|L'|}, for |R'| ≤ εn/e².

The number of ways of choosing subsets of size x and y from [n^c] and [n], respectively, is bounded above by (en^c/x)^x · (en/y)^y = e^{x(c ln n − ln x + 1) + y(ln n − ln y + 1)} < e^{2|L'|c ln n}. The union bound over all sizes x ≤ n and y is less than 1/2 for d > (2c log n)/ε + 1/|L'|. Hence, with probability less than 1/2, Σ_{x∈L'} |J(x) ∩ R'| > εd|L'| for some subset L' of size n or less in [n^{c+1}] and some subset R' of size at most ε|L'|/e².

Finally, by the union bound, a random J has both properties (1) and (2) with probability greater than 0. By the probabilistic method, there exists a function J with properties (1) and (2).

2.1 Using the Almost-Everywhere Agreement Protocol in [13,14]

We observe that this protocol, which uses polylogarithmic bits of communication, generates a representative set S of O(log n) processors which is agreed upon by all but an O(1/ log n) fraction of good processors, and any message agreed upon by the processors in S is learned by all but an O(1/ log n) fraction of good processors. Hence in the current work we start from the point where there is a b log n-bit string globalstr agreed upon by all but an O(1/ log n) fraction of good processors, such that a 2/3 + ε fraction of the good processors in S have each generated c'/b random bits of it (see below), and the remaining bits are generated by bad processors after seeing the bits of the good processors. The ordering of the bits is independent of their value and is given by processor ID. globalstr is random enough:

Lemma 5. With probability at least 1 − 1/n^c, for a sufficiently large constant c' and d = O(log n), there is an H : [n^{c'+1}] → [n]^d such that H(globalstr, ∗) is a collection of all good quorums.

Proof. By Lemma 3 there are n^{c'} good choices for globalstr and n bad choices. We choose c' to be a multiple of b which is greater than (3/2)c. Fix one bad choice of string. The probability of the random bits matching this string is less than 2^{−(2/3)c' log n}, and by a union bound, the probability of matching any of the n bad strings is less than 1/n^c.
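The union-bound arithmetic in this proof can be checked numerically; the concrete values below (n = 1024, c' = 6) are ours, chosen purely for illustration:

```python
import math

def bad_match_bound(n, c_prime):
    """n times the per-string probability 2^(-(2/3) c' log2 n) that the
    random bits of globalstr match one fixed bad string (illustrative)."""
    per_string = 2.0 ** (-(2.0 / 3.0) * c_prime * math.log2(n))
    return n * per_string

# With c' = 6 the bound is n * n^(-4) = n^(-3): comfortably o(1).
print(bad_match_bound(1024, 6))
```

For n = 1024 this evaluates to about 9.3 × 10⁻¹⁰, i.e., n⁻³, which illustrates why n bad strings cannot absorb the random bits except with polynomially small probability.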

3 Algorithm

In this section, we describe the protocol (Protocol 3.1) that reaches everywhere agreement from almost-everywhere agreement.

3.1 Description of the Algorithm

Precondition: Each processor p starts with a hypothesis of the global string, candstr_p; this hypothesis may or may not equal globalstr. However, we make a critical assumption that at least a 1/2 + γ fraction of processors are good and knowledgeable, i.e., their candstr equals globalstr. Actually, we can ensure that a 2/3 + ε − O(1/ log n) fraction of processors are good and knowledgeable using the almost-everywhere protocol from [13,14], but we need only a 1/2 + γ fraction for our protocol to work.

Given: Functions H (as described in Lemma 3) and J (as described in Lemma 4).

Part I: Setting up candidate lists.
1: for each processor p do
2:   Select uniformly at random a subset samplelist_p of processor IDs, where |samplelist_p| = c√n log n.
3:   p.send(samplelist_p, <candstr_p>).
4:   Set candlist_p ← {candstr_p}.
5:   For each processor r that sent <candstr_r> to p, add candstr_r to candlist_p with probability 1/√n.

Part II: Setting up requests through quorums.
1: for each processor p do
2:   p generates a random string rstr_p.
3:   For each candidate string s ∈ candlist_p, p.send(H(s, p), <rstr_p>).
4:   Let polllist_p ← J(rstr_p, p).
5:   if processor z ∈ H(candstr_z, p) and z.accept(p, <rstr_p>) then
6:     for each processor y ∈ polllist_p do
7:       z.send(H(candstr_z, y), <p → y>)
8: for processor t ∈ H(candstr_t, y) for any processor y do
9:   Requests_t(y) = {<p → y> | received from p's quorum H(candstr_t, p)}

Part III: Propagating globalstr to every processor.
1: for log n rounds in parallel do
2:   if 0 < |Requests_t(y)| < c' log n then
3:     for <p → y> ∈ Requests_t(y) do
4:       t.send(y, <p → y>)
5:     set Requests_t(y) ← ∅.
6:   if y.accept(H(candstr_y, y), <p → y>) then
7:     y.send(p, <candstr_y>)
8:     y.send(H(candstr_y, p), <candstr_y>)
9:   when, for processor p, the count of processors in polllist_p sending candidate string s over all rounds reaches a majority: set candstr_p ← s.
10:  if, for processor z ∈ H(candstr_z, p), the count of processors in polllist_p sending string s over all rounds reaches a majority then
11:    for each processor y ∈ polllist_p such that y did not yet respond do
12:      z.send(H(candstr_z, y), <Abort, p>)
13:  if t ∈ H(candstr_t, y) and t.accept(H(candstr_t, p), <Abort, p>) then
14:    <p → y> is removed from Requests_t(y).

Protocol 3.1. Load balanced almost everywhere to everywhere

Let candlist_p be the list of candidate strings that p collects during the algorithm. Further, we call H(candstr_q, p) a quorum of p (or p's quorum) according to q. If a processor p is sending to a quorum for x, then this is assumed to mean the quorum according to p, unless otherwise stated. Similarly, if t is sending conditional on its being in a particular quorum, then we mean this quorum according to t. We often denote a message within angle brackets; in particular, <p → y> is the message that p has requested information from y. We call a quorum a legitimate quorum of p if it is generated by globalstr, i.e., H(globalstr, p). We also define the following primitives:

v.send(X, m): Processor v sends message m to all processors in the set X.

v.accept(X, m): Processor v accepts the message m if it is received from a majority of the processors in the set X (which could be a singleton set); otherwise it rejects it.

Rejection of excess: Every processor rejects messages received in excess of the number of messages dictated by the protocol in that round or stage of the execution.

We assume each processor knows H and J. The key to achieving reliable communication channels through quorums is to use globalstr. To begin, each processor p sends its candidate string candstr_p directly to c√n log n randomly selected processors (samplelist_p). It then generates its own list candlist_p of candidates for globalstr, including candstr_p and every received string with probability 1/√n. This ensures that w.h.p. p has at least one copy of globalstr in its list.

The key to everywhere agreement is to be able to poll enough processors reliably so as to be able to learn globalstr. Part II sets up these polling requests. Each processor p generates a random string rstr_p, which is used by both p and its quorums to generate p's poll list polllist_p via the function J. All the processors in the poll list are then contacted by p for their candidate string. In lines 2 and 3, p determines its quorum for each of the strings in its candlist_p and sends rstr_p to the processors in those quorums.
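The three primitives above can be sketched over a toy message transport; the class, method names, and the per-sender cap below are ours, intended only to make the semantics concrete:

```python
from collections import defaultdict

class Processor:
    """Illustrative sketch (not the paper's code) of send-to-a-set,
    accept-on-majority, and rejection of excess."""
    def __init__(self, pid, limit):
        self.pid = pid
        self.inbox = defaultdict(list)    # message -> list of senders
        self.received = defaultdict(int)  # sender -> count, for excess rejection
        self.limit = limit                # per-sender cap dictated by the protocol

    def receive(self, sender, m):
        self.received[sender] += 1
        if self.received[sender] > self.limit:
            return                        # rejection of excess: drop silently
        self.inbox[m].append(sender)

    def accept(self, X, m):
        """Accept m only if a majority of the set X sent it."""
        senders = set(self.inbox[m]) & set(X)
        return 2 * len(senders) > len(X)

def send(sender, targets, m):
    for t in targets:
        t.receive(sender.pid, m)

# Toy: a 5-member quorum; 3 good members send "s", 2 bad members flood a forgery.
procs = [Processor(i, limit=1) for i in range(6)]
p, quorum_ids = procs[5], [0, 1, 2, 3, 4]
for g in procs[:3]:
    send(g, [p], "s")
for b in procs[3:5]:
    for _ in range(100):                  # flooding; all but one copy is rejected
        send(b, [p], "forged")
print(p.accept(quorum_ids, "s"), p.accept(quorum_ids, "forged"))
```

The flood of forgeries is capped by rejection of excess, and the forged message still fails the majority test, while the honestly forwarded string passes.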
To prevent the adversary from targeting groups of processors, the quorums do not accept the poll list itself but rather the random string, and then generate the poll list themselves. The important thing to note here is that even if p sent a message to its quorum, the processors in the quorum will not accept the message unless, according to their own candidate string, they are in p's quorum. It is also important to note that w.h.p. at least one of these quorums is a legitimate quorum. Since p sends to at least one legitimate quorum, and the processors in this quorum will accept p's request, this request will be forwarded. p's quorum in turn contacts processor y's quorum for each y in p's poll list. The processors in y's legitimate quorum gather all the requests meant for y in preparation for the next part of the protocol.

Part III proceeds in log n rounds. The processors in y's quorum only forward the received requests if they number fewer than c' log n, for some fixed constant c'. This prevents any processor from being overloaded. Once y accepts the requests (in accordance with y.accept), it sends its candidate string directly to p and also to p's quorum. When p gets the same string from a majority of processors in its poll list, it sets its own candidate string to this string. This new string w.h.p. is globalstr. There may be overloaded processors which have not yet answered p's requests. To release the congestion, p will send abort messages to these quorums, which then take p's request off their lists. In each round, the number of unsatisfied processors falls by at least half, so that no more than log n rounds are needed. In this way, w.h.p. each processor decides globalstr.

3.2 Proof of Correctness

The conditions for the correctness of the protocol given in Protocol 3.1 are stated as Lemma 10. To prove that, we first show the following supporting lemmas.

Lemma 6. W.h.p., at least one string in the candlist_p of processor p is globalstr.

Proof. The proof follows from the birthday paradox: if there are n possible birthdays and O(√n) children, two will likely share a birthday. Adding an O(log n) factor boosts the probability so that this happens for all n processors w.h.p.

Lemma 7. For processor p and its random string rstr_p, a majority of the processors y in polllist_p are good and knowledgeable, and they receive the request <p → y>.

Proof. The poll list for processor p, polllist_p, is generated by the sampler J using p's random string rstr_p and p's ID. By Lemma 4, a majority of polllist_p is good and knowledgeable. From Lemmas 5 and 6, processor p will send its message for its poll lists to at least one legitimate quorum. Since a majority of these are good and knowledgeable, they will forward the message <p → y> for each processor y ∈ polllist_p = J(rstr_p, p) to at least one legitimate quorum of y. By Lemma 9, y shall accept the message.

Observation 1. The messages sent by bad processors, or by good but not knowledgeable processors (having candstr ≠ globalstr), do not affect the outcome of the protocol.

Proof. All communication in Parts II and III is verified by a processor against its quorums or poll list. Any communication received through a quorum or poll list is influential only if a majority of the processors in it have sent it (either via the accept primitive or by counting total messages received). By Lemmas 6 and 7, a majority of these lists are good and knowledgeable.

Lemma 8. In the protocol, any processor sends no more than Õ(√n) bits.

Proof. Consider a good and knowledgeable processor p. In Part I, line 3, p sends c√n log n messages.
For Part II of the algorithm, consider that p is in the quorum of a processor z; p forwards O(log² n) messages to the quorums of z's poll list. In Part III, p forwards only O(log n) requests to z. The cost of aborting is no more than the cost of sending. In addition, z answers no more than the number of requests that its quorum forwards. By the rejection-of-excess primitive, no extra messages are sent. Thus, p sends at most Õ(√n) bits over a run of the whole protocol.
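The birthday-paradox sampling behind Lemma 6 can be checked by a small Monte-Carlo simulation; the function, the constant c = 4, and the trial counts below are ours, purely illustrative:

```python
import math, random

def candlist_has_globalstr(n, c=4, seed=1):
    """Monte-Carlo sketch of Part I from one processor p's point of view:
    about n/2 knowledgeable processors each send globalstr to c*sqrt(n)*log n
    random targets; p keeps each copy it receives with probability 1/sqrt(n).
    Returns True if at least one copy of globalstr lands in candlist_p."""
    rng = random.Random(seed)
    samples = int(c * math.sqrt(n) * math.log(n))
    received = sum(
        rng.random() < 1 / n                 # a given target equals p w.p. 1/n
        for _ in range((n // 2) * samples)   # all sends by knowledgeable nodes
    )
    kept = sum(rng.random() < 1 / math.sqrt(n) for _ in range(received))
    return kept >= 1

trials = sum(candlist_has_globalstr(256, seed=s) for s in range(20))
print(trials, "of 20 trials keep globalstr in candlist_p")
```

Even at n = 256, essentially every trial ends with globalstr in candlist_p: p expects roughly (c/2)√n log n incoming copies, and thinning by 1/√n still leaves Θ(log n) of them.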


Lemma 9. By the end of Part III, for each p, a majority of p's poll list have received p's request to respond from their legitimate quorums.

Proof. Quorums will forward requests provided their processors are not overloaded. We show by induction that if in round i there were x processors making requests to overloaded processors, then there are no more than x/2 such processors in round i + 1; thus after log n rounds there are no overloaded processors, and every processor will answer its requests. Refer to Lemma 4: let R_i be the set of overloaded processors in round i (those that have more than (4/ε)d requests). Consider the set L_i of processors which made these requests; |L_i| ≥ (8/ε)|R_i|. By part 2 of the lemma, half the processors in L_i contain less than an ε fraction of their poll lists in R_i, and their requests will be satisfied in the current round by a majority of good processors. Thus, there are no more than |L_i|/2 such processors making requests to processors in R_i, and hence to overloaded processors, in round i + 1.

Lemma 10. Let n be the number of processors in a synchronous full information message passing model with a nonadaptive, rushing, malicious adversary which controls less than a 1/3 − ε fraction of processors, and suppose more than a 1/2 + γ fraction of processors are good and knowledgeable. For any positive constants ε, γ there exists a protocol such that w.h.p.: 1) at the end of the protocol, each good processor is also knowledgeable; 2) the protocol takes no more than O(log n) rounds in parallel, using no more than Õ(√n) messages per processor.

Proof. Part 1 follows from Lemma 7 and Observation 1: processor p hears back from its poll list and becomes knowledgeable. Part 2 follows directly from Lemma 9 (the protocol completes in O(log n) rounds) and Lemma 8.

4 Asynchronous Version

The asynchronous protocol for Byzantine agreement relies on globalstr being generated by a scalable version of [10]. Such a string would have a reduced constant fraction of random bits, but there would still be sufficient randomness to guarantee the properties needed. Note that the reduction in the fraction of random bits can be compensated for by increasing the length of the string in the proof of Lemma 5.

The asynchronous protocol that brings all processors to agreement on globalstr can be constructed from the synchronous protocol by using the primitive asynch_accept instead of accept and by changes to Part III. The primitive v.asynch_accept(X, m) is defined as follows: processor v waits until |X|/2 + 1 messages which agree on m are received, and then takes their value. In Part III, since there are no rounds, there is instead an end-of-round signal for each "round", which is determined when enough processors have decided. The quorums are organized in a tree structure which allows them to simulate the synchronous rounds by explicitly counting the number of processors that become knowledgeable. The round number is determined by the count of quorums which have received n/2 + 1 answers to the requests of their processor. The quorum of a processor monitors the number of requests received and only forwards the requests to the processor when the current number of requests received in a round is sufficiently small. The asynchronous protocol incurs an additional overhead of a log n factor in the number of messages.
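The waiting behavior of asynch_accept can be sketched over a toy queue-based transport; the queue, the feeder thread, and all names are ours (the paper assumes authenticated point-to-point channels, not this transport):

```python
import queue, threading

def asynch_accept(X, inbox, timeout=2.0):
    """Sketch of v.asynch_accept(X, m): block until |X|/2 + 1 distinct
    senders in X are seen to agree on some value m, then return m."""
    seen = {}
    need = len(X) // 2 + 1
    while True:
        sender, m = inbox.get(timeout=timeout)
        if sender in X:
            seen.setdefault(m, set()).add(sender)
            if len(seen[m]) >= need:
                return m

# Toy: a 7-member quorum; the adversary's messages arrive first, but the
# primitive keeps waiting until 4 = |X|/2 + 1 senders agree on one value.
X = list(range(7))
inbox = queue.Queue()
def feeder():
    for sender, m in [(6, "bad"), (5, "bad"), (0, "s"), (1, "s"), (2, "s"), (3, "s")]:
        inbox.put((sender, m))
threading.Thread(target=feeder).start()
print(asynch_accept(X, inbox))   # prints s
```

Unlike the synchronous accept, nothing here depends on message timing: the adversary may schedule deliveries arbitrarily, and the primitive simply waits until |X|/2 + 1 senders agree.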

References

1. Aspnes, J., Shah, G.: Skip graphs. In: SODA, pp. 384–393 (2003)
2. Attiya, H., Welch, J.: Distributed Computing: Fundamentals, Simulations and Advanced Topics. John Wiley & Sons, Chichester (2004)
3. Awerbuch, B., Scheideler, C.: Provably secure distributed name service. In: Albers, S., Marchetti-Spaccamela, A., Matias, Y., Nikoletseas, S., Thomas, W. (eds.) ICALP 2009. LNCS, vol. 5556. Springer, Heidelberg (2009)
4. Awerbuch, B., Scheideler, C.: Robust distributed name service. In: Voelker, G.M., Shenker, S. (eds.) IPTPS 2004. LNCS, vol. 3279, pp. 237–249. Springer, Heidelberg (2005)
5. Awerbuch, B., Scheideler, C.: Towards a scalable and robust DHT. In: SPAA, pp. 318–327 (2006)
6. Awerbuch, B., Scheideler, C.: Towards a scalable and robust DHT. Theory Comput. Syst. 45(2), 234–260 (2009)
7. Dwork, C., Peleg, D., Pippenger, N., Upfal, E.: Fault tolerance in networks of bounded degree. In: STOC, pp. 370–379 (1986)
8. Fiat, A., Saia, J., Young, M.: Making Chord robust to Byzantine attacks. In: Brodal, G.S., Leonardi, S. (eds.) ESA 2005. LNCS, vol. 3669, pp. 803–814. Springer, Heidelberg (2005)
9. Gradwohl, R., Vadhan, S.P., Zuckerman, D.: Random selection with an adversarial majority. In: Dwork, C. (ed.) CRYPTO 2006. LNCS, vol. 4117, pp. 409–426. Springer, Heidelberg (2006)
10. Kapron, B.M., Kempe, D., King, V., Saia, J., Sanwalani, V.: Fast asynchronous Byzantine agreement and leader election with full information. In: SODA, pp. 1038–1047 (2008)
11. King, V., Saia, J.: From almost everywhere to everywhere: Byzantine agreement with Õ(n^{3/2}) bits. In: Keidar, I. (ed.) DISC 2009. LNCS, vol. 5805, pp. 464–478. Springer, Heidelberg (2009)
12. King, V., Saia, J.: Breaking the O(n²) bit barrier: Scalable Byzantine agreement with an adaptive adversary. In: PODC, pp. 420–429 (2010)
13. King, V., Saia, J., Sanwalani, V., Vee, E.: Scalable leader election. In: SODA, pp. 990–999 (2006)
14. King, V., Saia, J., Sanwalani, V., Vee, E.: Towards secure and scalable computation in peer-to-peer networks. In: FOCS, pp. 87–98 (2006)
15. Scheideler, C.: How to spread adversarial nodes? Rotate! In: STOC, pp. 704–713 (2005)
16. Upfal, E.: Tolerating a linear number of faults in networks of bounded degree. In: PODC, pp. 83–89 (1992)
17. Young, M., Kate, A., Goldberg, I., Karsten, M.: Practical robust communication in DHTs tolerating a Byzantine adversary. In: ICDCS, pp. 263–272. IEEE, Los Alamitos (2010)
18. Zuckerman, D.: Randomness-optimal oblivious sampling. Random Struct. Algorithms 11(4), 345–367 (1997)

A Necessary and Sufficient Synchrony Condition for Solving Byzantine Consensus in Symmetric Networks

Olivier Baldellon¹, Achour Mostéfaoui², and Michel Raynal²

¹ LAAS-CNRS, 31077 Toulouse, France
² IRISA, Université de Rennes 1, 35042 Rennes, France
[email protected], {achour,raynal}@irisa.fr

Abstract. Solving the consensus problem requires in one way or another that the underlying system satisfies synchrony assumptions. Considering a system of n processes where up to t < n/3 may commit Byzantine failures, this paper investigates the synchrony assumptions that are required to solve consensus. It presents a corresponding necessary and sufficient condition. Such a condition is formulated with the notions of a symmetric synchrony property and property ambiguity. A symmetric synchrony property is a set of graphs, where each graph corresponds to a set of bi-directional eventually synchronous links among correct processes. Intuitively, a property is ambiguous if it contains a graph whose connected components are such that it is impossible to distinguish a connected component that contains correct processes only from a connected component that contains faulty processes only. The paper then connects the notion of a symmetric synchrony property with the notion of an eventual bi-source, and shows that the existence of a virtual ◇[t + 1]bi-source is a necessary and sufficient condition for solving consensus in the presence of up to t Byzantine processes in systems with bi-directional links and message authentication. Finding necessary and sufficient synchrony conditions when links are timely in one direction only, or when processes cannot sign messages, remains an open (and very challenging) problem.

Keywords: Asynchronous message system, Byzantine consensus, Eventually synchronous link, Lower bound, Signature, Symmetric synchrony property.

1 Introduction Byzantine consensus. A process has a Byzantine behavior when it behaves arbitrarily [15]. This bad behavior can be intentional (malicious behavior, e.g., due to intrusion) or simply the result of a transient fault that altered the local state of a process, thereby modifying its behavior in an unpredictable way. We are interested here in the consensus problem in distributed systems prone to Byzantine process failures, whatever their origin. Consensus is an agreement problem in which each process first proposes a value and then decides on a value [15]. In a Byzantine failure context, the consensus problem is defined by the following properties: every non-faulty process decides a value (termination), no two non-faulty processes decide different values (agreement), and if all non-faulty processes propose the same M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 215–226, 2011. © Springer-Verlag Berlin Heidelberg 2011

216

O. Baldellon, A. Mostéfaoui, and M. Raynal

value, that value is decided (validity). (See [14] for a short introduction to Byzantine consensus.) Aim of the paper. A synchronous distributed system is characterized by the fact that both processes and communication links are synchronous (or timely) [2,13,16]. This means that there are known bounds on process speed and message transfer delays. Let t denote the maximum number of processes that can be faulty in a system made up of n processes. In a synchronous system, consensus can be solved (a) for any value of t (i.e., t < n) in the crash failure model, (b) for t < n/2 in the general omission failure model, and (c) for t < n/3 in the Byzantine failure model [12,15]. Moreover, these bounds are tight. In contrast, when all links are asynchronous (i.e., when there is no bound on message transfer delays), it is impossible to solve consensus even if we consider the weakest failure model (namely, the process crash failure model) and assume that at most one process may be faulty (i.e., t = 1) [7]. It trivially follows that Byzantine consensus is impossible to solve in an asynchronous distributed system. As Byzantine consensus can be solved in a synchronous system and cannot be solved in an asynchronous system, a natural question comes to mind: "When considering the synchrony-to-asynchrony axis, what is the weakest synchrony assumption that allows Byzantine consensus to be solved?" This is the question addressed in the paper. To that end, the paper considers the synchrony captured by the structure and the number of eventually synchronous links among correct processes. Related work. Several approaches to solve Byzantine consensus have been proposed. We consider here only deterministic approaches. One consists in enriching the asynchronous system (hence, the system is no longer fully asynchronous) with a failure detector, namely, a device that provides processes with hints on failures [4].
Basically, in one way or another, a failure detector encapsulates synchrony assumptions. Failure detectors suited to Byzantine behavior have been proposed and used to solve Byzantine consensus (e.g., [3,8,9]). Another approach proposed to solve Byzantine consensus consists in directly assuming that some links satisfy a synchrony property ("directly" means that the synchrony property is not hidden inside a failure detector abstraction). This approach relies on the notion of a ◇[x+1]bi-source (read "◇" as "eventual") that has been introduced in [1]. Intuitively, this notion states that there is a correct process that has x bi-directional input/output links with other correct processes and that these links eventually behave synchronously [5,6]. (Our definition of a ◇[x+1]bi-source is slightly different from the original definition introduced in [1]. The main difference is that it considers only eventually synchronous links connecting correct processes. It is precisely defined in Section 6.¹) Considering asynchronous systems with Byzantine processes without message authentication, it is shown in [1] that Byzantine consensus can be solved if the system has

¹ We consider eventually synchronous links connecting correct processes only because, due to Byzantine behavior, a synchronous link connecting a correct process and a Byzantine process can always appear to the correct process as being an asynchronous link.


a ◇[n−t]bi-source (all other links being possibly fully asynchronous). Moreover, the ◇[n−t]bi-source can never be explicitly known. This result has been refined in [11], where a Byzantine consensus algorithm is presented for an asynchronous system that has a ◇[2t+1]bi-source. Considering systems with message authentication, a Byzantine consensus algorithm is presented in [10] that requires only a ◇[t+1]bi-source. As for Byzantine consensus in synchronous systems, all these algorithms assume t < n/3.

Fig. 1. The proof of the necessary and sufficient condition (Theorem 3): ∃ G ∈ S with no virtual ◇[t+1]bi-source ⟹ (Lemma 3, Section 6) ∃ G ∈ S and ∀ C ∈ G: |C| ≤ t ⟹ (Theorem 2, Section 5) synchrony property S is ambiguous ⟹ (Theorem 1, Section 4) consensus cannot be solved with property S ⟹ (contrapositive of Ref. [10]) ∃ G ∈ S with no virtual ◇[t+1]bi-source

Content of the paper. The contribution of the paper is the definition of a symmetric synchrony property that is necessary and sufficient to solve Byzantine consensus in asynchronous systems with message authentication. From a concrete point of view, this property is the existence of what we call a virtual ◇[t+1]bi-source. A symmetric synchrony property S is a set of communication graphs such that (a) each graph specifies a set of eventually synchronous bi-directional links connecting correct processes and (b) this set of graphs satisfies specific additional properties that give S a particular structure. A synchrony property may or may not be ambiguous. Intuitively, it is ambiguous if it contains a graph whose connected components are such that there are executions in which it is impossible to distinguish a component with correct processes only from a connected component with faulty processes only. (These notions are formally defined in the paper.) A synchrony property S for a system of n processes where at most t processes may be faulty is called an (n, t)-synchrony property. The paper first shows that, assuming a property S, it is impossible to solve consensus if S is ambiguous. It is then shown that, if consensus can be solved when the actual communication graph is any graph of S (we then say "S is satisfied"), then any graph of S has at least one connected component whose size is at least t + 1. The paper then relates the ambiguity of an (n, t)-synchrony property S with the size x of a virtual ◇[x]bi-source. These results are schematically represented in Figure 1, from which it follows that a synchrony property S allows Byzantine consensus to be solved despite up to t Byzantine processes in a system with message authentication if and only if S is not ambiguous. Road map. The paper is made up of 7 sections. Section 2 presents the underlying asynchronous Byzantine computation model.
Section 3 defines the notion of a synchrony property S and the associated notion of ambiguity. As already indicated, a synchrony


property bears on the structure of the eventually synchronous links connecting correct processes. Then, Section 4 shows that an ambiguous synchrony property S does not allow consensus to be solved (Theorem 1). Section 5 relates the size of the connected components of the graphs of an (n, t)-synchrony property S with the ambiguity of S (Theorem 2). Section 6 establishes the main result of the paper, namely, a necessary and sufficient condition for solving Byzantine consensus in systems with message authentication. Finally, Section 7 concludes the paper.

2 Computation Model Processes. The system is made up of a finite set Π = {p1, ..., pn} of n > 1 processes that communicate by exchanging messages through a communication network. Processes are assumed to be synchronous in the sense that local computation times are negligible with respect to message transfer delays; local processing times are considered to be equal to 0. Failure model. Up to t < n/3 processes can exhibit a Byzantine behavior. A Byzantine process is a process that behaves arbitrarily: it can crash, fail to send or receive messages, send arbitrary messages, start in an arbitrary state, perform arbitrary state transitions, etc. Moreover, Byzantine processes can collude to "pollute" the computation. Yet, it is assumed that they do not control the network: they cannot corrupt the messages sent by non-Byzantine processes, and the schedule of message delivery is uncorrelated to Byzantine behavior. A process that exhibits a Byzantine behavior is called faulty. Otherwise, it is correct or non-faulty. Communication network. Each pair of processes pi and pj is connected by a reliable bi-directional link denoted (pi, pj). This means that, when a process receives a message, it knows which process sent it. A link can be fully asynchronous or eventually synchronous. The bi-directional link connecting a pair of processes pi and pj is eventually synchronous if there is a finite (but unknown) time τ after which there is an upper bound on the time that elapses between the sending and the reception of a message sent on that link (hence, an eventually synchronous link is eventually synchronous in both directions). If such a bound does not exist, the link is fully asynchronous. If τ = 0 and the bound is known, then the link is synchronous. Message authentication. When the system provides the processes with message authentication, the only misbehaviors available to a Byzantine process are failing to relay messages and sending bad messages of its own.
When it forwards a message received from another process, it cannot alter its content. Notation. Given a set of processes that defines which are the correct processes, let H ⊆ Π × Π denote the set of eventually synchronous bi-directional links connecting these correct processes. (This means that this communication graph has no edge incident to a faulty process; moreover, it is possible that some pairs of correct processes are not connected by an eventually synchronous bi-directional link.)
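The link taxonomy just given can be illustrated with a toy delay model (a sketch only; the parameters tau and delta and the exponential pre-stabilization delay are illustrative assumptions, not part of the model above):

```python
import random

def deliver_time(send_time, tau, delta, rng):
    """Delivery time of a message on an eventually synchronous link:
    for messages sent before the (unknown) stabilization time tau the
    delay is finite but unbounded; afterwards it is bounded by delta.
    A synchronous link is the special case tau = 0 with delta known;
    a fully asynchronous link is the case tau = infinity."""
    if send_time < tau:
        # asynchronous phase: an arbitrarily large (but finite) delay
        return send_time + rng.expovariate(0.01)
    return send_time + rng.uniform(0.0, delta)

rng = random.Random(42)
tau, delta = 100.0, 5.0
t_after = deliver_time(200.0, tau, delta, rng)  # sent after stabilization
assert 200.0 <= t_after <= 200.0 + delta
```

Note that τ is unknown to the processes, so an algorithm can never test "has the link stabilized?"; it can only rely on the bound holding eventually.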


Given a set of correct processes and an associated graph H as defined above, the previous system model is denoted AS_{n,t}[H]. More generally, let S = {H1, ..., Hℓ} be a set of sets of eventually synchronous bi-directional links connecting correct processes. AS_{n,t}[S] denotes the system model (set of runs, see Section 3.2) in which the correct processes and the eventually synchronous bi-directional links connecting them are defined by H1, or H2, etc., or Hℓ.

3 Definitions We consider only undirected graphs in the following. The aim of this section is to state a property that will be used to prove an impossibility result. Intuitively, a vertex represents a process, while an edge represents an eventually synchronous bi-directional link. Hence, the set of vertices of a graph G is Π and its set of edges is included in Π × Π. 3.1 (n, x)-Synchrony Property and Ambiguity The formal definitions given in this section will be related to the processes and links of a system in the next section. Definition 1. Let G = (Π, E) be a graph. A permutation π on Π defines a permuted graph, denoted π(G) = (Π, E′), i.e., ∀ a, b ∈ Π: (a, b) ∈ E ⇔ (π(a), π(b)) ∈ E′. All permuted graphs of G have the same structure as G; they differ only in the names of the vertices. Definition 2. Let G1 = (Π, E1) and G2 = (Π, E2). G1 is included in G2 (denoted G1 ⊆ G2) if E1 ⊆ E2. Definition 3. An (n, x)-synchrony property S is a set of graphs with n vertices such that ∀ G1 ∈ S we have: – Permutation stability. If G2 is a permuted graph of G1, then G2 ∈ S. – Inclusion stability. ∀ G2 such that G1 ⊆ G2: G2 ∈ S. – x-Resilience. ∃ G0 ∈ S such that G0 ⊆ G1 and G0 has at least x isolated vertices (an isolated vertex is a vertex without neighbors). The aim of an (n, x)-synchrony property is to capture a property on eventually synchronous bi-directional links. It is independent of process identities (permutation stability). Moreover, adding eventually synchronous links to a graph of an (n, x)-synchrony property S does not falsify it (inclusion stability). Finally, the fact that up to x processes are faulty cannot invalidate it (x-resilience). As an example, assuming n − t ≥ 3, "there are 3 eventually synchronous bi-directional links connecting correct processes" is an (n, x)-synchrony property. It includes all the graphs G of n vertices that have 3 edges and x isolated vertices plus, for every such G, all graphs obtained by adding any number of edges to G.
Given a graph G = (Π, E) and a set of vertices C ⊂ Π, G \ C denotes the graph from which edges (pi , pj ) with pi or pj ∈ C have been removed.


Definition 4. Let S be an (n, x)-synchrony property. S is ambiguous if it contains a graph G = (Π, E) whose every connected component C is such that (i) |C| ≤ x and (ii) G \ C ∈ S. Such a graph G is said to be S-ambiguous. Intuitively, an (n, x)-synchrony property S is ambiguous if it contains a graph G that satisfies the property S in all runs in which all the processes of any one of the connected components of G could be faulty (recall that at most x processes are faulty). 3.2 Algorithm and Runs Satisfying an (n, x)-Synchrony Property Definition 5. An n-process algorithm A is a set of n automata, such that a deterministic automaton is associated with each correct process. A transition of an automaton defines a step. A step corresponds to an atomic action. During a step, a correct process may send/receive a message and change its state according to its previous steps and the current state of its automaton. The steps of a faulty process can be arbitrary. Definition 6. A run of an algorithm A in AS_{n,t}[G] (a system of n processes with at most t faulty processes and for which G defines the graph of eventually timely channels among correct processes) is a triple ⟨I, R, T⟩_{t,G} where I defines the initial state of each correct process, R is a (possibly infinite) sequence of steps of A (in which at most t processes are faulty), and T is the increasing sequence of time values indicating the time instants at which the steps of R occurred. The sequence R is such that, for any message m, the reception of m occurs after its sending, the steps issued by every process occur in R in their issuing order, and, for any correct process pi, the steps of pi are as defined by its automaton. Definition 7. E_{t,G}(A) denotes the set of runs of algorithm A in AS_{n,t}[G]. Definition 8. Given an (n, x)-synchrony property S, let E_S(A) = ⋃_{t≤x, G∈S} E_{t,G}(A) (i.e., E_S(A) is the set of runs r = ⟨I, R, T⟩_{t,G} of A such that t ≤ x and G ∈ S). Recall that a synchrony property S is a set of graphs on Π. Definition 9.
An (n, x)-synchrony property S allows an algorithm A to solve the consensus problem in AS_{n,x}[S] if every run in E_S(A) satisfies the validity, agreement and termination properties that define the Byzantine consensus problem.
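Condition (i) of Definition 4 and the G \ C operation are purely graph-theoretic and can be checked mechanically. The sketch below (plain Python; vertices are 0, ..., n−1 and edges are unordered pairs, an illustrative encoding not used in the paper) implements them; condition (ii), membership of G \ C in S, depends on the particular property S and is left to the caller.

```python
def components(n, edges):
    """Connected components of an undirected graph on vertices 0..n-1."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for v in range(n):
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:  # iterative DFS over one component
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def remove_component(edges, comp):
    """G \\ C: drop every edge incident to a vertex of C (Section 3.1)."""
    return {(a, b) for a, b in edges if a not in comp and b not in comp}

def small_components_only(n, edges, x):
    """Condition (i) of Definition 4: every connected component has <= x vertices."""
    return all(len(c) <= x for c in components(n, edges))
```

For instance, on n = 6 with edges {(0,1), (2,3)}, the components are {0,1}, {2,3}, {4}, {5}, so the condition holds for x = 2, and removing the component {0,1} leaves the single edge (2,3).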

4 An Impossibility Result Given an (n, t)-synchrony property S, this section shows that there is no algorithm A that solves the consensus problem in AS_{n,t}[H] if H is an S-ambiguous graph of S. This means that the synchrony assumptions captured by S are not powerful enough to allow consensus to be solved despite up to t faulty processes. There is no algorithm A that would solve consensus for any underlying synchrony graph of an ambiguous synchrony property S.


4.1 A Set of Specific Runs This section defines the set of runs in which the connected components (as defined by the eventually synchronous communication graph H) are asynchronous with respect to one another, and in which the set of faulty processes (if any) corresponds to a single connected component. The corresponding set of runs, denoted F(A, H), will then be used to prove the impossibility result. Definition 10. Let A be an n-process algorithm and H be a graph whose n vertices are processes and whose every connected component contains at most t processes. Let F(A, H) be the set of runs of A that satisfy the following properties: – If pi and pj belong to the same connected component of H, then the bi-directional link (pi, pj) is eventually synchronous. – If pi and pj belong to the same connected component of H, then both are correct or both are faulty. – If pi and pj belong to distinct connected components of H, then, if pi is faulty, pj is correct. 4.2 An Impossibility Let S be an ambiguous (n, t)-synchrony property, A be an algorithm, and H be the graph defining the eventually synchronous links among processes. The lemma that follows states that, if H is S-ambiguous, all runs r in F(A, H) belong to E_S(A). Lemma 1. Let S be an (n, t)-synchrony property and H ∈ S. If H is S-ambiguous, then F(A, H) ⊆ E_S(A). Proof. As it is S-ambiguous, H contains only connected components with at most t processes. It follows that the set F(A, H) is well-defined. Let r ∈ F(A, H). Let C1, ..., Cm be the connected components of H. We can then define H0 = H (when no process is faulty) and, for any i with 1 ≤ i ≤ m, Hi = H \ Ci (when the set of faulty processes corresponds to Ci). If, in run r, all processes are correct, we have r ∈ E_{t,H}(A). Moreover, if there is a faulty process in run r, by definition of F(A, H), the set of faulty processes corresponds to a connected component. Let Ci be this connected component. We then have r ∈ E_{t,Hi}(A).
We have just shown that F(A, H) ⊆ ⋃_{0≤i≤m} E_{t,Hi}(A). As H is S-ambiguous, for any 1 ≤ i ≤ m we have Hi ∈ S. Moreover, due to the definition of H0 and the lemma assumption, we also have H0 = H ∈ S. Finally, as E_S(A) = ⋃_{X∈S} E_{t,X}(A), we have F(A, H) ⊆ E_S(A), which proves the lemma.

Lemma 2. Let S be an ambiguous (n, t)-synchrony property and H an S-ambiguous graph. Whatever the algorithm A, there is a run r ∈ F(A, H) that does not solve consensus. Proof. The proof is a reduction to the FLP impossibility result [7] (the impossibility of solving consensus in the presence of even one faulty process in a system in which all links are asynchronous). To that end, let us assume by contradiction that there is an algorithm


A that solves consensus among n processes p1, ..., pn despite the fact that up to t of them may be faulty, when the underlying eventually synchronous communication graph belongs to S (for example, an S-ambiguous graph H). This means that, by assumption, all runs r ∈ E_S(A) satisfy the validity, agreement and termination properties that define the Byzantine consensus problem.

Fig. 2. A reduction to the FLP impossibility result (the processes p1, ..., pn, partitioned into the connected components C1, ..., Cm, are simulated by the m processes q1, ..., qm)

Let C1, ..., Cm be the connected components of H and q1, ..., qm a set of m processes (called simulators in the following). The proof consists in constructing a simulation in which the simulators q1, ..., qm solve consensus despite the fact that they are connected by asynchronous links and one of them may be faulty, thereby contradicting the FLP result (Figure 2). To that end, each simulator qj, 1 ≤ j ≤ m, simulates the processes of the connected component Cj it is associated with. Moreover, without loss of generality, let us assume that, for every component Cj made up of correct processes, these processes propose the same value vj. Such a simulation² of the processes p1, ..., pn (executing the Byzantine consensus algorithm A) by the simulators q1, ..., qm results in a run r ∈ F(A, H) (from the point of view of the processes p1, ..., pn). As (by definition) the algorithm A is correct, the correct processes decide in run r. As (a) H ∈ S, (b) S is ambiguous, and (c) r ∈ F(A, H), it follows from Lemma 1 that r ∈ E_S(A), which means that r is a run in which the correct processes pi decide the same value v (and, if they all have proposed the very same value w, we have v = w). It follows that, by simulating the processes p1, ..., pn that execute the consensus algorithm A, the m asynchronous processes q1, ..., qm (qj proposing value vj) solve consensus despite the fact that one of them is faulty (the one associated with the faulty component Cj, if any). Hence, the simulators q1, ..., qm solve consensus despite the fact that one of them may be faulty, contradicting the FLP impossibility result, which concludes the proof of the lemma.

² The simulation, which is only sketched here, is a very classical one. A similar simulation is presented in [15], in the context of synchronous systems, to extend the impossibility of solving Byzantine consensus from a set of n = 3 synchronous processes where one (t = 1) is a Byzantine process to a set of n ≤ 3t processes. A similar simulation is also described in [16].


The following theorem is an immediate consequence of Lemmas 1 and 2. Theorem 1. No ambiguous (n, t)-synchrony property allows Byzantine consensus to be solved in a system of n processes where up to t processes can be faulty. Remark 1. Let us observe that the proof of the previous theorem does not depend on whether messages are signed. Hence, the theorem is valid for systems with or without message authentication. Remark 2. The impossibility of solving consensus despite one faulty process in an asynchronous system [7] corresponds to the case where S is the (n, 1)-synchrony property that contains the edge-less graph.

5 Relating the Size of Connected Components and Ambiguity Assuming a system with message authentication, let S be an (n, t)-synchrony property that allows consensus to be solved despite up to t Byzantine processes. This means that consensus can be solved for any eventually synchronous communication graph in S. It follows from Theorem 1 that S is not ambiguous. This section shows that, if such a synchrony property S allows consensus to be solved, then any graph of S contains at least one connected component C whose size is greater than t (|C| > t). Theorem 2. Let S be an (n, t)-synchrony property. If there is a graph G ∈ S such that none of its connected components has more than t vertices, then S is ambiguous. Proof. Let G ∈ S be such that no connected component of G has more than t vertices. It follows from the t-resilience property of S that there is a graph G′ ∈ S included in G (i.e., both have the same vertices and the edges of G′ are also edges of G) that has at least t isolated vertices. Let us observe that G′ can be decomposed into m + t connected components C1, ..., Cm, γ1, ..., γt, where each Ci contains at most t vertices and each γi contains a single vertex (top of Figure 3). Let us construct a graph G′′ as follows. G′′ is made up of the m connected components C1, ..., Cm plus another connected component, denoted Cm+1, including the t vertices γ1, ..., γt (bottom of Figure 3). Moreover, G′′ contains all edges of G′ plus

Fig. 3. Construction of the graph G′′ (top: G′, with components C1, ..., Cm and isolated vertices γ1, ..., γt; bottom: G′′, where γ1, ..., γt are grouped into the component Cm+1)


the new edges needed in order that the connected component Cm+1 be a clique (i.e., a graph in which any pair of distinct vertices is connected by an edge). As G′ ∈ S and G′ ⊆ G′′, it follows from the inclusion stability property of S that G′′ ∈ S. The rest of the proof consists in showing that G′′ is S-ambiguous (from which the ambiguity of S follows). – Let us first observe that, due to its very construction, each connected component C of G′′ contains at most t vertices. – Let us now show that, for any connected component C of G′′, we have G′′ \ C ∈ S. (Let us recall that G′′ \ C is G′′ from which all edges incident to vertices of C have been removed.) We consider two cases. • Case C = Cm+1. We then have G′′ \ C = G′. The fact that G′ ∈ S concludes the proof of the case.

Fig. 4. Using a permutation (the vertices δ1, ..., δd of Ci are exchanged with the vertices γ1, ..., γd of Cm+1)

• Case C = Ci for 1 ≤ i ≤ m. Let δ1, ..., δd be the vertices of Ci and let Gi = G′′ \ C. There is a permutation π of the vertices of Gi (exchanging δ1, ..., δd with γ1, ..., γd) such that G′ ⊆ π(Gi) (Figure 4). It then follows from the fact that S is a synchrony property (inclusion stability, then permutation stability) that π(Gi) ∈ S and consequently Gi ∈ S, which concludes the proof of the case and the proof of the theorem.

Taking the contrapositives of Theorems 1 and 2, we obtain the following corollary. Corollary 1. If an (n, t)-synchrony property S allows consensus to be solved, then any graph of S contains at least one connected component whose size is at least t + 1.

6 A Necessary and Sufficient Condition This section introduces the notion of a virtual ◇[x+1]bi-source and shows that the existence of a virtual ◇[t+1]bi-source is a necessary and sufficient condition to solve


the consensus problem in a system with message signatures and where up to t processes can commit Byzantine failures. Definition 11. A ◇[x+1]bi-source is a correct process that has an eventually synchronous bi-directional link with x correct processes (not including itself). From a structural point of view, a ◇[x+1]bi-source is a star made up of correct processes. (As already noticed, this definition differs from the usual one in the sense that it considers only correct processes.) Lemma 3. If a graph G has a connected component C of size at least x + 1, a ◇[x+1]bi-source can be built inside C. Proof. Given a graph G that represents the eventually synchronous bi-directional links connecting correct processes, let us assume that G has a connected component C such that |C| ≥ x + 1. A star (◇[x+1]bi-source) can easily be built as follows. When a process p receives a message for the first time, it forwards it to all. Recall that, as messages are signed, a faulty process cannot corrupt the content of the messages it forwards; it can only omit to forward them. Let λ be the diameter of C and δ the eventual synchrony bound for message transfer delays. This means that, when we consider any two processes p, q ∈ C, λ × δ is an eventual synchrony bound for any message communicated inside the component C. Moreover, considering any process p ∈ C, the processes of C define a star structure centered at p such that, for any q ∈ C \ {p}, there is a virtual eventually synchronous link (bound λ × δ) that is made up of eventually synchronous links and correct processes of C, which concludes the proof of the lemma.

The following definition generalizes the notion of a ◇[x+1]bi-source. Definition 12. A communication graph G has a virtual ◇[x+1]bi-source if it has a connected component C of size at least x + 1. Theorem 3. An (n, t)-synchrony property S allows consensus to be solved in an asynchronous system with message authentication, despite up to t Byzantine processes, if and only if any graph of S contains a virtual ◇[t+1]bi-source. Proof. The sufficiency side follows from the algorithm described in [10], which presents and proves correct a consensus algorithm for asynchronous systems made up of n processes where (a) up to t processes may be Byzantine, (b) messages are signed, and (c) there is a ◇[t+1]bi-source (a ◇[t+1]bi-source in our terminology is a ◇[2t]bi-source in the parlance of [1,10]). For the necessity side, let S be a synchrony property such that none of its graphs contains a virtual ◇[t+1]bi-source. It follows from the contrapositive of Corollary 1 that S does not allow Byzantine consensus to be solved.

The following corollary is an immediate consequence of the previous theorem. Corollary 2. The existence of a virtual ◇[t+1]bi-source is a necessary and sufficient condition to solve consensus (with message authentication) in the presence of up to t Byzantine processes.
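The condition of Theorem 3 is easy to test on a concrete synchrony graph. Below is a sketch (plain Python; vertices 0, ..., n−1 and an illustrative edge encoding, not notation from the paper) that looks for a connected component of size at least t + 1 and reports that component's diameter λ, the hop count behind the λ × δ relay bound of Lemma 3:

```python
from collections import deque

def _bfs_dist(adj, src):
    """Hop distances from src in an undirected graph."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def virtual_bisource(n, edges, t):
    """Definition 12: G has a virtual bi-source iff some connected component
    C satisfies |C| >= t + 1. Returns (a vertex of such a component, its
    diameter), or None if no component is large enough."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    for v in range(n):
        dist = _bfs_dist(adj, v)
        if len(dist) >= t + 1:
            comp = list(dist)
            # diameter lambda: with per-hop bound delta, any two processes
            # of C communicate within lambda * delta via forwarding (Lemma 3)
            lam = max(max(_bfs_dist(adj, u).values()) for u in comp)
            return v, lam
    return None
```

On a path 0–1–2 with a fourth isolated vertex and t = 2, the component {0, 1, 2} qualifies and its diameter is 2; with t = 3 no component is large enough.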


7 Conclusion This paper has presented a synchrony condition that is necessary and sufficient for solving consensus despite Byzantine processes in systems equipped with message authentication. This synchrony condition is symmetric in the sense that some links have to be eventually timely in both directions. Last but not least, finding necessary and sufficient synchrony conditions when links are timely in one direction only, or when processes cannot sign messages, remains an open (and very challenging) problem.

References 1. Aguilera, M.K., Delporte-Gallet, C., Fauconnier, H., Toueg, S.: Consensus with Byzantine Failures and Little System Synchrony. In: Int'l Conference on Dependable Systems and Networks (DSN 2006), pp. 147–155. IEEE Computer Press, Los Alamitos (2006) 2. Attiya, H., Welch, J.: Distributed Computing: Fundamentals, Simulations and Advanced Topics, 2nd edn., 414 pages. Wiley-Interscience, Hoboken (2004) 3. Cachin, C., Kursawe, K., Shoup, V.: Random Oracles in Constantinople: Practical Asynchronous Byzantine Agreement using Cryptography. In: Proc. 19th ACM Symposium on Principles of Distributed Computing (PODC 2000), pp. 123–132 (2000) 4. Chandra, T., Toueg, S.: Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM 43(2), 225–267 (1996) 5. Delporte-Gallet, C., Devismes, S., Fauconnier, H., Larrea, M.: Algorithms for Extracting Timeliness Graphs. In: Patt-Shamir, B., Ekim, T. (eds.) SIROCCO 2010. LNCS, vol. 6058, pp. 127–141. Springer, Heidelberg (2010) 6. Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the Presence of Partial Synchrony. Journal of the ACM 35(2), 288–323 (1988) 7. Fischer, M.J., Lynch, N.A., Paterson, M.S.: Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM 32(2), 374–382 (1985) 8. Friedman, R., Mostéfaoui, A., Raynal, M.: Pmute-Based Consensus for Asynchronous Byzantine Systems. Parallel Processing Letters 15(1-2), 162–182 (2005) 9. Friedman, R., Mostéfaoui, A., Raynal, M.: Simple and Efficient Oracle-Based Consensus Protocols for Asynchronous Byzantine Systems. IEEE Transactions on Dependable and Secure Computing 2(1), 46–56 (2005) 10. Hamouma, M., Mostéfaoui, A., Trédan, G.: Byzantine Consensus with Few Synchronous Links. In: Tovar, E., Tsigas, P., Fouchal, H. (eds.) OPODIS 2007. LNCS, vol. 4878, pp. 76–89. Springer, Heidelberg (2007) 11. Hamouma, M., Mostéfaoui, A., Trédan, G.: Byzantine Consensus in Signature-free Systems. Submitted for publication 12. Lamport, L., Shostak, R., Pease, M.: The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems 4(3), 382–401 (1982) 13. Lynch, N.A.: Distributed Algorithms, 872 pages. Morgan Kaufmann, San Francisco (1996) 14. Okun, M.: Byzantine Agreement. In: Springer Encyclopedia of Algorithms, pp. 116–119 (2008) 15. Pease, M., Shostak, R., Lamport, L.: Reaching Agreement in the Presence of Faults. Journal of the ACM 27, 228–234 (1980) 16. Raynal, M.: Fault-tolerant Agreement in Synchronous Message-passing Systems. Morgan & Claypool, 167 pages (2010)

GoDisco: Selective Gossip Based Dissemination of Information in Social Community Based Overlays Anwitaman Datta and Rajesh Sharma School of Computer Engineering, Nanyang Technological University, Singapore {Anwitaman,raje0014}@ntu.edu.sg

Abstract. We propose and investigate GoDisco, a gossip-based decentralized mechanism, inspired by social principles and behavior, to disseminate information in online social community networks, using exclusively social links and exploiting semantic context to keep the dissemination process selective to relevant nodes. Such a designed dissemination scheme using gossiping over an egocentric social network is unique and is arguably a concept whose time has arrived: it emulates word-of-mouth behavior and can have interesting applications such as probabilistic publish/subscribe, decentralized recommendation, and contextual advertisement systems, to name a few. Simulation-based experiments show that, despite using only local knowledge and contacts, the system has good global coverage and behavior. Keywords: gossip algorithm, community networks, selective dissemination, social network, egocentric.

1 Introduction

Many modern internet-scale distributed systems are projections of real-world social relations, inheriting the semantic and community contexts of the real world. This is explicit in some scenarios, such as online social networks, and implicit in others, such as networks derived from interactions among individuals (e.g., email exchanges) or traditional file-sharing peer-to-peer systems, where people with similar interests self-organize into an overlay that is semantically clustered according to their tastes [6,13]. Recently, various efforts to build peer-to-peer online social networks (P2P OSNs) [3] are also underway. Likewise, virtual communities are formed in massively multiplayer online games (MMOGs). In many of these social information systems, it is often necessary to have mechanisms to disseminate information (i) effectively - reaching the relevant people who would be interested in the information while not bothering others who won't be, (ii) efficiently - avoiding duplication and latency, in a (iii) decentralized environment - scaling without global infrastructure, knowledge and coordination, and in a (iv) reliable manner - dealing with temporary failures of a subset of the population, and ensuring information quality. M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 227–238, 2011. (c) Springer-Verlag Berlin Heidelberg 2011


Consider a hypothetical scenario involving the circulation of a call for papers (CFP) to academic peers. One often posts such a CFP to relevant mailing lists. Such an approach is essentially analogous to a very simplistic publish/subscribe (pub/sub) system. If the scopes of the mailing lists are very restricted, then many such mailing lists would be necessary. However, people who have not explicitly subscribed to such a mailing list will miss out on the information. Likewise, depending on the scope of topics discussed in the mailing list, there may be too many posts which are actually not relevant. A CFP is also propagated by personal email among colleagues and collaborators one knows. The latter approach is unstructured and requires neither any specific infrastructure (unlike the mailing list) nor any explicit notion of subscription. Individuals make autonomous decisions based on their perception of what their friends' interests may be. Likewise, individuals may build local trust metrics to decide which friends typically forward useful or useless information, providing a personal, subjective context to determine the quality of information. On the downside, such an unstructured, word of mouth (gossip/epidemic) approach does not guarantee full coverage, and may also generate many duplicates. Such redundancy can, however, also make the dissemination process robust against failures. Consequently, epidemic algorithms are used in many distributed systems and applications [9,2]. Despite the well recognized role of word of mouth approaches in disseminating information in a selective manner - in real life or over the internet, such as by email or in online social networks - there has not been any algorithmic (designed) information dissemination mechanism leveraging the community structures and semantics available in social information systems. This is arguably partly because of the novelty of the systems, such as P2P OSNs [3], where the mechanism can be naturally applied.
We propose and investigate a gossip based decentralized mechanism, inspired by social principles and behavior, to disseminate information in a distributed setting, using exclusively social links and exploiting semantic context to keep the dissemination process selective to relevant nodes. We explore the trade-offs of coverage, spam and message duplicates, and evaluate our mechanisms over synthetic and real social communities. These designed mechanisms can be useful not only for the distributed social information systems for which we develop them, but may also have wider impact in the longer run - such as in engineering word of mouth marketing strategies, as well as in understanding the naturally occurring dissemination processes which inspired our algorithm designs in the first instance. Experiments on synthetic and real networks show that our proposed mechanism is effective and moderately efficient, though the performance deteriorates in sparse graphs. The rest of the paper is organized as follows. In Section 2, we present related work. We present our approach, including various concepts and the algorithm, in Section 3. Section 4 evaluates the approach using synthetic and real networks. We conclude with several future directions in Section 5.

2 Related Works

Targeted dissemination of information can be carried out efficiently using a structured approach, as is the case with publish/subscribe systems [10] that rely on infrastructure like overlay based application layer multicasting [12] or gossiping techniques [2] to spread the information within a well-defined group. Well-defined groups are not always practical, and alternative approaches to propagate information in unstructured environments are appealing. Broadcasting information to everyone ensures that relevant people get it (high recall), but is undesirable, since it spams many others who are uninterested (poor precision). In mobile ad-hoc and delay tolerant networking scenarios, selective gossiping techniques relying on user profile and context to determine whether to propagate the information have been used. This is analogous to how multiple diseases can spread in a viral manner - affecting only a "susceptible" subset of the population. Autonomous gossiping [8] and SocialCast [7] are two such approaches; they rely on the serendipity of interactions with like-minded users, and depend on the users' physical mobility to bring about such interactions. There are also many works studying naturally occurring (not specifically designed) information spread in community networks - including on the blogosphere [15,11] as well as in online social network communities [5] - and simulation based models to understand such dissemination mechanisms [1] and human behavior [19]. Our work, GoDisco, lies at the crossroads of these works - we specifically design algorithms to effectively (ideally high recall and precision, low latency) and efficiently (low spam and duplication) disseminate information in collaborative social network communities by using some simple and basic mechanisms motivated by naturally occurring human behaviors (similar to word of mouth) using local information. Our work belongs to the family of epidemic and gossip algorithms [18].
However, unlike GoDisco, where we carry out targeted and selective dissemination, traditional approaches [14,2,9,16] are designed to broadcast information within a well-defined group, and typically assume a fully connected graph - so any node can communicate directly with any other node. In GoDisco, nodes can communicate directly only with other nodes with whom they have a social relation. While this constraint seems restrictive, it actually helps direct the propagation within a community of users with shared interest, as well as limiting the spread to uninterested parties. We specifically aim to use only locally available social links, rather than forming semantic clusters, because we want to leverage the natural trust and disincentives to propagate bogus messages that exist among neighbors in an acquaintance/social graph. Similar to GoDisco, the design of JetStream [16] is also motivated by social behavior - but of a different kind, that of reciprocity. JetStream is designed to enhance the reliability and relative determinism of information propagation using gossip, but it is not designed for selective dissemination of information to an interest sub-group. A recent work, GO [20], also performs selective dissemination, but in explicit and well-defined subgroups, and again assuming a fully connected underlying graph. The emphasis of GO is to piggyback multiple messages (if possible) in order to reduce the overall bandwidth consumption of the system. Some of the ideas from these works [16,20] may be applicable as optimizations in GoDisco.

3 Approach

We first summarize various concepts that are an integral part of the proposed solution, followed by a detailed description of the GoDisco algorithm.

3.1 Concepts

System Model: Let G(N, E) be a graph, where N represents the nodes (or users) of the network and E represents the set of edges between nodes. An edge eij ∈ E exists in the network if nodes ni and nj know each other. We define the neighborhood of node ni as the set ℵi, where nj ∈ ℵi iff eij ∈ E. Community and Information Agent: Let I be the set of interests, representing a collection of all the different interests of all the nodes in the network. We assume each node has at least one interest, and Inj represents the collection of all the different interests of node nj. We consider a community to be a set of people with some shared interest. The social links between these people define the connectivity graph of the community. This graph may be connected, or it may comprise several connected components. Such a graph, representing a specific community, is a subgraph G′(N′, E′) where N′ ⊆ N and E′ ⊆ E such that ∃x ∈ I: ∀nj ∈ N′, x ∈ Inj. According to this definition, a node can (and, real data sets show, often does) belong to different communities at the same instance if it has varied interests. Since the subgraph G′(N′, E′) may not be connected, it may be necessary to rely on some nodes from outside the community to pass messages between these mutually isolated subsets. We however need to minimize messages to such nodes outside the community. To that end, we should try to find potential forwarders who can help in forwarding the message to such isolated but relevant nodes. We identify suitable forwarders based on several factors: (i) interest similarity, (ii) the node's history as a forwarder, (iii) the degree of the node and (iv) the activeness of the node. These forwarders, which we call information agents (IAs), help spread a message faster, even if they are not personally interested in the message. History as Forwarder: This parameter captures how good a forwarder a node is. The rationale behind this approach is the principle of reciprocity from the social sciences.
Reciprocity is also used in JetStream [16], and we reuse the same approach. Activeness of a Node: An active node in the network can play a crucial role in the quick dissemination of information, making it a good potential information agent. One way to measure user activeness is in terms of the frequency of visits to an online social network, which can be inferred from any kind of activity the user performs - e.g., posting messages or updating status.
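As an illustration of the community definition above, the subgraph G′(N′, E′) induced by a shared interest x can be computed as follows (a minimal Python sketch; the data layout and function name are our own, not from the paper):

```python
def community_subgraph(edges, interests, x):
    """Return (N', E') of the community subgraph for shared interest x.

    edges: set of frozenset node pairs; interests: dict node -> set of interests.
    The resulting subgraph may be disconnected, which is exactly why
    GoDisco needs information agents (IAs) to bridge isolated components.
    """
    members = {n for n, i_n in interests.items() if x in i_n}
    sub_edges = {e for e in edges if e <= members}  # keep edges fully inside N'
    return members, sub_edges

# Toy network: nodes 1, 2 and 4 share interest "p2p", but node 4 connects
# to the others only through node 3, which does not share that interest.
edges = {frozenset(p) for p in [(1, 2), (2, 3), (3, 4)]}
interests = {1: {"p2p"}, 2: {"p2p", "ml"}, 3: {"ml"}, 4: {"p2p"}}
members, sub_edges = community_subgraph(edges, interests, "p2p")
# members == {1, 2, 4}; sub_edges == {frozenset({1, 2})}, so node 4 is isolated
```

Node 3 in this toy example is exactly the kind of outside node that could act as an information agent, bridging the isolated member 4 back to the rest of the community.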

Duplication avoidance: We use the concept of social triads [17] to avoid message duplication. In a social network, nodes can typically be aware of all the neighbors of each of their neighboring nodes. This knowledge can be exploited to reduce duplication. Interest classification and node categorization: Interests can typically be classified using a hierarchical taxonomy. For simplicity, as well as to avoid too sparse and poorly connected communities, we restrict it to two levels, which we name (i) Main Category (MC) and (ii) Subcategory (SC). For example, web mining and data mining are closer to machine learning than to communication networks. So data mining, web mining and machine learning would belong to one main category, while networking related topics would belong to a different main category. Within a main category, the similarity of interest among different categories can again vary by degree; for example, web mining and data mining are relatively similar compared to machine learning. So at the next level of categorization we put data mining and web mining under one subcategory and machine learning under another subcategory. To summarize, two different main categories are more distant than two different subcategories within the same main category. Information Agent Classification: Based on interest similarity, we categorize different levels of IAs with respect to their tolerance for spam messages. For a particular message whose interest falls under SC11 of MC1, we classify the levels of IA in the following manner: Level 1: Nodes having an interest in a different subcategory under the same main category (e.g., a node having an interest in MC1 SC12). Ideally such nodes should not be considered spammed nodes, as they have highly similar interests and might themselves be interested in this kind of message. Level 2: For nodes having a good history as a forwarder, an irrelevant message can be considered spam.
However, they have a high tolerance for such messages - that is why they have a good forwarding history. These nodes will typically cooperate with others based on the principle of reciprocity. Level 3: Highly active nodes with no common interest can play an important role in quick dissemination, but their high level of activity does not imply that they would be tolerant of spam, and such information agents should be chosen with the lowest priority. While selecting IAs from non-similar communities, the maximum number of nodes should be chosen from Level 1, since they are more likely to be able to connect back to other interested users, and then from Levels 2 and 3.
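The three IA levels can be captured by a small classification routine (an illustrative sketch; the function name and the history/activeness thresholds are our own assumptions, since the paper does not fix concrete values):

```python
def ia_level(msg_main_cats, node_interests, mc_of, history, activeness,
             hist_thresh=0.5, act_thresh=0.5):
    """Classify a non-relevant neighbor into IA levels 1-3 (None = unsuitable).

    mc_of maps each interest (subcategory) to its main category;
    history and activeness are assumed normalized to [0, 1].
    """
    node_main_cats = {mc_of[i] for i in node_interests}
    if msg_main_cats & node_main_cats:
        return 1  # same main category, different subcategory: near-relevant
    if history >= hist_thresh:
        return 2  # good forwarding history: spam-tolerant due to reciprocity
    if activeness >= act_thresh:
        return 3  # highly active but spam-intolerant: lowest priority
    return None

mc_of = {"data mining": "MC1", "web mining": "MC1", "networks": "MC2"}
# A message classified under a subcategory of MC1:
assert ia_level({"MC1"}, {"web mining"}, mc_of, 0.0, 0.0) == 1
assert ia_level({"MC1"}, {"networks"}, mc_of, 0.8, 0.1) == 2
assert ia_level({"MC1"}, {"networks"}, mc_of, 0.1, 0.9) == 3
```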

3.2 GoDisco

GoDisco consists of two logically-independent modules - the control phase runs in the background to determine system parameters and self-tune the system, while the dissemination phase carries out the actual dissemination of messages.


1. Control Phase. Each node regularly transmits its interest and degree information to its neighbors. Each node also monitors (infers) its neighbors' activeness and forwarding behavior. The latter two parameters are updated after each dissemination phase. Neighboring nodes that spread/forward a message further to a larger number of neighbors are rewarded in the future (based on the reciprocity principle). Also, nodes that are more active are considered better IAs (potential forwarders) than less active nodes. Every node maintains a tuple ⟨h, d, a⟩ for each of its neighbors, reflecting their history as a forwarder, degree and activeness respectively. During the dissemination phase, non-relevant nodes are ranked according to a weighted sum hα + dβ + aγ to determine potential information agents (IAs), where α, β, γ are parameters that set the priority of the three variables such that α + β + γ = 1. 2. Dissemination Phase. We assume that the originator of the message provides the necessary meta-information - a vector of interest categories that the message fits (msgprofile), as well as a tuple of dissemination parameters (their use is explained below). The message payload is thus: <message, msgprofile, parameters, dampingflag>. dampingflag is a flag used to dampen the message propagation when no more relevant nodes are found. A time-to-live could also be used instead. Algorithm 1 illustrates the logic behind sending the message to relevant nodes, as well as collecting non-relevant nodes, ranking their suitability and using them as information agents (IAs) for disseminating a message based on multiple criteria. We have adopted an optimistic approach for forwarding a message to neighboring nodes. A node forwards a message to all its neighbors with at least some matching interest in the message. For example, a node with interests in data mining and peer-to-peer systems will get a message intended for people with interests in data mining or web mining.
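The control-phase ranking of non-relevant neighbors by the weighted score hα + dβ + aγ can be sketched as follows (illustrative code; the dictionary layout and function name are ours, while the default weights and top-X cut match the parameters described in the text):

```python
def rank_information_agents(neighbors, alpha=0.50, beta=0.30, gamma=0.20,
                            top_x=0.10):
    """Rank non-relevant neighbors and keep the top X% as potential IAs.

    neighbors: dict node_id -> (h, d, a) with each component normalized
    to [0, 1]; alpha + beta + gamma should sum to 1.
    """
    def score(t):
        h, d, a = t
        return alpha * h + beta * d + gamma * a

    ranked = sorted(neighbors, key=lambda n: score(neighbors[n]), reverse=True)
    return ranked[:max(1, round(len(ranked) * top_x))]

# Ten neighbors with identical degree and activeness; n9 has the best
# forwarding history, so it is the single top-10% information agent.
neighbors = {f"n{i}": (i / 10, 0.5, 0.5) for i in range(10)}
print(rank_information_agents(neighbors))  # -> ['n9']
```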
Of the nodes who are not directly relevant, we determine nodes with common interest within the main categories from message and users’ proﬁles (using M C(.)), helping identify Level 1 IAs, and forward the message to these Level 1 IAs probabilistically with probability p1 which is a free parameter. For level 1 IAs, the message though possibly not directly relevant, is still somewhat relevant, and the “perception of spam” is expected to be weaker. If all the neighbors of a node are totally non-relevant, then the dissemination stops, since such a node is at the boundary of the community. However, existence of some isolated islands or even a large community with same interests is possible. To alleviate the same, some boundary nodes can probabilistically send random walkers. We implement such random walkers with preference to neighbors with higher score (hα + dβ + aγ) and limit the random walks with limited timeout (in the experiments we chose time-out equal to the network’s diameter). If a random-walker revisits a speciﬁc node, then it is forwarded to a diﬀerent neighbor than in the past, with a ceiling to the number of times such revisits are forwarded (in experiments, this was set to two).

4 Evaluation

We evaluate GoDisco based information dissemination on both real and synthetically generated social networks. User behavior vis-à-vis the degree of activeness and history of forwarding was assigned uniformly at random from some arbitrary scale (and normalized to a value between 0 and 1).

1. Synthetic graphs. We used a preferential attachment based Barabási graph generator1 to generate a synthetic network of ∼20,000 nodes and a diameter of 7 to 8 (calculated using ANF [4]).

Algorithm 1. GoDisco: actions of node ni (which has a relevant profile, i.e., msgprofile ∩ Ini ≠ ∅) with neighbors ℵi upon receiving the message payload <message, msgprofile, parameters, dampingflag> from node nk

1: for all nj s.t. nj ∈ ℵi and nj ∉ ℵk do
2:   if msgprofile ∩ Inj ≠ ∅ then
3:     DelvMsgTo(nj); {Forward the message to neighbors with some matching interest in the message}
4:   else
5:     if MC(msgprofile) ∩ MC(Inj) ≠ ∅ then
6:       With probability p1, DelvMsgTo(nj); {Message is forwarded with probability p1 to Level 1 IAs}
7:     else
8:       NonRelv ← nj; {Append to a list of nodes with no (apparent) common interest}
9:     end if
10:   end if
11: end for
12: if |NonRelv| == |ℵi| then
13:   if dampingflag == TRUE then
14:     Break; {Stop and do not execute any of the steps below. Note: in the experimental evaluation, the "no damping" scenario corresponds to omitting this IF statement/ignoring the damping flag entirely}
15:   else
16:     dampingflag := TRUE; {Set the damping flag}
17:     Sort NonRelv in descending order of score hα + dβ + aγ; {h, d, a obtained from the control phase; α, β, γ from parameters set by the message originator}
18:     IANodes := top X% of the sorted nodes;
19:     for all nj ∈ IANodes do
20:       DelvMsgTo(nj); {Carry out control phase tasks in the background}
21:     end for
22:   end if
23: else
24:   dampingflag := FALSE {Reset the damping flag}
25: end if
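The forwarding decisions of Algorithm 1 can be sketched in runnable form as follows (a Python sketch under our own assumptions: the data layout, field names and the top-10% IA cut are illustrative choices, and the social-triads check is simplified to excluding the sender's known neighbors):

```python
import random

def on_receive(neighbors, msg, p1=0.5):
    """Decide which neighbors receive the message; returns a set of ids.

    neighbors: dict id -> {"interests": set, "main_cats": set, "score": float}
        where "score" is the precomputed control-phase value h*alpha + d*beta + a*gamma.
    msg: {"profile": set, "main_cats": set, "sender_neighbors": set, "damping": bool};
        msg["damping"] is updated in place.
    """
    deliver, non_relv = set(), []
    # Social triads: skip neighbors the sender already knows (duplication avoidance).
    candidates = set(neighbors) - msg["sender_neighbors"]
    for nj in candidates:
        prof = neighbors[nj]
        if msg["profile"] & prof["interests"]:
            deliver.add(nj)                      # some matching interest: forward
        elif msg["main_cats"] & prof["main_cats"]:
            if random.random() < p1:
                deliver.add(nj)                  # Level 1 IA, with probability p1
        else:
            non_relv.append(nj)
    if len(non_relv) == len(candidates):         # at the community boundary
        if msg["damping"]:
            return deliver                       # already damped once: stop
        msg["damping"] = True
        non_relv.sort(key=lambda j: neighbors[j]["score"], reverse=True)
        deliver.update(non_relv[:max(1, len(non_relv) // 10)])  # top X% as IAs
    else:
        msg["damping"] = False
    return deliver
```

For instance, with one neighbor matching the message profile and one entirely unrelated neighbor, only the matching neighbor receives the message and the damping flag is reset; with only unrelated neighbors, the best-scored one is used as an information agent and the damping flag is set.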

1 http://www.cs.ucr.edu/~ddreier/barabasi.html

Random cliques: People often share some common interests with their immediate friends and form cliques, but do not form a contiguous community. For example, all soccer fans in a city do not necessarily form one community; instead, a smaller
bunch of buddies pursue the interest together. To emulate such behavior, we pick random nodes with a relatively small number of neighbors (between 50 and 200) and assign these cliques some common interest. Associativity based interest assignment: People often form a contiguous subgraph within the population where all (or most) members of the subgraph share some common interest, and form a community. This is particularly true in professional social networks. To emulate such a scenario, we randomly picked a node (the center node), applied a restricted breadth-first traversal covering ∼1000 nodes, and assigned the interests of these nodes to be similar to that of the center node, fading the associativity with distance from the center node. Totally random: Interests of the nodes were assigned uniformly at random. 2. Real network - the DBLP network: We use the giant connected component of the co-authorship graph from DBLP records of papers published in 1169 unique conferences between 2004 and 2008, which comprises 284528 unique authors. We classified the categories of the conferences (e.g., data mining, distributed systems, etc.) to determine the interest profiles of the authors.

4.1 Results

In this section we describe the various metrics we observed in our experiments. The default parameter values were p1 = 0.5, X = 10%, α = 0.50, β = 0.30 and γ = 0.20. Other choices provided similar qualitative results. We also report results from a limited exploration of parameters in a brief discussion later (see Figure 3). 1. Message dissemination: Figure 1 shows the spread of dissemination over time in the various networks. The plots compare three different mechanisms of propagating information - (i) with damping (D), (ii) without damping (ND), and (iii) with the additional use of random walkers (RW) in the case with damping - and plot the number of relevant nodes (R) and total nodes, including non-relevant ones (T), who receive the message. With the use of the damping mechanism, the number of non-relevant nodes dropped sharply, with only a small loss in the coverage of relevant nodes. This shows the effectiveness of the damping mechanism in reducing spam. Using random walkers provides better coverage of relevant nodes, while only marginally more non-relevant nodes receive the message. This effect is most pronounced in the case of the DBLP graph. The associativity based synthetic graph better resembles real networks, and a gossiping based mechanism is also expected to work better in communities which are not heavily fragmented. So, due to space constraints, we will mostly confine our results to the associativity based and DBLP graphs, even though we ran all the experiments described next on all the networks. Where qualitatively different results were observed for the other graphs, we mention them as and when necessary. 2. Recall: Recall is the ratio of the number of relevant nodes who get a message to the total number of relevant nodes in the graph. We compare the recall for the damping (D) vs. non-damping (ND) mechanisms in Figures 2(a) and 2(b)

Fig. 1. Message dissemination: (a) Random Cliques (Rand. Cliq.), (b) Total Random (Tot. Rand.), (c) Associativity (Asst.), (d) DBLP

Fig. 2. Recall (R) & Precision (P) for DBLP and Associativity (Asst): (a) R, Asst; (b) R, DBLP; (c) P, Asst; (d) P, DBLP

for the associativity based and DBLP networks respectively. The use of the damping mechanism leads to a slight decrease in recall. In the associativity based interest network, the recall value gets very close to 1. Even in random cliques the recall value came very close to one, though relatively more slowly than in the associativity based network; but with totally random assignment of individuals' interests, a recall value of only up to 0.9 could be achieved (non-damping), while random walkers provide relatively more improvement (recall of around 0.8) than the case of using damping without random walkers (recall of roughly 0.7), demonstrating both the limitations of a gossip based approach when the audience is highly fragmented and the effectiveness of random walkers in reaching some of these isolated groups at a low overhead. In the case of the DBLP network (Figure 2(b)), the recall value is reasonably good. Since the DBLP graph comprises an order of magnitude more nodes (around fourteen times more) than the synthetic graphs, the absolute numbers observed in the DBLP graph cannot directly be compared to the results observed in the synthetic graphs. The use of random walkers provides significant improvements in the dissemination process. 3. Precision: Precision is the ratio of the number of relevant nodes who get a message to the total number of nodes (relevant plus irrelevant) who get the message. Figures 2(c) and 2(d) show the precision values measured for the associativity based and DBLP networks respectively. We notice that in the DBLP network, with real semantic information, the precision is in fact relatively better than what is achieved in the synthetic associativity based network. From
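The recall and precision metrics just defined reduce to simple set arithmetic over the dissemination outcome (a small sketch with made-up numbers, not the paper's measured data):

```python
def recall_precision(received, relevant):
    """Recall = reached relevant / all relevant; precision = relevant / all reached."""
    hits = len(received & relevant)
    recall = hits / len(relevant) if relevant else 1.0
    precision = hits / len(received) if received else 1.0
    return recall, precision

# Suppose 8 of 10 relevant nodes were reached, plus 2 spammed non-relevant nodes:
received = set(range(8)) | {100, 101}
relevant = set(range(10))
print(recall_precision(received, relevant))  # -> (0.8, 0.8)
```

Damping trades a little recall (fewer relevant nodes reached) for better precision (fewer spammed non-relevant nodes), which is exactly the trade-off visible in Figure 2.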

Fig. 3. α and γ comparison on the random clique based network, and effect of feedback on Recall (R) & Precision (P): (a) message dissemination, (b) P & R, (c) R, DBLP, (d) P, DBLP

this we infer that in real networks, the associated semantic information in fact enables a more targeted dissemination. We also observe the usual trade-off with the achieved recall. 4. Parameter exploration: In the associativity based network, because of a tightly coupled contiguous community, the quality of dissemination is not very sensitive to the parameter choice, while in scattered networks like random cliques it is. To evaluate the effect of γ on the network, we perform experiments on the random clique based network with γ = 0.60, β = 0.30 and α = 0.10, and compare these with the scenarios using the default values of α = 0.50, β = 0.30 and γ = 0.20. A greater value of γ puts a greater emphasis on highly active users as potential IAs, who can help improve the speed of dissemination, but at the cost of spamming more uninterested users. The results shown in Figures 3(a) and 3(b) confirm this intuition. Figure 3(a) shows the total nodes (T) and relevant nodes (R) receiving a message. Figure 3(b) compares the recall and precision for the two choices of parameters. 5. Message duplication: Nodes may receive duplicates of the same message. We use proximity to reduce such duplicates. Figures 4(a) and 4(c) show, for the associativity based and DBLP networks respectively, the number of duplicates avoided during the dissemination process, both for nodes for whom the message is relevant (Relv) and for nodes for whom it is irrelevant (IrRelv). It is interesting to note that with damping, the absolute number of irrelevant nodes getting the message is already low, so the savings in duplication are also naturally low. The results show that the use of proximity is an effective mechanism to significantly reduce such unnecessary network traffic. 6. Volume of duplicates: Figures 4(b) and 4(d) measure the volume of duplicates received by individual nodes for the associativity based and DBLP networks respectively during the dissemination process.
The observed trade-offs of using random walkers with damping, or of not using damping, are intuitive. 7. Effect of Feedback: Figure 3(c) shows the effect of feedback on recall for the DBLP network. In the case of the associativity based network, community members are very tightly coupled with few exceptions, so there is no visible improvement with feedback (not shown). However, in the case of DBLP, for the non-damping

Fig. 4. Duplication saved using proximity (DS) & Duplicates received (DR) for DBLP and Associativity (Asst): (a) DS, Asst; (b) DR, Asst; (c) DS, DBLP; (d) DR, DBLP

as well as the random walk based schemes, where non-relevant nodes are leveraged for disseminating the information, we observe a significant improvement in recall when comparing the spread of the first message in the system with that of the thousandth message (identified with an 'F' in the legends to indicate the scenarios with feedback), as the feedback mechanism helps the individual nodes self-organize and choose better information agents over time, which accelerates the process of connecting fragmented communities. Interestingly, this improvement in recall does not compromise the precision, which in fact even improves slightly, further confirming that the feedback steers the dissemination process toward more relevant information agents.

5 Conclusion and Future Work

We have described and evaluated a selective gossiping based information dissemination mechanism (namely GoDisco), which is constrained to communication only among nodes socially connected to each other, and which leverages users' interests and the fact that users with similar interests often form a community, in order to perform directed dissemination of information. GoDisco is nevertheless a first work of its kind, leveraging interest communities for directed information dissemination using exclusively social links. We have a multidirectional plan for future extensions of this work. We plan to apply GoDisco in various kinds of applications, including information dissemination in peer-to-peer online social networks [3] - for example, for probabilistic publish/subscribe systems, or to contextually advertise products to other users of a social network. We also plan to improve the existing schemes in various ways, such as incorporating security mechanisms and disincentives for antisocial behavior.

References
1. Apolloni, A., Channakeshava, K., Durbeck, L., Khan, M., Kuhlman, C., Lewis, B., Swarup, S.: A study of information diffusion over a realistic social network model. In: Int. Conf. on Computational Science and Engineering (2009)


2. Birman, K.P., Hayden, M., Ozkasap, O., Xiao, Z., Budiu, M., Minsky, Y.: Bimodal multicast. ACM Trans. Comput. Syst. 17(2) (1999)
3. Buchegger, S., Schiöberg, D., Vu, L.H., Datta, A.: PeerSoN: P2P social networking - early experiences and insights. In: Proc. of the 2nd ACM Workshop on Social Network Systems (2009)
4. Palmer, C.R., Gibbons, P.B., Faloutsos, C.: ANF: A fast and scalable tool for data mining in massive graphs. In: KDD (2002)
5. Cha, M., Mislove, A., Gummadi, K.P.: A measurement-driven analysis of information propagation in the Flickr social network. In: WWW (2009)
6. Cholvi, V., Felber, P., Biersack, E.W.: Efficient search in unstructured peer-to-peer networks. In: SPAA (2004)
7. Costa, P., Mascolo, C., Musolesi, M., Picco, G.P.: Socially-aware routing for publish-subscribe in delay-tolerant mobile ad hoc networks. IEEE Journal on Selected Areas in Communications 26(5) (June 2008)
8. Datta, A., Quarteroni, S., Aberer, K.: Autonomous gossiping: A self-organizing epidemic algorithm for selective information dissemination in wireless mobile ad-hoc networks. Semantics of a Networked World (2004)
9. Demers, A., Greene, D., Hauser, C., Irish, W., Larson, J., Shenker, S., Sturgis, H., Swinehart, D., Terry, D.: Epidemic algorithms for replicated database maintenance. In: PODC. ACM, New York (1987)
10. Eugster, P.T., Felber, P.A., Guerraoui, R., Kermarrec, A.-M.: The many faces of publish/subscribe. ACM Comput. Surv. 35(2) (2003)
11. Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A.: Information diffusion through blogspace. In: WWW (2004)
12. Hosseini, M., Ahmed, D.T., Shirmohammadi, S., Georganas, N.D.: A survey of application-layer multicast protocols. IEEE Communications Surveys & Tutorials 9(3) (2007)
13. Iamnitchi, A., Ripeanu, M., Santos-Neto, E., Foster, I.: The small world of file sharing. IEEE Transactions on Parallel and Distributed Systems
14. Karp, R., Schindelhauer, C., Shenker, S., Vöcking, B.: Randomized rumor spreading. In: FOCS. IEEE Computer Society, Los Alamitos (2000)
15. Kumar, R., Novak, J., Raghavan, P., Tomkins, A.: On the bursty evolution of blogspace. World Wide Web 8(2) (2005)
16. Patel, J.A., Gupta, I., Contractor, N.: JetStream: Achieving predictable gossip dissemination by leveraging social network principles. In: Proceedings of the Fifth IEEE Int. Symposium on Network Computing and Applications (2006)
17. Rapoport, A.: Spread of information through a population with socio-structural bias: I. Assumption of transitivity. Bulletin of Mathematical Biophysics 15 (1953)
18. Shah, D.: Gossip algorithms. Found. Trends Netw. 3(1) (2009)
19. Song, X., Lin, C.-Y., Tseng, B.L., Sun, M.-T.: Modeling and predicting personal information dissemination behavior. In: KDD 2005: Proceedings of the Eleventh ACM SIGKDD Int. Conf. on Knowledge Discovery in Data Mining. ACM, New York (2005)
20. Vigfusson, Y., Birman, K., Huang, Q., Nataraj, D.P.: GO: Platform support for gossip applications. In: P2P (2009)

Mining Frequent Subgraphs to Extract Communication Patterns in Data-Centres

Maitreya Natu1, Vaishali Sadaphal1, Sangameshwar Patil1, and Ankit Mehrotra2

1 Tata Research Development and Design Centre, Pune, India
2 SAS R&D India Pvt Limited, Pune, India
{maitreya.natu,vaishali.sadaphal,sangameshwar.patil}@tcs.com, [email protected]

Abstract. In this paper, we propose the use of graph-mining techniques to understand the communication pattern within a data-centre. We present techniques to identify frequently occurring sub-graphs within a temporal sequence of communication graphs, and argue that the identification of such frequently occurring sub-graphs can provide many useful insights into the functioning of the system. We demonstrate how existing frequent sub-graph discovery algorithms can be modified for the domain of communication graphs to provide computationally lightweight and accurate solutions. We present two algorithms for extracting frequent communication sub-graphs, together with a detailed experimental evaluation of their correctness and efficiency.

1 Introduction

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 239–250, 2011. © Springer-Verlag Berlin Heidelberg 2011

With the increasing scale and complexity of today's data-centers, it is becoming increasingly difficult to analyze the as-is state of the system. Data-center operators have a pressing need for insight into the as-is state of the system, such as inter-component dependencies, heavily used resources, heavily used communication patterns, occurrence of changes, etc. Modern enterprises support two broad types of workloads and applications: transactional and batch. The communication patterns in these systems are dynamic and keep changing over time. For instance, in a batch system, the set of jobs executed on a day depends on several factors, including the day of the week, the day of the month, the addition of new reporting requirements, etc. The communication patterns for both transactional and batch systems can be observed as a temporal sequence of communication graphs. We argue that there is a need to extract and analyze frequently occurring subgraphs from a sequence of communication graphs to answer various analysis questions. In scenarios where the discovered frequent subgraph is large, it provides a representative graph of the entire system. Such a representative graph becomes extremely useful in scenarios where the communication graphs change dynamically over time. The representative graph can


provide a good insight into the recurring communication pattern of the system. The discovered representative graph can be used in a variety of ways. For instance: (a) it can be used to predict future communication patterns; (b) it can be used to perform what-if analysis; (c) given a representative graph, various time-consuming analysis operations, such as dependency discovery, building workload-resource-utilization models, and performing slack analysis, can be done a priori in an offline manner to aid quicker online analysis in the future. In scenarios where the discovered frequent subgraph is small, it can be used to zoom into the heavily used communication patterns. The components in such a graph can be identified as critical components and further analyzed for appropriate resource provisioning, load balancing, etc. In this paper we present techniques to identify frequently occurring subgraphs within a set of communication graphs. The existing graph-mining solutions [5,6,4] for frequent subgraph discovery address a more complex problem: most existing techniques assume that graph components can have non-unique identifiers, as in, for instance, the problem of mining chemical compounds to find recurrent substructures. The presence of non-unique identifiers results in an explosion in the number of possible sub-graph combinations; further, operations such as graph isomorphism testing become highly computation-intensive. The problem of finding frequent subgraphs in a set of communication graphs is simpler, because each graph component can be assigned a unique identifier. In this paper, we argue that simpler graph-mining solutions can be developed for this problem. We present two techniques for discovering frequent subgraphs. (1) We first present a bottom-up approach where we incrementally take combinations of components and compute their support.
The key idea behind the proposed modification is that the support of subgraphs at the next level can be estimated from the support of the subgraphs at lower levels using probability theory. (2) We then present a top-down approach where, instead of analyzing components in a graph, we consider the entire graph as a whole and analyze the set of graphs, using simple matrix operations to mine frequent subgraphs. The key idea behind this algorithm is that the property of unique component identifiers can be used to assign a common structure to all graphs; this common structure can be used to process each graph as a whole in order to identify frequently occurring subgraphs. The main contributions of this paper are as follows: (1) We present a novel application of frequent subgraph discovery to extract communication patterns. (2) We present a modification of the existing bottom-up Apriori algorithm to improve efficiency. We also present a novel top-down approach for frequent subgraph discovery in communication graphs. (3) We apply the proposed algorithms to a real-world batch system. We also present a comprehensive experimental evaluation of the techniques and discuss the effective application areas of the proposed algorithms.

2 Related Work

Graph-mining techniques have been applied to varied domains, e.g., extracting common patterns in chemical compounds and genetic formulae, and extracting common structures in the Internet and in social networks [3,2]. In this paper, we consider extracting commonality in communication patterns, given that the communication patterns change dynamically. Our work differs in that we take advantage of the presence of unique node identifiers and propose new algorithms to identify frequent subgraphs. Frequent subgraph mining techniques proposed in the past [5], [6], [1] use a bottom-up approach; in this paper, we propose a modification to the traditional bottom-up approach and present a novel top-down approach for frequent subgraph discovery. Most past research assumes non-unique component identifiers and is directed towards handling the explosion of the candidate space caused by multiple vertices and edges with the same identifier. When applied to the domain of communication graphs, the problem of mining frequent subgraphs translates to a much simpler problem, because no two nodes or links in a communication network have the same identifier. Valiente [7] proposes algorithms for testing graph isomorphism and computing the largest common subgraph in trees or graphs with unique node labels. In this paper, we address the problem of mining frequent subgraphs in a set of graphs with unique node and link identifiers.

3 Problem Description

In this section, we first define various terms used in this paper. We then systematically map the addressed problem to the problem of frequent subgraph discovery, present the problem definition, and introduce the proposed solution approaches.

3.1 Terms Used

We first define various terms used in this paper.

Size(G): The size of a graph G is defined as the number of edges present in the graph.

Subgraph(G): A graph G'(V', E') is a subgraph of a graph G(V, E) if and only if V' ⊆ V and E' ⊆ E.

Support(G', G): Given a set of n graphs G = {G1, G2, ..., Gn}, a subgraph G' has support s/n if G' is a subgraph of s graphs out of the set {G1, G2, ..., Gn}.

Frequent Subgraphs(min support, G): Given a set of graphs G = {G1, G2, ..., Gn} and a required minimum support min support, a subgraph G' is a frequent subgraph if and only if Support(G', G) ≥ min support. Note that the resulting frequent subgraph can be a disconnected graph.
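To make the definitions concrete, the support computation can be sketched in a few lines of Python. This is an illustration, not the authors' code; the graphs X, Y, and Z below are hypothetical edge sets chosen to be consistent with the running example of Figure 1a (the figure itself is not reproduced here).

```python
# A graph over uniquely identified nodes can be modeled as a set of edges,
# where each edge is a frozenset pair of node labels.
def edge(u, v):
    return frozenset((u, v))

def support(sub_edges, graphs):
    """Support(G', G): fraction of graphs containing every edge of sub_edges."""
    return sum(1 for g in graphs if sub_edges <= g) / len(graphs)

# Hypothetical example graphs (stand-ins for X, Y, Z of Figure 1a).
X = {edge('a', 'b'), edge('b', 'c'), edge('c', 'd'), edge('b', 'e')}
Y = {edge('a', 'b'), edge('b', 'c'), edge('c', 'd'), edge('c', 'e')}
Z = {edge('a', 'b'), edge('b', 'c'), edge('b', 'e')}
GRAPHS = [X, Y, Z]

# Edge (c-e) appears only in Y, so its support is 1/3.
assert support({edge('c', 'e')}, GRAPHS) == 1 / 3
```

With min support = 2/3, the single-edge subgraph (c − e) would thus be discarded, exactly as in the running example.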

3.2 Problem Definition

We now map the problem of identifying communication patterns to the problem of frequent subgraph discovery. Consider a network topology T represented by a graph GT(VT, ET), where the vertices VT represent network nodes and the edges ET represent network links. Each vertex (and each edge) in GT can be identified using a unique identifier. The problem of identifying communication patterns in the network from a set of communication traces can be mapped to the problem of frequent subgraph discovery as follows:

1. A communication trace C consists of the links (and nodes) being used by the system over time.
2. The communication trace Ct, at time t, can be represented as a graph Gt(Vt, Et), where the sets Vt and Et represent the network nodes and links being used in time window t. (Note that Vt ⊆ VT and Et ⊆ ET.)
3. A set of communication traces C1, C2, ..., Cn can be represented by a set of graphs G1(V1, E1), G2(V2, E2), ..., Gn(Vn, En).
4. The problem of identifying frequent communication patterns in a trace C can thus be mapped to the problem of identifying frequent subgraphs in the set of graphs G1(V1, E1), G2(V2, E2), ..., Gn(Vn, En).

The problem can then be defined as follows: given a set of graphs G1, G2, ..., Gn and a required support min support, compute the set F of all frequently occurring subgraphs F1, F2, ..., Fm.
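The mapping from traces to graphs in steps 1–3 above can be sketched as follows. The trace representation (a simple node sequence, as a traceroute-style path would give) and the example traces are our own illustration.

```python
def trace_to_graph(trace):
    """Map a communication trace (a node sequence) to a graph Gt(Vt, Et):
    Vt is the set of nodes seen, Et the set of links between consecutive hops."""
    nodes = set(trace)
    links = {frozenset((u, v)) for u, v in zip(trace, trace[1:])}
    return nodes, links

# Two hypothetical traces observed in consecutive time windows:
V1, E1 = trace_to_graph(['a', 'b', 'c', 'd'])
V2, E2 = trace_to_graph(['a', 'b', 'e'])
assert frozenset(('b', 'c')) in E1 and len(E1) == 3
assert V2 == {'a', 'b', 'e'}
```

Because node and link identifiers are unique, every trace window maps deterministically onto a subgraph of the topology GT, which is what makes the simpler mining algorithms of the following sections possible.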

Fig. 1. (a) Example graphs, (b) Frequent subgraphs with minimum support of 2/3

Fig. 2. (a) Execution of Apriori algorithm, (b) Execution of Approx-Apriori algorithm

4 Proposed Bottom-Up Approach: Algorithm Approx-Apriori

In the past, the problem of identifying frequent subgraphs has been solved using bottom-up approaches; Apriori [4] is a representative algorithm of this category. The traditional Apriori algorithm broadly involves two steps:

Step 1: Generate subgraphs of size (k + 1) by identifying pairs of frequent subgraphs of size k that can be combined.
Step 2: Compute the support of the subgraphs of size (k + 1) to identify frequent subgraphs.

We use a running example to explain the Apriori algorithm and the other algorithms presented in the next sections.

Running example: Consider the set of three graphs shown in Figure 1a, with min support = 2/3. Figure 2a shows the subgraphs of size 1, 2, 3, and 4 generated by the iterations of the traditional Apriori algorithm. Consider the subgraphs of size 1. The subgraph with the single edge (c − e) has a support less than 2/3 and is discarded from further analysis. The remaining subgraphs are used to build subgraphs of size 2, and a similar process continues in subsequent iterations to identify subgraphs of size 3 and 4. This approach incurs two kinds of overhead: (1) joining two size-k frequent subgraphs to generate one size-(k + 1) subgraph, and (2) counting the frequency of these subgraphs. Step (1) involves forming all possible combinations of level-k subgraphs, resulting in a large number of candidates and making the approach computation-intensive. The problem becomes more acute when the minimum support required for a subgraph to be identified as frequent is smaller. In this paper, we modify the first step of the Apriori algorithm by pruning the search space: we intelligently select the size-k subgraphs by estimating their support from the support of their constituent size-(k − 1) subgraphs. We use the support of a subgraph of size (k − 1) as the probability of its occurrence in the given set of graphs. Thus, given the probabilities of occurrence of two subgraphs of size (k − 1), their product is used as the probability of occurrence of the combined subgraph of size k, and hence as an estimate of its support. We prune a size-k subgraph if its estimated support is less than a desired support, pruning min support. Note that this computation assumes that the occurrences of the two constituent subgraphs are independent; the estimated support thus tends to move away from the actual support in situations where the independence assumption does not hold. Furthermore, the error propagates in subsequent iterations as the graphs grow in size, which may result in larger inaccuracy. We hence propose to relax the pruning min support with every iteration: it is equal to min support in the first iteration, and we decrease it by a constant REDUCTION FACTOR in every subsequent iteration. The pruning thus performed narrows the search space and decreases the execution time. In Section 7, we show through experimental evaluation that


with appropriate use of REDUCTION FACTOR, the Approx-Apriori algorithm gives reasonably accurate results. We next present the steps involved in the proposed approach.

Input:
1. G = {G1(V1, E1), ..., Gn(Vn, En)}: set of graphs
2. min support: minimum support required to declare a subgraph as frequent

Output: Set of frequently occurring subgraphs.

Initialization:
1. Generate a set GS1 of graphs of size 1, one for each edge in E = E1 ∪ ... ∪ En.
2. Remove a graph Gi^S1 from GS1 if Support(Gi^S1, G) < min support.
3. Set k = 2, where k is the size of the subgraphs; k is incremented (k = k + 1) with every iteration.
4. pruning min support = min support.

Step 1 - Generate subgraphs of size k by identifying pairs of frequent subgraphs of size k − 1 that can be combined: Two subgraphs Gi^Sk and Gj^Sk of size k are combined to create a subgraph Gij^Sk+1 of size k + 1 if and only if k − 1 of their edges are common, i.e., the two subgraphs differ in exactly one edge.

- Estimate the support of the subgraph Gij^Sk+1:
  Estimated Support(Gij^Sk+1) = Prob(Gij^Sk+1), where Prob(Gij^Sk+1) = Prob(Gi^Sk) ∗ Prob(Gj^Sk).
- Prune the subgraph Gij^Sk+1 if Estimated Support(Gij^Sk+1) < pruning min support.
- Decrease the pruning min support for the next iteration:
  pruning min support = pruning min support − REDUCTION FACTOR.

Step 2 - Compute the support of the subgraphs of size k to identify frequent subgraphs.

Repeat Step 1 and Step 2 until subgraphs of all sizes are explored.

Running example: For the running example of Figure 1a, Figure 2b shows the subgraphs of size 1, 2, 3, and 4, with their estimated supports, generated by the iterations of the Approx-Apriori algorithm. Unlike Apriori, Approx-Apriori performs an intelligent selection of pairs using the estimated support. If the estimated support of a size-k subgraph is less than 2/3, the constituent pair of size-(k − 1) subgraphs is never combined. For example, consider the pair of size-3 subgraphs (a − b)(b − c)(c − d) and (a − b)(b − c)(b − e), both having a support of 2/3. The estimated support of the resulting size-4 subgraph is 4/9, which is less than 2/3. Hence the size-4 subgraph (a − b)(b − c)(c − d)(b − e) is not constructed for analysis.
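The steps above can be sketched in Python as follows. This is a simplified reading of the algorithm rather than the authors' implementation: graphs are sets of edges, the join condition is expressed as "the two size-k subgraphs differ in exactly one edge", and the graphs X, Y, Z are hypothetical edge sets consistent with the running example.

```python
from itertools import combinations

def approx_apriori(graphs, min_support, reduction_factor=0.0):
    """Sketch of Approx-Apriori; each graph is a set of frozenset edges.

    The support of a size-(k+1) candidate is first *estimated* as the product
    of the supports of the two size-k subgraphs joined to form it (assuming
    independence); candidates below pruning_min_support are never counted.
    """
    n = len(graphs)

    def true_support(edges):
        return sum(1 for g in graphs if edges <= g) / n

    # Level 1: single-edge subgraphs with sufficient support.
    all_edges = set().union(*graphs)
    level = {frozenset([e]): true_support({e}) for e in all_edges}
    level = {g: s for g, s in level.items() if s >= min_support}
    frequent = dict(level)

    pruning_min_support = min_support
    while level:
        candidates = set()
        for (g1, s1), (g2, s2) in combinations(level.items(), 2):
            if len(g1 & g2) == len(g1) - 1:         # pair differs in one edge
                if s1 * s2 >= pruning_min_support:  # prune by estimated support
                    candidates.add(g1 | g2)
        # Count the true support only for the surviving candidates.
        level = {g: true_support(set(g)) for g in candidates}
        level = {g: s for g, s in level.items() if s >= min_support}
        frequent.update(level)
        pruning_min_support -= reduction_factor     # relax the threshold
    return frequent

# Hypothetical graphs consistent with the running example of Figure 1a.
X = {frozenset(p) for p in [('a', 'b'), ('b', 'c'), ('c', 'd'), ('b', 'e')]}
Y = {frozenset(p) for p in [('a', 'b'), ('b', 'c'), ('c', 'd'), ('c', 'e')]}
Z = {frozenset(p) for p in [('a', 'b'), ('b', 'c'), ('b', 'e')]}
freq = approx_apriori([X, Y, Z], min_support=2 / 3)
```

On these graphs the size-3 subgraphs (a−b)(b−c)(c−d) and (a−b)(b−c)(b−e) are found frequent, while their size-4 join is pruned by its 4/9 estimate, matching the behaviour described above.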

5 Proposed Top-Down Approach: Matrix-ANDing

In building the top-down approach, we treat each graph as a whole and use the following two properties of communication graphs: (1) each node in a network can be identified by a unique identifier, and (2) the network topology is known a priori. These properties can be exploited to assign a common structure


Fig. 3. (a) Matrix representations of graphs X, Y, and Z: Mx, My, Mz. (b) Consolidated matrix Mc.

to all graphs. This common structure can be used to process entire graphs as wholes in order to identify frequently occurring subgraphs. In the following algorithm, we first explain this structure and show how all graphs can be represented in the same manner. We then present a technique to process these structures to extract frequent subgraphs.

Input: (1) G = {G1(V1, E1), ..., Gn(Vn, En)}: set of graphs. (2) min support: minimum support required to declare a subgraph as frequent.

Output: Set of frequently occurring subgraphs.

Initialization: (1) We first identify the maximal set of vertices VT and order these vertices lexicographically. (2) We then represent each graph Gt ∈ G as a |VT| × |VT| matrix Mt using the standard binary adjacency-matrix representation. Note that the nodes in all matrices are in the same order; thus, a cell Mt[i, j] represents the same edge Eij in all the matrices.

Processing:
1. Assign a unique prime number pt as an identifier of each graph Gt ∈ G and multiply all the values in its representative matrix Mt by pt.
2. Given the matrices M1, ..., Mn, compute a consolidated matrix Mc such that, for all i, j ∈ VT, Mc[i, j] is the product of the non-zero entries Mt[i, j] over t = 1, ..., n.
3. Given the set of n graphs G = {G1, ..., Gn} and min support, compute the (n choose n · min support) combinations of the graphs. Compute an identifier for each combination as the product of the identifiers of its constituent graphs; thus, the identifier for a combination of graphs G1, ..., Gk is computed as p1 ∗ ... ∗ pk. The set Gcomb consists of the identifiers of the (n choose n · min support) combinations. Given an identifier Gcomb[l] for a combination of graphs, we can identify whether the edge represented by Mc[i, j] is present in all the graphs of that combination: by the property of prime numbers, this can be done by simply checking whether Mc[i, j] is divisible by Gcomb[l].
4. For each identifier Gcomb[l], identify the cells Mc[i, j] in the consolidated matrix Mc that are divisible by Gcomb[l]. The edges identified by these cells represent a frequently occurring subgraph with support greater than or equal to min support.
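The prime-identifier scheme can be sketched as follows. As a simplification, the consolidated matrix is stored as a dictionary keyed by edge (possible because edge identifiers are unique); the prime table and the example graphs X, Y, Z are our own assumptions for illustration.

```python
from itertools import combinations
from math import ceil, prod

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # one identifier per graph

def matrix_anding(graphs, min_support):
    """Sketch of Matrix-ANDing; each graph is a set of frozenset edges."""
    n = len(graphs)
    ids = PRIMES[:n]

    # Consolidated matrix: edge -> product of the primes of the graphs
    # that contain the edge (cells for absent edges stay implicit).
    cons = {}
    for g, p in zip(graphs, ids):
        for e in g:
            cons[e] = cons.get(e, 1) * p

    # One combination per choice of ceil(n * min_support) graphs; an edge is
    # in every graph of the combination iff its consolidated value is
    # divisible by the product of the combination's identifiers.
    k = ceil(n * min_support)
    frequent = []
    for combo in combinations(ids, k):
        ident = prod(combo)
        sub = {e for e, v in cons.items() if v % ident == 0}
        if sub:
            frequent.append(sub)
    return frequent

X = {frozenset(pair) for pair in [('a', 'b'), ('b', 'c'), ('c', 'd'), ('b', 'e')]}
Y = {frozenset(pair) for pair in [('a', 'b'), ('b', 'c'), ('c', 'd'), ('c', 'e')]}
Z = {frozenset(pair) for pair in [('a', 'b'), ('b', 'c'), ('b', 'e')]}
subs = matrix_anding([X, Y, Z], min_support=2 / 3)
```

With three graphs and min support = 2/3, three combinations are checked, each yielding the edge set common to one pair of graphs.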


Note that each element in the consolidated matrix is a product of prime numbers. Even if only a small number of graphs in the database share an edge, the corresponding element in the consolidated matrix can take a very large value, resulting in overflow. We propose to use compression techniques to avoid such scenarios; bit vectors can also be used in such cases.

Running example: We assign the prime numbers 3, 5, and 7 to the graphs X, Y, and Z from Figure 1a. As the maximal set of nodes in these graphs is {a, b, c, d, e, f, g}, we represent all graphs as 7 × 7 matrices. Figure 3a shows the matrix representations Mx, My, and Mz of the graphs X, Y, and Z, where the matrix of each graph has been multiplied by its identifier prime number. The consolidated matrix built from these matrices is shown in Figure 3b. To identify frequent subgraphs with min support = 2/3 in the given set of three graphs X, Y, and Z, we compute (3 choose 2) = 3 combinations, viz., (X, Y), (X, Z), (Y, Z). The identifiers for the combinations (X, Y), (X, Z), and (Y, Z) are computed as 3 ∗ 5 = 15, 3 ∗ 7 = 21, and 5 ∗ 7 = 35, respectively. With the identifier of (X, Y) equal to 15, the cells in Mc that are divisible by 15 indicate the edges that are present in both graphs X and Y; this in turn provides a frequent subgraph with a support of at least 2/3. Similarly, other frequent subgraphs can be identified by checking for divisibility by 21 and 35.
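The bit-vector alternative mentioned above can be sketched as follows: representing each consolidated-matrix cell as a bitmask (bit t set iff graph t contains the edge) carries the same membership information as the prime products while avoiding overflow. The example graphs are again our hypothetical stand-ins for X, Y, Z.

```python
def edge_masks(graphs):
    """Per-edge bitmask: bit t is set iff graphs[t] contains the edge."""
    masks = {}
    for t, g in enumerate(graphs):
        for e in g:
            masks[e] = masks.get(e, 0) | (1 << t)
    return masks

def edges_common_to(masks, combo_mask):
    """Edges present in every graph selected by combo_mask."""
    return {e for e, m in masks.items() if m & combo_mask == combo_mask}

X = {frozenset(pair) for pair in [('a', 'b'), ('b', 'c'), ('c', 'd'), ('b', 'e')]}
Y = {frozenset(pair) for pair in [('a', 'b'), ('b', 'c'), ('c', 'd'), ('c', 'e')]}
Z = {frozenset(pair) for pair in [('a', 'b'), ('b', 'c'), ('b', 'e')]}
masks = edge_masks([X, Y, Z])

# With X, Y, Z as bits 0, 1, 2, the combination (X, Y) is mask 0b011;
# `m & 0b011 == 0b011` replaces the divisibility-by-15 test.
```

Each mask is at most n bits for n graphs, so memory per cell grows linearly rather than with the product of the prime identifiers.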

Fig. 4. (a,b) Sample snapshots of the real-life mainframe batch processing system. (c) Identiﬁed frequent subgraph.

6 Application of Proposed Technique on a Real-Life Example

In this section, we apply the proposed technique to a real-life mainframe batch-processing system. This system is used at a leading financial service provider for a variety of end-of-day trade-result processing. We have six months of data on the per-day graph of dependences among these jobs. Over this period, we observed a total of 516 different jobs and 72702 different paths. Analyzing the per-day job-precedence graphs brings out the fact that the jobs and their dependencies change over time. Figures 4(a,b) show two sample


snapshots of this changing graph. Figure 4(c) shows one of the frequent subgraphs detected by our algorithm (min support = 0.7). On average, there are about 156 processes and 228 dependence links per graph, whereas the frequently occurring subgraph in this set of graphs consists of 98 processes and 121 dependence links.

Fig. 5. Execution time for experiments with changing (a) number of nodes, (b) average node degree, (c) level of activity, (d) min support

The frequent subgraphs discovered on this system are found to be large: they cover more than 65% of the graph at any time instance. This is typically true for most back-office batch-processing systems, where a large portion of the communication takes place on a regular basis and only a few things change seasonally. The discovered subgraph can then be used as a representative graph, and various analyses can be performed on it a priori in an offline manner. This offline analysis of the representative graph can then be used to quickly answer various online analysis questions about the entire system.

7 Experimental Evaluation

In this section, we present the experiment design for a systematic evaluation of the algorithms proposed in this paper. We simulate systems with temporally changing communication patterns and execute the proposed algorithms to identify frequently occurring communication patterns. We generate various network topologies based on the desired number of nodes and average node degree and model each topology as a graph. For each topology, we generate a set of graphs that represents temporally changing communication patterns. We model the level of change in system activity, c, by controlling the amount of change in links across the graphs. The default values of the experiment parameters are: number of nodes = 10; average node degree = 5; change in level of activity = 0.5; number of graphs = 10; min support = 0.3; REDUCTION FACTOR = 0. Each point plotted on the graphs is an average of the results of 10 runs.
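A setup along the lines described above can be sketched as follows. The paper does not specify its generation model, so the construction below (a random topology of the requested size, a stable core of links, and per-snapshot variation controlled by `change_level`) is entirely our assumption.

```python
import random

def random_topology(num_nodes, avg_degree, rng):
    """Random topology with approximately the requested average node degree."""
    nodes = range(num_nodes)
    pairs = [frozenset((u, v)) for u in nodes for v in nodes if u < v]
    num_edges = min(num_nodes * avg_degree // 2, len(pairs))
    return set(rng.sample(pairs, num_edges))

def activity_sequence(topology, num_graphs, change_level, rng):
    """Temporal sequence of communication graphs over a fixed topology:
    a (1 - change_level) fraction of links is stable across all graphs,
    and the remaining links vary independently per snapshot."""
    edges = list(topology)
    stable = set(rng.sample(edges, int(len(edges) * (1 - change_level))))
    return [stable | {e for e in edges
                      if e not in stable and rng.random() < 0.5}
            for _ in range(num_graphs)]

rng = random.Random(0)                      # fixed seed for repeatability
topo = random_topology(num_nodes=10, avg_degree=5, rng=rng)
graphs = activity_sequence(topo, num_graphs=10, change_level=0.5, rng=rng)
assert all(g <= topo for g in graphs)       # every snapshot respects the topology
```

The stable core guarantees that a frequent subgraph of predictable size exists, which makes it easy to sanity-check the mining algorithms against the generator's parameters.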


Fig. 6. False negatives for experiments with changing (a) number of nodes, (b) average node degree, (c) min support

7.1 Comparative Study of Different Algorithms to Observe the Effect of Various System Properties

We performed experiments to observe the effect of four system properties: number of nodes, average node degree, level of system activity, and desired min support. Figure 5 presents the effect on the time taken by the three algorithms, and Figure 6 presents the effect on the accuracy of the Approx-Apriori algorithm.

Number of nodes: We evaluate the algorithms on graphs with 10 to 18 nodes in Figure 5a and Figure 6a. The execution time of the Apriori algorithms is more sensitive to the number of nodes than that of the Matrix-ANDing algorithm, and the execution time of the Apriori algorithm is significantly larger than that of the Approx-Apriori algorithm. The false negatives of the Approx-Apriori algorithm increase with the size of the graph: as the graph grows, the number of candidates increases, and so does the possibility of missing a frequent subgraph.

Average node degree: We evaluate the algorithms on graphs with node degree 3 to 7 in Figure 5b and Figure 6b. The execution time of the Matrix-ANDing algorithm is independent of the average node degree of the network, because its execution time depends mainly on the matrix size (number of nodes) and only weakly on the node degree or the number of links. The execution time of the Apriori algorithms increases with the node degree because of the increased number of candidate subgraphs; a larger node degree results in a larger candidate space and hence in an increase in false negatives.

Change in the level of activity in the system: Figure 5c shows the effect of the amount of change in the level of activity in the system, c, on the execution time of the algorithms. This parameter has no significant effect on the false negatives. The execution time of the Matrix-ANDing algorithm is independent of the level of activity. The execution time of the Apriori algorithms decreases as the level of activity increases: as c increases, the number of frequent subgraphs decreases, resulting in a decrease in the execution time of the Apriori algorithms.


Desired min support: Figure 5d shows the effect of min support on the execution time of the algorithms, and Figure 6c shows its effect on false negatives. The execution time of the Apriori algorithms decreases as min support increases; a small min support results in a large number of frequent subgraphs and hence a large execution time. The Matrix-ANDing algorithm behaves in an interesting manner: it processes (n choose n · min support) combinations of graphs, and this number is maximum when min support = 0.5. As a result, Matrix-ANDing takes the maximum time to execute when min support = 0.5. The false negatives decrease with increasing min support, since the number of frequent subgraphs decreases.

These experiments bring out the strengths and weaknesses of the three algorithms. (1) It is interesting to note that the execution time of the Apriori and Approx-Apriori algorithms is controlled by graph properties such as the number of nodes, the average node degree, and the level of activity, whereas the behaviour of Matrix-ANDing depends mainly on application parameters such as min support. (2) Both Apriori and Matrix-ANDing provide optimal solutions; for large values of support the Apriori algorithm requires a smaller execution time, and for small values of support Matrix-ANDing requires a smaller execution time. (3) In cases where some false negatives can be tolerated, the Approx-Apriori algorithm is the fastest and gives near-optimal solutions for larger values of support.

7.2 Properties of Identified Frequent Subgraphs

We study the effect of different parameters on two properties of the identified frequent subgraphs: their number and their size. The number and size of the frequent subgraphs identified in a given set of graphs provide insight into the dynamic nature of the system and the size of the critical components in the graph.

Fig. 7. Number of frequent subgraphs identiﬁed in the network with changing (a) number of nodes, (b) average node degree, (c) level of activity, (d) min support

Figure 7a and Figure 7b show the effect of an increase in the number of nodes and the node degree of the graph, respectively. As the number of nodes or the node degree increases, the number and size of the frequent subgraphs increase.


Figure 7d shows the effect of the minimum support, min support, on the number and size of the identified frequent subgraphs. The number and size of the frequent subgraphs decrease as the minimum support increases: the larger the minimum support, the more stringent the requirements for a subgraph to be declared frequent. Figure 7c shows the effect of the amount of change in the level of activity in the system on the number and size of the frequent subgraphs. As the level of activity of the system increases, the number and size of the frequent subgraphs decrease.

8 Conclusion

In this paper, we propose the use of graph-mining techniques to understand the communication pattern within a data-centre, and we present techniques to identify frequently occurring sub-graphs within a temporal sequence of communication graphs. The main contributions of this paper are as follows: (1) we present a novel application of frequent subgraph discovery to extract communication patterns; (2) we present a modification of the existing bottom-up Apriori algorithm to improve efficiency, along with a novel top-down approach for frequent subgraph discovery in communication graphs; (3) we apply the proposed algorithms to a real-world batch system, present a comprehensive experimental evaluation of the techniques, and discuss the effective application areas of the proposed algorithms.

References

1. Bernecker, T., Kriegel, H.-P., Renz, M., Verhein, F., Zuefle, A.: Probabilistic frequent itemset mining in uncertain databases. In: KDD (2009)
2. Chen, C., Yan, X., Zhu, F., Han, J.: gApprox: Mining frequent approximate patterns from a massive network. In: Perner, P. (ed.) ICDM 2007. LNCS (LNAI), vol. 4597, pp. 445–450. Springer, Heidelberg (2007)
3. Faloutsos, C., Sun, J.: Incremental pattern discovery on streams, graphs and tensors. Technical report, CMU (2007)
4. Mannila, H., Toivonen, H., Verkamo, A.: Efficient algorithms for discovering association rules. In: AAAI Workshop on KDD (1994)
5. Inokuchi, A., Washio, T., Motoda, H.: Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning (2003)
6. Kuramochi, M., Karypis, G.: An efficient algorithm for discovering frequent subgraphs. IEEE Transactions on Knowledge and Data Engineering (2004)
7. Valiente, G.: Efficient Algorithms on Trees and Graphs with Unique Node Labels. In: Studies in Computational Intelligence, vol. 52, pp. 137–149. Springer, Heidelberg (2007)

On the Hardness of Topology Inference

H.B. Acharya1 and M.G. Gouda2

1 The University of Texas at Austin, USA, [email protected]
2 The National Science Foundation, USA, [email protected]

Abstract. Many systems require information about the topology of networks on the Internet, for purposes such as management, efficiency, and the testing of new protocols. However, ISPs usually do not share their actual topology maps with outsiders; thus, in order to obtain the topology of a network on the Internet, a system must reconstruct it from publicly observable data. The standard method employs traceroute to obtain paths between nodes; next, a topology is generated such that the observed paths occur in the graph. However, traceroute has the problem that some routers refuse to reveal their addresses and appear as anonymous nodes in traces. Previous research on the problem of topology inference with anonymous nodes has demonstrated that it is at best NP-complete. In this paper, we improve upon this result. In our previous research, we showed that in the special case where nodes may be anonymous in some traces but not in all traces (so all node identifiers are known), there exist trace sets that are generable from multiple topologies. This paper extends our theory of network tracing to the general case (with strictly anonymous nodes), and shows that the problem of computing the network that generated a trace set, given the trace set, has no general solution. The weak version of the problem, which allows an algorithm to output a "small" set of networks, any one of which is the correct one, is also not solvable: any algorithm guaranteed to output the correct topology outputs at least an exponential number of networks. Our results are surprisingly robust: they hold even when the network is known to have exactly two anonymous nodes, and every node as well as every edge in the network is guaranteed to occur in some trace. On the basis of this result, we suggest that exact reconstruction of network topology requires more powerful tools than traceroute.

1 Introduction

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 251–262, 2011. © Springer-Verlag Berlin Heidelberg 2011

Knowledge of the topology of a network is important for many design decisions. For example, the architecture of an overlay network (how it allocates addresses, etc.) may be significantly optimized by knowledge of the distribution and connectivity of the nodes on the underlay network that actually carries the traffic. Several important systems, such as P4P [9] and RMTP [7], utilize information about the topology of the underlay network for optimization as well as management. Furthermore, knowledge of network topology is useful in research; for

252

H.B. Acharya and M.G. Gouda

example, in evaluating the performance of new protocols. Unfortunately, ISPs do not make maps of the true network topology publicly available. Consequently, a considerable amount of research eﬀort has been devoted to the development of systems that reconstruct the topology of networks in the Internet from publicly available data - [10], [6], and [4]. The usual mechanism for generating the topology of a network is by the use of Traceroute [3]. Traceroute is executed on a node, called the source, by specifying the address of a destination node. This execution produces a sequence of identiﬁers, called a trace, corresponding to the route taken by packets traveling from the source to the destination. A trace set T is generated by repeatedly executing Traceroute over a network N , varying the terminal nodes, i.e. the source and destination. If T contains traces that identify every instance when an edge is incident on a node, it is possible to reconstruct the network exactly. However, practical trace sets do not have this property. The most common problems are incomplete coverage, anonymity (where a node can be detected, but will not state its unique identiﬁer, i.e. its address), and aliasing (nodes may have multiple unique identiﬁers). The situation is further complicated by load balancing, which may cause incorrect traces; tools such as Paris Traceroute [8] attempt to correct this problem. In this paper, we deal with the problem of inferring the correct network topology in the presence of anonymous nodes. The problem posed by anonymous nodes in a trace is that a given anonymous node may or may not be identical to any other anonymous node. Clearly, a topology in which these nodes are distinct is not identical to one in which they are merged into a single node. Thus, there may be multiple topologies for the computed network. 
Note that all these candidate topologies can generate the observed trace set; no algorithm can tell, given only the trace set as input, which of them is correct. To address this problem, Yao et al. [10] suggested computing the minimal topology - the topology with the smallest number of anonymous nodes (subject to some constraints: trace preservation and distance preservation) - from which the given trace set is generable. They conclude that the problem of computing a minimal network topology from a given set of traces is NP-complete. Accordingly, most later research in the area, such as [6] and [4], has focused on heuristics for the problem. We attack this problem from a different direction. In our earlier papers [1] and [2], we introduced a theory of network tracing, i.e. reconstruction of network topology from trace sets. In those papers, we made the problem theoretically tractable by assuming that no node is strictly anonymous. In this theory, a node can be irregular, meaning it is anonymous in some traces, but there must exist at least one trace in which it is not anonymous. This simplifying assumption clearly does not hold in practice; in fact, an anonymous node is almost always consistently anonymous, not irregular. (In practical cases, anonymous nodes correspond to routers that do not respond to ping; irregular nodes are routers that drop ping due to excessive load. Clearly, the usual case is for nodes to be consistently anonymous, rather than irregular.) However, the assumption enabled us to develop a theory for the case where the number of nodes in the network is known exactly (it equals the number of unique identifiers). In this paper, we develop our theory of network tracing for networks with strictly anonymous nodes. Our initial assumption was that, as irregular nodes are "partially" anonymous, the hardness results of [1] should carry over to anonymous nodes. To our surprise, this turned out not to be true; in Theorem 3, we show that networks with one anonymous node are completely specified by their trace sets, while networks with one irregular node are not [1]. Consequently, we constructed a completely new proof for network tracing in the presence of strict anonymity, presented in Section 3. We show that, even under the assumption that the minimal topology is correct, the network tracing problem with anonymous nodes is much harder than NP-complete; it is not just intractable, but unsolvable. Even if we weaken the problem and allow an algorithm to return a "small" number of topologies (one of which is correct), the problem remains unsolvable: any algorithm guaranteed to return the correct topology returns a number of topologies that is at least exponential in the total number of nodes (anonymous and non-anonymous). A very surprising fact is that this result holds even if the number of anonymous nodes is restricted to two. We demonstrate how to construct a trace set that is generable from an exponential number of networks with two anonymous nodes, but not generable from any network with one anonymous node or fewer. (It is interesting to note that our results are derived under a network model with multiple strong assumptions - stable and symmetric routing, no aliasing, and complete coverage.
The reason we choose such friendly conditions for our model is to demonstrate that the problem cannot be made easier even by advanced network tracing techniques, such as using Paris Traceroute to detect artifact paths, or inferring missing links [5]. We would like to thank Dr. Stefan Schmid for this observation.) We would like to clarify our claim that the problem of identifying the network from which a trace set was generated, given only the trace set, is unsolvable. Our proof does not involve a reduction from a known uncomputable problem, such as the halting problem. Instead, we demonstrate that there are many minimal networks - an exponential number of them - that could have generated a given trace set; so, given only the trace set, it is impossible to state with certainty that one particular topology (or even one member of a small set of topologies) represents the network from which the trace set was in fact generated. The earlier proof of NP-completeness (by a reduction from graph coloring) provided by Yao et al. holds for constructing a minimal topology, not the minimal topology from which the trace set was generated: it is NP-complete to find a single member of the exponential-sized solution set. Thus, even under the assumption that the true network is minimal in the number of anonymous nodes, trying to reconstruct it is much harder than previously thought. In the next section, we formally define terms such as network, trace, and trace set, so as to develop our mathematical treatment of the problem.

2 Minimal Network Tracing

In this section, we present formal definitions of the terms used in the paper. We also explain our network model and the reasoning underlying our assumptions. Finally, we provide a formal statement of the problem studied.

2.1 Term Definitions

A network N is a connected graph in which nodes have unique identifiers. However, a node may or may not be labeled with its unique identifier. If a node is labeled with its unique identifier, it is non-anonymous; otherwise, it is anonymous. Further, non-anonymous nodes are either terminal or non-terminal. (These terms are used below.)

A trace is a sequence of node identifiers. A trace t is said to be generable from a network N iff the following four conditions are satisfied:

1. t represents a simple path in N.
2. The first and last identifiers in t are the unique identifiers of terminal nodes in N.
3. If a non-anonymous node "a" in N appears in t, then it appears as "a".
4. If an anonymous node in N appears in t, then it appears as "∗i", where i is an integer unique within t, used to distinguish the anonymous occurrences in t from each other.

A trace set T is generable from a network N iff the following conditions are satisfied:

1. Every trace in T is generable from N.
2. For every pair of terminal nodes x, y in N, T has at least one trace between x and y.
3. Every edge in N occurs in at least one trace in T.
4. Every node in N occurs in at least one trace in T.
5. T is consistent: for every two distinct nodes x and y, exactly the same nodes must occur between x and y in every trace in T in which both x and y occur.

We now discuss why we assume the above conditions. The first condition is obviously necessary. The third and fourth conditions are also clearly necessary, as we are interested in the problem of node anonymity, not incomplete coverage. The second and fifth conditions, however, are non-trivial. In a real network, routing may be inconsistent or asymmetric, so the fifth condition may or may not hold; furthermore, it is possible, using tools like source routing and public traceroute pages, to ensure that a trace set contains traces between every possible pair of terminals. As our primary results are negative, we show their robustness by assuming the worst case for us: we develop our theory assuming the best possible conditions for the inference algorithm, and prove that the results are still valid.
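As an illustration, condition 5 can be checked mechanically. The sketch below uses a representation of our own choosing (not from the paper): a trace is a list of identifier strings, with anonymous occurrences written as '*1', '*2', and so on. Since anonymous identifiers carry no identity across traces, two subpaths agree when they have the same length and matching non-anonymous identifiers position by position.

```python
def subpath(trace, x, y):
    """Return the node sequence strictly between x and y in a trace,
    oriented from x to y, or None if either endpoint is absent."""
    if x not in trace or y not in trace:
        return None
    i, j = trace.index(x), trace.index(y)
    if i < j:
        return trace[i + 1:j]
    return trace[j + 1:i][::-1]

def is_anon(node):
    return node.startswith('*')

def compatible(p, q):
    """Two subpaths agree if they have equal length and matching
    non-anonymous identifiers at each position; anonymous occurrences
    carry no identity across traces."""
    return (len(p) == len(q) and
            all((is_anon(a) and is_anon(b)) or a == b for a, b in zip(p, q)))

def is_consistent(traces):
    """Condition 5: every pair of non-anonymous nodes must be joined by
    the same node sequence in every trace in which both occur."""
    names = {n for t in traces for n in t if not is_anon(n)}
    for x in names:
        for y in names:
            if x >= y:
                continue
            seen = [s for s in (subpath(t, x, y) for t in traces)
                    if s is not None]
            if any(not compatible(seen[0], s) for s in seen[1:]):
                return False
    return True
```

For example, the trace set {(a, ∗1, b1), (a, ∗2, b2), (b1, ∗3, a, ∗4, b2)} passes the check, while two traces that place different named nodes between the same endpoints fail it.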


In our earlier work [1], [2], we developed our theory under another strong condition: no node was strictly anonymous. For a trace set to be generable from a network, we required that the unique identifier of every node in the network appear in at least one trace. However, on further study we learned that routers in a network appear anonymous because they are configured either to never send ICMP responses, or to use the destination addresses of the traceroute packets instead of their real addresses [10]. Thus, if a node is anonymous in a single trace, it is usually anonymous in all traces in a trace set. This fact reduces our earlier study of network tracing to a theoretical exercise, as its assumptions clearly cannot be satisfied. Accordingly, in this paper, we have discarded this condition and updated our theory of network tracing to include networks with anonymous nodes.

The introduction of strictly anonymous nodes leads to a complication in our theory: we no longer have all unique identifiers, and cannot be sure of the total number of nodes in the network. Hence we adopt the same approach as Yao et al. [10] and attempt to reconstruct a topology with the smallest possible number of anonymous nodes. Accordingly, we adopt a new definition. A minimal network N from which trace set T is generable is a network with the following properties:

1. T is generable from N.
2. T is not generable from any network N′ that has fewer nodes than N.

Note that, if there are multiple minimal networks from which a trace set T is generable, they all have the same number of nodes. Further, as all such networks contain every non-anonymous node seen in T, it follows that all minimal networks from which a trace set T is generable also have the same number of anonymous nodes.

2.2 The Minimal Network Tracing Problem

We can now state a formal definition of the problem studied in this paper. The minimal network tracing problem can be stated as follows: "Design an algorithm that takes as input a trace set T that is generable from a network, and produces a network N such that T is generable from N and, for any network N′ ≠ N, at least one of the following conditions holds:

1. T is not generable from N′.
2. N′ has more anonymous nodes than N."

The weak minimal network tracing problem can be stated as follows: "Design an algorithm that takes as input a trace set T that is generable from a network, and produces a small set S = {N1, ..., Nk} of minimal networks such that T is generable from each network in this set and, for any network N′ ∉ S, at least one of the following conditions holds:

1. T is not generable from N′.
2. N′ has more anonymous nodes than any member of S."


The minimal network tracing problem is clearly a special case of the weak minimal network tracing problem, where we consider only singleton sets to be small. In Section 3, we show that the weak minimal network tracing problem is unsolvable in the presence of anonymous nodes, even if we consider only sets of exponential size to be “not small”; of course, this means that the minimal network tracing problem is also unsolvable.

3 The Hardness of Minimal Network Tracing

In this section, we begin by constructing a very simple trace set containing only one trace, T0,0 = {(a, ∗1, b1)}, which of course corresponds to the network in Figure 1.

Fig. 1. Minimal topology for T0,0

We now define two operations to grow this network, Op1 and Op2. Op1 introduces a new non-anonymous node and a new anonymous node; the non-anonymous nodes introduced by Op1 are b-nodes. Op2 introduces a non-anonymous node, but may or may not introduce an anonymous node; if we consider only minimal networks, then Op2 introduces only non-anonymous nodes.

To execute Op1, we introduce a new b-node (say bi) which is connected to a through a new anonymous node ∗i. We now explain how we ensure that ∗i is a new anonymous node. Note that our assumption of consistent routing ensures that there are no loops in traces. Thus, we can ensure that ∗i is a "new" anonymous node (and not an "old", i.e. previously-seen, anonymous node) by making it occur in a trace with every old anonymous node. To achieve this, we add traces from bi to each pre-existing b-node bj. These traces are of the form (bi, ∗ii, a, ∗jj, bj). We then use consistent routing to show that ∗i = ∗ii and ∗j = ∗jj; and since ∗ii and ∗jj occur in the same trace, which is a simple path, we obtain ∗i ≠ ∗j, as intended.

We denote the trace set produced by applying Op1 k times to T0,0 by Tk,0. For example, after one application of Op1 to T0,0, we obtain trace set T1,0:

T1,0 = {(a, ∗1, b1), (a, ∗2, b2), (b1, ∗3, a, ∗4, b2)}

As we assume consistent routing, ∗1 = ∗3 and ∗2 = ∗4. Furthermore, as ∗3 and ∗4 occur in the same trace, ∗3 ≠ ∗4, and hence ∗1 ≠ ∗2.
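For concreteness, the Op1 construction can be sketched as follows. The encoding is ours, not the paper's: each trace is a list of identifier strings, and anonymous occurrences are numbered globally for readability (the orientation of the joining traces is immaterial under symmetric routing).

```python
def build_Tk0(k):
    """Apply Op1 k times to T0,0 = {(a, *1, b1)}: each application adds
    a new b-node behind a fresh anonymous node, plus one trace placing
    the fresh anonymous identifier in a common trace with every older
    anonymous node (via each older b-node)."""
    counter = [1]
    def fresh():
        s = '*%d' % counter[0]
        counter[0] += 1
        return s
    traces = [['a', fresh(), 'b1']]
    for i in range(2, k + 2):        # b-nodes b2 .. b_{k+1}
        bi = 'b%d' % i
        traces.append(['a', fresh(), bi])
        for j in range(1, i):        # join bi with each older b-node bj
            traces.append([bi, fresh(), 'a', fresh(), 'b%d' % j])
    return traces
```

For k = 1 this yields the three traces of T1,0 above; in general Tk,0 contains (k + 1) source-to-b traces plus k(k + 1)/2 joining traces.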


Fig. 2. Minimal topology for T1,0

There is exactly one network from which this trace set is generable; we present it in Figure 2. We now define operation Op2. Op2 introduces a new non-anonymous node (ci). We add traces such that ci is connected to a through an anonymous node, and is directly connected to all b- and c-nodes. We denote the trace set produced by applying Op2 l times to Tk,0 by Tk,l. For example, one application of Op2 to the trace set T1,0 produces the trace set T1,1 given below.

T1,1 = {(a, ∗1, b1), (a, ∗2, b2), (b1, ∗3, a, ∗4, b2), (a, ∗5, c1), (b1, c1), (b2, c1)}

From Figure 3 we see that three topologies are possible: (a) ∗5 is a new node, with ∗1 ≠ ∗5 and ∗2 ≠ ∗5; (b) ∗1 = ∗5; (c) ∗2 = ∗5. But network N1,1.1 is not minimal; it has one more anonymous node than the networks N1,1.2 and N1,1.3. Hence, from now on we discard such topologies and consider only the cases where the anonymous nodes introduced by Op2 are "old" (previously-seen) anonymous nodes.
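The three candidate identifications of ∗5 can be enumerated explicitly. A sketch (the edge-set encoding is ours):

```python
def topologies_for_T11():
    """Enumerate candidate topologies for T1,1 by choosing what the Op2
    identifier *5 denotes: a brand-new node, *1, or *2."""
    # Edges forced by every trace other than (a, *5, c1).
    base = {('a', '*1'), ('*1', 'b1'), ('a', '*2'), ('*2', 'b2'),
            ('b1', 'c1'), ('b2', 'c1')}
    out = {}
    for choice in ('new', '*1', '*2'):
        star5 = '*5' if choice == 'new' else choice
        out[choice] = base | {('a', star5), (star5, 'c1')}
    return out

def anonymous_nodes(edges):
    """Anonymous nodes of a topology, identified by their '*' prefix."""
    return {v for e in edges for v in e if v.startswith('*')}
```

The 'new' choice yields three anonymous nodes (the non-minimal N1,1.1), while the other two choices each yield the two anonymous nodes of the minimal networks N1,1.2 and N1,1.3.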

Fig. 3. Topologies for T1,1: (a) Network N1,1.1, (b) Network N1,1.2, (c) Network N1,1.3


We are now in a position to prove the following theorem.

Theorem 1. For every pair of natural numbers (k, l), there exists a trace set Tk,l that is generable from (k + 1)^l minimal networks, and the number of nodes in every such network is 2k + l + 3.

Proof. Consider the following construction. Starting with T0,0, apply Op1 k times in succession. This constructs the trace set Tk,0, which has k + 1 distinct anonymous nodes. Finally, apply Op2 l times in succession to get Tk,l.

We now show that Op2 indeed has the properties claimed. Every time Op2 is applied, it introduces an anonymous identifier. This identifier can correspond to a new node or to a previously-seen anonymous node; as we are considering only minimal networks, it must correspond to a previously-seen anonymous node. There are k + 1 distinct anonymous nodes, and the newly-introduced identifier can correspond to any one of them; there is no information in the trace set to decide which. Furthermore, each of these nodes is distinguishable, as each is connected to a different (non-anonymous) b-node. In other words, each choice produces a distinct topology from which the constructed trace set is generable. Hence the number of minimal networks from which the trace set Tk,l is generable is (k + 1)^l.

Further, there are 3 nodes to begin with. Every execution of Op1 adds two new nodes (a b-node and a new ∗-node), and every execution of Op2 adds one new node (a c-node). Hence, if n is the total number of nodes in a minimal network, n = 3 + 2k + l.

We can see that n grows linearly with k and l, while the number of candidate networks from which Tk,l is generable grows as (k + 1)^l. So, for example, if we take k = l = (n − 3)/3, the number of candidate networks is (n/3)^((n/3) − 1), which is obviously exponential. In fact, this expression is so strongly exponential that it remains exponential even in the special case where we restrict the number of anonymous nodes to exactly two.
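The counting above is easy to check mechanically; a small sketch of the two formulas, n = 3 + 2k + l and (k + 1)^l:

```python
def minimal_network_count(k, l):
    """Number of minimal networks from which T_{k,l} is generable (Theorem 1)."""
    return (k + 1) ** l

def node_count(k, l):
    """Total number of nodes in each such minimal network."""
    return 3 + 2 * k + l
```

In the balanced case k = l = (n − 3)/3 the count is (n/3)^((n/3) − 1); restricting Op1 to a single application (k = 1, i.e. two anonymous nodes) still gives 2^(n−5) candidates.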
Note that, if we execute Op1 exactly once and Op2 l times, then by the formula above the number of minimal networks is 2^l = 2^(n−5), which is exponential in n. We have proved the following theorem:

Theorem 2. For any n ≥ 6, there exists a trace set T such that:
(a) n is the number of nodes in a minimal network from which T is generable.
(b) Every such minimal network has exactly two anonymous nodes.
(c) The number of such minimal networks is 2^(n−5), i.e. exponential in n.

As an example, Figure 4 shows all 2^3 = 8 possible networks from which the trace set T1,3 is generable. We are now in a position to state our result about the minimal network tracing problem.


Theorem 3. Both the minimal network tracing problem and the weak minimal network tracing problem are unsolvable in general, but solvable in the case where the minimal network N, from which trace set T is generable, has exactly one anonymous node.

Proof. Consider any algorithm that takes a trace set and returns the correct network. If this algorithm is given as input one of the trace sets constructed in Theorems 1 and 2, it must return an exponentially large number of networks in the worst case. (If it does not return all networks from which the trace set is generable, it may fail to return the topology of the actual network from which the trace set was generated.) In other words, no algorithm that always returns a "small" number of networks can be guaranteed to have computed the correct network from the trace set; the weak minimal network tracing problem is unsolvable in general. As the minimal network tracing problem is a stricter version of this problem, it is also unsolvable.

The case where the minimal network has only one anonymous node is special. If there is only one anonymous node, there is no need to distinguish between anonymous nodes. We assign the node some identifier (say x) that is not the unique identifier of any non-anonymous node, and replace every anonymous identifier ∗i in the trace set by x. The problem now reduces to finding a network from a trace set with no anonymous (or irregular) nodes, which is of course solvable [1]. As the minimal network tracing problem is solvable in this case, the weak minimal network tracing problem (which is easier) is solvable as well.
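The reduction used in the one-anonymous-node case is a simple relabeling; a sketch (the list-of-strings representation and the fresh identifier 'x' are our choices, as in the paper's proof):

```python
def relabel_single_anonymous(traces, fresh='x'):
    """When the minimal network is known to have exactly one anonymous
    node, every anonymous occurrence *1, *2, ... denotes that same node,
    so all of them may be replaced by a single fresh identifier."""
    names = {n for t in traces for n in t if not n.startswith('*')}
    if fresh in names:
        raise ValueError('fresh identifier collides with a real one')
    return [[fresh if n.startswith('*') else n for n in t] for t in traces]
```

The relabeled trace set has no anonymous identifiers, so the machinery of [1] applies directly.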

4 Unsolvable, or NP-Complete?

In Section 3, we demonstrated the hardness of the minimal network tracing problem in the presence of anonymous nodes, and concluded that both the strict and the weak versions of the problem are unsolvable in general. It is natural to ask how we can claim a problem to be unsolvable without a reduction from the halting problem or some other uncomputable problem. Also, it seems at first glance that our findings conflict with the earlier results of Yao et al., who found the problem of minimal topology inference to be NP-complete; an NP-complete problem lies in the intersection of NP-hard and NP, so it lies in NP and is definitely not unsolvable! In this section, we answer these questions and resolve this apparent conflict.

The problem we study is whether it is possible to identify the true network from which a given trace set T was generated in practice - in other words, to find a single network N such that T is generable from N and only from N. As there is not enough information in T to uniquely identify N (because T is generable from many minimal networks), the minimal network tracing problem is not solvable. In fact, even the weak minimal network tracing problem is not solvable, as T only provides enough information for us to identify N as one member of an exponential-sized set (which is clearly not a small set). Thus, our statement that the problem is not solvable does not depend on proving uncomputability, but on the fact that no algorithm can identify the correct solution out of a large space of solutions, all of which are equally good.

Fig. 4. Minimal topologies for T1,3 (with two anonymous nodes): the eight networks N1,3.1 through N1,3.8

We now consider how our work relates to the proof of Yao et al. [10]. The resolution of our apparent conflict is that Yao et al. claim NP-completeness for the decision problem TOP-INF-DEC, which asks: "Does there exist a network, from which trace set T is generable, that has at most k anonymous nodes?" This decision problem is equivalent to the problem of exhibiting any one network from which T is generable with k or fewer anonymous nodes.


Yao et al. implicitly assume that the space of networks from which a trace set T is generable is a search space: identifying the smallest network in this space will yield the true network from which T was generated in practice. This is simply not true - the number of minimal networks from which T is generable is exponentially large, and as these are all minimal networks we cannot search for an optimum among them (they are all equally good solutions; in fact, they satisfy a stronger equivalence condition than having the same number of nodes - our construction produces networks with both the same number of nodes and the same number of edges). Finding one minimal network N from which T is generable does not guarantee that N is actually the network from which T was generated! We say nothing about the difficulty of finding some minimal network from which a trace set is generable, without regard to whether it is actually the network that generated the trace set. Hence, there is no conflict between our results and the results in [10].

5 Conclusion

In our previous work, we derived a theory of network tracing under the assumption that no node is consistently anonymous. As we later learned that this assumption is impossible to satisfy in practice, we updated our theory to include networks with strictly anonymous nodes, and we present the updated theory in this paper. As the introduction of irregularity - a limited form of anonymity - made the problem hard in our previous study, we expected that it would be even harder under strict anonymity. To our great surprise, we found a counterexample: networks with a single anonymous node are completely specified by their trace sets (Theorem 3), while networks with a single irregular node are not (Figure 1 of [1]). We find this example very interesting, as it disproves the intuition that anonymous nodes should cause more trouble to a network tracing algorithm than irregular (partly anonymous) nodes. In the general case, however, we prove in this paper that both the strict and the weak versions of the minimal network tracing problem are unsolvable: no algorithm can do better than reporting that the required network is a member of an exponentially large set of networks. This result holds even when the number of anonymous nodes is restricted to two. Identifying the particular classes of networks with the property that any such network can be uniquely identified from any trace set generable from it (even if the network contains anonymous nodes) is an open problem that we will attack in future research.

References

1. Acharya, H.B., Gouda, M.G.: A theory of network tracing. In: 11th International Symposium on Stabilization, Safety, and Security of Distributed Systems (November 2009)


2. Acharya, H.B., Gouda, M.G.: The weak network tracing problem. In: International Conference on Distributed Computing and Networking (January 2010)
3. Cheswick, B., Burch, H., Branigan, S.: Mapping and visualizing the Internet. In: Proceedings of the USENIX Annual Technical Conference, pp. 1–12. USENIX Association, Berkeley (2000)
4. Gunes, M., Sarac, K.: Resolving anonymous routers in Internet topology measurement studies. In: INFOCOM 2008: The 27th Conference on Computer Communications, pp. 1076–1084. IEEE, Los Alamitos (April 2008)
5. Gunes, M.H., Sarac, K.: Inferring subnets in router-level topology collection studies. In: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pp. 203–208. ACM, New York (2007)
6. Jin, X., Yiu, W.-P.K., Chan, S.-H.G., Wang, Y.: Network topology inference based on end-to-end measurements. IEEE Journal on Selected Areas in Communications 24(12), 2182–2195 (2006)
7. Paul, S., Sabnani, K.K., Lin, J.C., Bhattacharyya, S.: Reliable multicast transport protocol (RMTP) (1996)
8. Viger, F., Augustin, B., Cuvellier, X., Magnien, C., Latapy, M., Friedman, T., Teixeira, R.: Detection, understanding, and prevention of traceroute measurement artifacts. Computer Networks 52(5), 998–1018 (2008)
9. Xie, H., Yang, Y.R., Krishnamurthy, A., Liu, Y.G., Silberschatz, A.: P4P: provider portal for applications. SIGCOMM Computer Communications Review 38(4), 351–362 (2008)
10. Yao, B., Viswanathan, R., Chang, F., Waddington, D.: Topology inference in the presence of anonymous routers. In: Twenty-Second Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM 2003), vol. 1, pp. 353–363. IEEE, Los Alamitos (2003)

An Algorithm for Traffic Grooming in WDM Mesh Networks Using Dynamic Path Selection Strategy

Sukanta Bhattacharya (1), Tanmay De (1), and Ajit Pal (2)

(1) Department of Computer Science and Engineering, NIT Durgapur, India
(2) Department of Computer Science and Engineering, IIT Kharagpur, India

Abstract. In wavelength-division multiplexing (WDM) optical networks, the bandwidth request of a traffic stream is generally much lower than the capacity of a lightpath. Therefore, to utilize network resources (such as bandwidth and transceivers) effectively, several low-speed traffic streams can be groomed, or multiplexed, into high-speed lightpaths, improving network throughput and reducing network cost. The traffic grooming problem for a static demand is considered as an optimization problem. In this work, we propose a traffic grooming algorithm for wavelength-routed mesh networks that maximizes network throughput and reduces the number of transceivers used; we also propose a dynamic path selection strategy for routing requests, which selects paths so that the load gets distributed throughout the network. The efficiency of our approach has been established through extensive simulation on different sets of traffic demands with different bandwidth granularities over different network topologies, and the approach has been compared with an existing algorithm.

Keywords: Lightpath, WDM, Transceiver, Grooming.

1 Introduction

Wavelength division multiplexing (WDM) technology is now widely used to expand the capacity of optical networks. It provides vast bandwidth over optical fiber by allowing simultaneous transmission of traffic on many non-overlapping channels (wavelengths). In a wavelength-routed optical network, a lightpath may be established to carry traffic from a source node to a destination node. A lightpath is established by selecting a path of physical links between the source and destination nodes, and selecting a particular wavelength on each of these links. A lightpath must use the same wavelength on all of its links if there is no wavelength converter at intermediate nodes; this restriction is known as the wavelength continuity constraint [3], [5].

An essential functionality of WDM networks, referred to as traffic grooming, is to aggregate low-speed traffic connections onto high-speed wavelength channels in a resource-efficient way - that is, to maximize the network throughput when the resources are given, or to minimize the resource consumption when the traffic requests to be satisfied are given. Efficient traffic grooming techniques (algorithms) can reduce network cost by reducing the number of transceivers, increase throughput, and accommodate more low-speed traffic streams on a single lightpath.

The work in [4] and [6] investigates the static traffic grooming problem with the objective of maximizing network throughput. Zhu and Mukherjee [6] investigate the traffic grooming problem in WDM mesh networks with the objective of improving network throughput; they present an integer linear programming (ILP) formulation of the traffic grooming problem and propose two heuristics, Maximizing Single-Hop Traffic (MST) and Maximizing Resource Utilization (MRU), to solve the GRWA problem. Subsequently, a global approach for designing reliable WDM networks and grooming the traffic was presented in [1]; this, together with the work on traffic grooming, routing, and wavelength assignment in optical WDM mesh networks based on clique partitioning [2], motivated us to use the concept of reducing the total network cost to solve the GRWA problem, as presented in the following sections.

The grooming problem consists of two interconnected parts: (a) designing lightpaths, which includes specifying the physical route of each path; and (b) assigning each packet stream to a sequence of lightpaths. The work proposed in this paper addresses the static GRWA problem in WDM mesh networks with a limited number of wavelengths and transceivers; the proposed approach allows single-hop and multi-hop grooming, similar to [6]. The objective of this work is to maximize the network throughput in terms of total successfully routed traffic and to reduce the number of transceivers used. The performance of our proposed approach is evaluated through extensive simulation on different sets of traffic demands with different granularities over different network topologies.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 263–268, 2011. © Springer-Verlag Berlin Heidelberg 2011
The results show that the proposed approach performs better than the existing traffic grooming algorithm, Maximizing Single-Hop Traffic (MST). The problem formulation is presented in Section 2. Section 3 gives a detailed description of the proposed algorithm. Section 4 contains experimental results and a comparison with previous work. Finally, Section 5 concludes the paper.

2 General Problem Statement

Given a network topology, which is a directed connected graph G(V, E), where V and E are the sets of optical nodes and bi-directional links (edges) of the network, respectively; a number of transceivers at each node; a number of wavelengths on each fiber; the capacity of each wavelength; and a set of connection requests with different bandwidth granularities, our objective is to set up lightpaths and multiplex low-speed connection requests on the same lightpath such that the network throughput, in terms of total successfully routed low-speed traffic, is maximized, and the number of transceivers used to satisfy the requests is minimized. Since the traffic grooming problem is NP-complete, an efficient heuristic approach is appropriate. In the next section we propose a heuristic approach to solve the traffic grooming problem.

An Algorithm for Traﬃc Grooming in WDM Mesh Networks

3 Proposed Approach

In this section, we propose the Traffic Grooming 2 (TG2) algorithm, based on a dynamic path selection strategy for the GRWA problem. Our approach has two steps, similar to [6]. In the first step, we construct a virtual topology, trying to satisfy the given requests (in decreasing order of request size) in a single hop using a single lightpath. In the second step, we try to satisfy the leftover blocked requests through multi-hop routing (in decreasing order of request size), running on the spare capacity of the virtual topology created in the first step. The leftover requests are then sorted and we try to satisfy them one by one with a single hop. As soon as one request is satisfied by a single hop, we try to satisfy all leftover requests by multi-hop, since some requests may now be satisfied using the newly created lightpath; this reduces the number of transceivers used. The process is repeated on the leftover requests until all resources are exhausted, all requests are satisfied, or no leftover request can be satisfied by a single hop.

3.1 Alternate Path Selection Strategy

In this work we use a variant of adaptive routing. Each time a request between an SD pair is to be satisfied, we compute all possible paths between the source (S) and destination (D) and calculate the cost of each path using the cost function

C = (α/W) + βL    (1)

where α and β are constants and C, W, and L are the cost of the path, the total number of common free wavelengths on the physical path, and the length of the path (i.e., the distance between the SD pair), respectively. The first term dominates the second in determining the cost. This is done so that the traffic load on the network gets distributed and no single path becomes congested.
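The cost function of equation (1) is straightforward to compute. The sketch below uses the α = 10, β = 0.3 values reported in the experimental section as defaults; the function names are our own.

```python
def path_cost(common_free_wavelengths, length, alpha=10.0, beta=0.3):
    """Equation (1): C = alpha/W + beta*L.

    W = number of wavelengths free on every link of the path,
    L = length of the path. The alpha/W term dominates, steering
    requests toward paths with more spare capacity."""
    return alpha / common_free_wavelengths + beta * length

def min_cost_path(candidates):
    """Pick the minimum-cost candidate, each given as a (W, L) pair."""
    return min(candidates, key=lambda wl: path_cost(*wl))
```

Note that a longer path with much more free capacity can beat a shorter, congested one, which is exactly the load-distribution behavior described above.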

3.2 Traffic Grooming Algorithm

S. Bhattacharya, T. De, and A. Pal

The proposed algorithm selects the minimum-cost path dynamically, such that the traffic load on the network gets distributed and no particular path becomes congested.

Algorithm TG2

1. Sort the requests in descending order of OC-x.
2. Select the sorted requests one by one, build the virtual topology, and satisfy single-hop traffic:
   (a) Find all possible paths from the source (S) to the destination (D) for a request R.
   (b) Calculate the cost of each path and find the minimum-cost path for the SD pair using equation (1).
   (c) Find the lowest-index wavelength (w) among the common free wavelengths on the edges of the minimum-cost path.
   (d) Update the virtual topology by assigning a lightpath from node S to node D using wavelength w.
   (e) Update the request matrix: if the request is fully satisfied, set the SD pair request to zero (0); otherwise, set it to the leftover (unsatisfied) request.
   (f) Update the wavelength status of the physical edges on the lightpath.
   (g) Update the transceiver-used status at the nodes: if the node is the starting node, reduce the transmitter count by 1; if the node is the ending node, reduce the receiver count by 1.
3. Repeat Step 2 for all SD pair requests.
4. Sort the blocked requests in descending order.
5. Select the sorted requests one by one and try to satisfy them using multiple lightpaths on the virtual topology (VT) created in Step 2:
   (a) Update the request matrix: if the request is fully satisfied, set the SD pair request to zero (0); otherwise, set it to the leftover (unsatisfied) request.
   (b) Update the wavelength status of the physical edges on the lightpaths.
   (c) Update the virtual topology.
6. Sort the blocked requests again in descending order.
7. Try to satisfy the requests one by one with a single hop, in descending order, until one of the requests is satisfied.
8. Update the system as described in Steps 2(d) to 2(g).
9. When one request is satisfied with a single hop, try to satisfy all remaining requests with multi-hop.
10. Update the system as described in Steps 5(a) to 5(c).
11. Repeat Steps 6 to 10 until all resources are exhausted, all requests are satisfied, or no leftover request can be satisfied by a single hop.

End Algorithm TG2
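Step 2(c) is a first-fit wavelength assignment over the links of the chosen path. A minimal sketch (the data layout is our assumption, not from the paper):

```python
def lowest_common_free_wavelength(path_edges, free):
    """Step 2(c): lowest-index wavelength that is free on every edge of the
    minimum-cost path. free[e] is the set of free wavelength indices on
    edge e; returns None when the path has no common free wavelength."""
    common = None
    for e in path_edges:
        common = set(free[e]) if common is None else common & free[e]
    return min(common) if common else None
```

When the function returns None, the request is blocked on this path and TG2 falls through to the multi-hop phase.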

4 Experimental Results

We have evaluated the performance of our proposed heuristic TG2 for the GRWA problem through simulation and compared the results with the well-known MST algorithm [6]. We conducted experiments on different network topologies, but due to page limitations we present results only for the 14-node NSFNET shown in Fig. 1. The values of α and β in equation (1) are taken as 10 and 0.3, respectively. We assume that each physical link is bidirectional and of the same length. During simulation we assume that the capacity of each wavelength is OC-48; the allowed traffic bandwidth requests are OC-1, OC-3, and OC-12, generated randomly.

[Fig. 1: 14-node NSFNET topology (nodes 1–14). Fig. 2 plot: Throughput (OC−1 unit) vs. Number of Requests (OC−1 unit); curves MST and TG2; TxRx = 6, W = 7.]

Fig. 1. Node architectures used in simulation

Fig. 2. Relationship between throughput and requested bandwidth

[Fig. 3 plot: Throughput (OC−1 unit) vs. Number of Wavelengths per Link; curves MST and TG2; Req = OC−3000, TxRx = 7.]

Fig. 3. Relationship between throughput and number of wavelengths per ﬁber link

[Fig. 4 plot: Throughput (OC−1 unit) vs. Number of Transceivers per Node; curves MST and TG2; Req = OC−3000, W = 7.]

Fig. 4. Relationship between throughput and number of transceivers per node

Figure 2 shows the relationship between the network throughput and the total requested bandwidth for the 14-node network (Fig. 1). Initially the throughput of the two algorithms is similar, but as the load grows TG2 returns a better throughput than MST.


The relationship between the network throughput and the number of wavelengths per link for the two algorithms is shown in Fig. 3. We observe that the proposed algorithm TG2 provides a higher network throughput than the existing MST algorithm. The throughput increases with the number of wavelengths but, owing to the transceiver constraint, there is no significant change in throughput after the number of wavelengths reaches a certain limit, for both algorithms. The relationship between the network throughput and the number of transceivers per node for the proposed and existing algorithms is shown in Fig. 4. We observe that initially the throughput increases with the number of transceivers, but there is no significant change as the number of transceivers grows beyond a certain value, because the capacity of the wavelengths is exhausted. However, the proposed TG2 algorithm performs better in terms of network throughput than the existing MST algorithm.

5 Conclusions

This study was aimed at the traffic grooming problem in a WDM mesh network. We have studied the problem of static single-hop and multi-hop GRWA with the objective of maximizing the network throughput for wavelength-routed mesh networks. We have proposed an algorithm, TG2, using the concept of single-hop and multi-hop grooming for the static GRWA problem [6]. The performance of the proposed algorithm is evaluated through extensive simulation on different sets of traffic demands with different bandwidth granularities under different network topologies.

References

1. Bahri, A., Chamberland, S.: A global approach for designing reliable WDM networks and grooming the traffic. Computers & Operations Research 35(12), 3822–3833 (2008)
2. De, T., Pal, A., Sengupta, I.: Traffic grooming, routing, and wavelength assignment in optical WDM mesh networks based on clique partitioning. Photonic Network Communications (February 2010)
3. Mohan, G., Murthy, C.S.: WDM Optical Networks: Concepts, Design and Algorithms. Prentice Hall, India (2001)
4. Yoon, Y., Lee, T., Chung, M., Choo, H.: Traffic grooming based on shortest path in optical WDM mesh networks. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3516, pp. 1120–1124. Springer, Heidelberg (2005)
5. Zang, H., Jue, J., Mukherjee, B.: A review of routing and wavelength assignment approaches for wavelength-routed optical WDM networks. SPIE Optical Networks Magazine 1(1), 47–60 (2000)
6. Zhu, K., Mukherjee, B.: Traffic grooming in an optical WDM mesh network. IEEE Journal on Selected Areas in Communications 20(1), 122–133 (2002)

Analysis of a Simple Randomized Protocol to Establish Communication in Bounded Degree Sensor Networks

Bala Kalyanasundaram and Mahe Velauthapillai

Department of Computer Science, Georgetown University, Washington DC, USA
[email protected], [email protected]

Abstract. Co-operative computations in a network of sensor nodes rely on established, interference-free, and repetitive communication between adjacent sensors. This paper analyzes a simple randomized and distributed protocol to establish a periodic communication schedule S in which each sensor broadcasts once to communicate to all of its neighbors during each period of S. The result holds for any bounded degree network. The existence of such randomized protocols is not new. Our protocol reduces the number of random bits and the number of transmissions by individual sensors from Θ(log^2 n) to O(log n), where n is the number of sensor nodes. These reductions conserve power, which is a critical resource. Both protocols assume an upper bound on the number of nodes n and on the maximum number of neighbors B. For a small multiplicative (i.e., a factor ω(1)) increase in the resources, our algorithm can operate without an upper bound on B.

1 Introduction

A wireless sensor network (WSN) is a network of devices, called sensor nodes, that communicate wirelessly. WSNs are used in many applications, including environment monitoring, traffic management, and wildlife monitoring [1,2,4,7,8,9,5]. Depending on the application, a WSN can consist of anywhere from a few nodes to millions of nodes. The goal of the network is to monitor the environment continuously in order to detect and/or react to certain predefined events or patterns. When an application requires millions of nodes, individually programming each node is impractical. Moreover, when the nodes are deployed, it is often difficult to control the exact location of each sensor. Even if we succeed in spreading the sensors evenly, it is inevitable that some nodes will fail, and the resulting topology is no longer uniform. It is therefore reasonable to assume that the nodes know an upper bound on the number of nodes in the network, but nothing else about it. This paper analyzes the performance of a randomized and distributed protocol that establishes communication among neighbors in a bounded degree network of sensors. We assume that B is a constant.

Supported in part by Craves Family Professorship. Supported in part by McBride Chair.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 269–280, 2011.
© Springer-Verlag Berlin Heidelberg 2011


B. Kalyanasundaram and M. Velauthapillai

The following wireless transmission model is considered in this paper. A node cannot transmit and receive information simultaneously. Each node has a transmission range r, and any node within that range can receive information from this node. Each node A also has an interference range r+: a transmission from a node C to any node B within range r+ of A can interfere with the transmission from node A. In general r+ ≥ r. For ease of presentation we assume r+ = r; however, the proofs extend easily to r+ ≥ r as long as the number of nodes in the interference range is big-oh of the number of nodes in the transmission range.

In the literature on wireless/sensor networks, there are many different media access control (MAC) protocols. In general, the protocols fall into three categories [6]: fixed assignment, demand assignment, and random access. The protocol we present is a fixed assignment media access protocol; however, the protocol that we use to derive it is a random access protocol, in which time is divided into slots of equal length. Intuitively, the sensors first use randomization to establish a schedule for pair-wise communication with their neighbors. The sensors then run the second phase of the protocol, in which the schedule is compressed so that each sensor broadcasts once to communicate to all of its neighbors. After the compression, the resultant protocol is a fixed assignment protocol that can be used by the sensor network to communicate and detect patterns. We consider a uniform transmission range for the sensors. One can view the network of sensors as a graph where each node is a sensor and there is an edge between two nodes if they are within transmission range. The resultant graph is often called a disk graph (DG), and a unit-disk graph (UDG) when the transmission range is the same for all sensors.
The problem addressed in this paper can be thought of as finding an interference-free communication schedule for a given UDG where the graph is unknown to the individual nodes. Gandhi and Parthasarathy [3] considered this problem and proposed a natural distance-2-coloring-based randomized and distributed algorithm to establish an interference-free transmission schedule. Each node in the network runs Θ(log^2 n) rounds of transmissions and uses Θ(log^2 n) random bits to establish the schedule. Comparing the two protocols, it is interesting to note that ours exhibits better performance. The major difference between the two approaches is that we split the protocol into two phases: pair-wise communication is established in the first phase and compression takes place in the second. Our protocol reduces the number of transmissions as well as the number of random bits to O(log n). Moreover, the number of bits transmitted by our protocol is O(log n) per node, whereas the protocol by Gandhi and Parthasarathy uses O(log^2 n) bits per node. These reductions conserve power, a critical resource in sensor networks. It is worth noting that the running time of both protocols is O(log^2 n), where each transmission is considered an O(1) step. Let b be the maximum number of neighbors of any node. The CDSColor protocol explicitly uses an upper bound on b and on the number of nodes in the graph. After
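Under the uniform-range assumption (with r+ = r, as above) the UDG is easy to construct; a minimal sketch:

```python
import math

def unit_disk_graph(points, r):
    """Build the UDG: an edge joins two sensors iff their Euclidean
    distance is at most the common transmission range r."""
    adj = {i: set() for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= r:
                adj[i].add(j)
                adj[j].add(i)
    return adj
```

The protocol itself never builds this graph (the topology is unknown to the nodes); the construction is only useful for simulating and checking schedules centrally.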


Table 1. Comparing Algorithms

Case                        CDSColor (see [3])   Our Alg.
Random Bits                 O(log^2 n)           O(log n)
# of Transmissions          O(log^2 n)           O(log n)
Bits Transmitted per Node   O(log^2 n)           O(log n)
Number of Steps             O(log^2 n)           O(log^2 n)

the first phase of our protocol, each node will know the exact number of its neighbors with high probability. In order to increase the confidence/probability of total communication, the length of the first phase is set to O(log n), where the constant in the big-oh depends on b. If we do not have a clear upper bound on b, the length of the first phase can be ω(log n) (e.g., O(log n log log n)). By increasing the number of transmissions in the first phase, our protocol establishes communication with high probability (i.e., 1 − 1/n^c for any given constant c). Our analysis of the first phase of the algorithm uses a recurrence relation to derive the accurate/exact probability of establishing pair-wise communication. One can write a simple program to calculate this probability accurately. Using the probability of pair-wise communication, we can find the expected number of transmissions needed to establish communication. The bound obtained from the recurrence relation closely matches the value observed in simulation. For instance, the number of transmissions needed to establish communication between every pair of neighbors in an entire network with a million nodes is around 400. For an incredibly large network, n = 10^50, the bound on the number of transmissions is less than 3000. From this observation, we can safely say that our protocol does not need to know n for any real-life sensor network.

Definition 1. Given a sensor network G = (V, E), we define H1(v) to be the set of nodes in V that are either immediate neighbors of v or neighbors' neighbors (i.e., 1 hop away). For ease of presentation, we refer to {v} ∪ H1(v) as the 1-hop neighborhood of v.
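Definition 1 corresponds to a standard 2-hop neighborhood computation; a sketch on an adjacency-set representation (function name ours):

```python
def h1(adj, v):
    """H1(v): immediate neighbors of v together with neighbors'
    neighbors, excluding v itself (Definition 1)."""
    hood = set(adj[v])
    for u in adj[v]:
        hood |= adj[u]
    hood.discard(v)
    return hood
```

On a path 0–1–2–3, for example, H1(0) = {1, 2}: node 1 is an immediate neighbor and node 2 is a neighbor's neighbor.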

2 First Phase - Establishing Pair-Wise Communication

Each sensor node v selects an id which should be unique among the ids of the nodes in {v} ∪ H1(v). This can be accomplished with high probability by selecting c1 log n random bits as the id, where c1 ≥ 1 is a carefully chosen constant. This is captured in the following lemma, which is trivial to establish.

Lemma 1. Suppose there are n nodes in the network and for each node v we have |H1(v)| ≤ c, a fixed constant. Each node chooses c1 log n random bits as its id, where c1 ≥ 1. The probability that every node in the network chooses a unique id in its 1-hop neighborhood is at least 1 − c/n^(c1−1).
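The id-selection step of Lemma 1 can be simulated directly. The sketch below (our own; the graph layout, c1 value, and seed are assumptions) draws c1·log2(n) random bits per node and checks uniqueness within each 1-hop neighborhood:

```python
import math
import random

def assign_ids(adj, c1=4, seed=1):
    """Each node draws c1*log2(n) random bits as its id (Lemma 1 setting)."""
    rng = random.Random(seed)
    bits = max(1, int(c1 * math.log2(len(adj))))
    return {v: rng.getrandbits(bits) for v in adj}

def ids_unique_in_1hop(adj, ids):
    """True iff every node's id is unique within {v} union H1(v)."""
    for v in adj:
        hood = set(adj[v])
        for u in adj[v]:
            hood |= adj[u]
        hood.discard(v)
        if any(ids[u] == ids[v] for u in hood):
            return False
    return True
```

With c1 = 4 the per-network collision probability is tiny, in line with the 1 − c/n^(c1−1) bound.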


After establishing an id, each node executes the following simple protocol for c2 log n steps, where c2 ≥ 1 is a constant. The choice of c2 depends on the confidence parameter in the high-probability argument; this will become clear when we present the analysis of the protocol.

TorL(p): (one step) Toss a biased coin whose probability of heads is p and of tails is (1 − p). Transmit the node's id if the outcome is heads; listen if the outcome is tails.

We could present the analysis of this protocol for an arbitrary bounded degree network now, but we choose to consider both the line topology and the grid topology before presenting the arbitrary case, for two reasons. The analysis is a bit more exact in the simpler cases, and we ran simulations for the grid topology to see the effectiveness of the protocol for a reasonably large network. Our simulation and calculation showed that c2 log n is only around 3000 for a network of size n = 10^50.
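A direct simulation of TorL(p) on a line makes the protocol concrete. The sketch below is our own (seeded for reproducibility); it counts steps until every node has heard from both neighbors, using the interference rule of the model (a listening node receives only when exactly one of its neighbors transmits):

```python
import random

def torl_line(n, p, max_steps=100000, seed=7):
    """Run TorL(p) on a line of n sensors; return the first step at which
    every node has received the id of each of its neighbors."""
    rng = random.Random(seed)
    nbrs = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
    heard = {i: set() for i in range(n)}
    for step in range(1, max_steps + 1):
        tx = [rng.random() < p for _ in range(n)]   # heads: transmit; tails: listen
        for i in range(n):
            if tx[i]:
                continue                  # cannot transmit and receive at once
            talking = [j for j in nbrs[i] if tx[j]]
            if len(talking) == 1:         # exactly one neighbor: no interference
                heard[i].add(talking[0])
        if all(len(heard[i]) == len(nbrs[i]) for i in range(n)):
            return step
    return None
```

With p = 1/3 (the maximizer of α for two neighbors, cf. Lemma 2 with b = 2), a 100-node line typically finishes within a few dozen steps.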

2.1 Line Topology

The analysis of the first phase of the protocol contains both a high-probability argument and an expected-case argument. The bounds we get from the two arguments do not differ in the asymptotic sense. However, our simulation results for the grid network show that the expected value given by the recurrence relation matches very closely the value observed in simulation. Hence, we present both arguments.

Definition 2. After running a protocol to establish communication, we say a sensor node is unsuccessful if it failed to receive information (that is, the id of the neighbor) from one of its neighbor nodes.

Theorem 1. Let p be a positive real in the range (0, 1/2] and let b = 1/(1 − p(1 − p)^2). Suppose n sensor nodes are uniformly distributed on a line. Assume omnidirectional transmission, where the signal from each node reaches both neighbors on the line.
1. After c2 log n steps, the probability that there exists an unsuccessful sensor is at most 1/n^d, where d ≥ 1 is any fixed constant.
2. The number of steps of TorL(p) needed so that the expected number of unsuccessful sensor nodes is less than 1 is 1 + log_b(2n).

Proof. Consider a sensor node with a neighbor on each side. For i = 0, 1, and 2, let R(i, k) be the probability that the sensor node has successfully received information from exactly i neighbors on or before k rounds. According to protocol TorL(p), p is the probability that a sensor node transmits at each time step. So, in order to receive information at time t, a sensor node must not transmit at t, and one neighbor must be transmitting while the other is not. Let α = 2p(1 − p)(1 − p) = 2p(1 − p)^2 be the probability that a sensor node successfully receives information from one of its neighbors. Since the coin tosses are independent, we can express R(i, k) in the form of the recurrence relation shown below:


R(0, k) = (1 − α)^k
R(1, k) = R(0, k−1) α + R(1, k−1)(1 − α/2)
        = (1 − α)^(k−1) α + R(1, k−1)(1 − α/2),    R(1, 0) = 0
R(2, k) = R(1, k−1)(α/2) + R(2, k−1),              R(2, 1) = 0.

We now show some steps to solve the recurrence relation:

R(1, k) = (1 − α)^(k−1) α + (1 − α/2)(1 − α)^(k−2) α + ... + (1 − α/2)^(k−1) α
        = α(1 − α)^(k−1) Σ_{i=0}^{k−1} [(1 − α/2)/(1 − α)]^i
        = 2[(1 − α/2)^k − (1 − α)^k].

Expanding R(2, k) recursively and substituting for R(1, j) results in:

R(2, k) = (α/2) Σ_{j=1}^{k−1} R(1, j)
        = (α/2) Σ_{j=1}^{k−1} 2[(1 − α/2)^j − (1 − α)^j]
        = α [Σ_{j=1}^{k−1} (1 − α/2)^j − Σ_{j=1}^{k−1} (1 − α)^j]
        = 2[1 − (1 − α/2)^k] − [1 − (1 − α)^k]
        = 1 − [2(1 − α/2)^k − (1 − α)^k].

The probability that a node is not successful after k steps is at most
1 − R(2, k) = 2(1 − α/2)^k − (1 − α)^k ≤ 2(1 − α/2)^k.
We now find a bound on k such that 2(1 − α/2)^k ≤ 1/n^(d+1) for the given d. Simplifying, we get 2n^(d+1) ≤ [2/(2 − α)]^k. Simple algebra shows that this holds for k ≥ [(d+1)/log2(2/(2 − α))] log2(2n). Choose the smallest constant c2 such that c2 log n ≥ [(d+1)/log2(2/(2 − α))] log2(2n). So, if each node runs the protocol for c2 log n steps, the probability that it will be unsuccessful is at most 1/n^(d+1). There are n nodes in the network. Therefore, the probability that there exists an unsuccessful node is at most n × 1/n^(d+1) = 1/n^d.

We define a random variable β(i, k) = 1 if node i received information from all of its neighbors on or before step k, and β(i, k) = 0 otherwise. Suppose there are n sensor nodes in the network. Using R(2, k), we get that the expected value of β(i, k) is E[β(i, k)] = R(2, k) for each 1 ≤ i ≤ n. Observe
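The recurrence for R(i, k) above and its closed form can be cross-checked numerically; a sketch (function names ours):

```python
import math

def r2_recurrence(p, k):
    """R(2, k) computed directly from the recurrence, alpha = 2p(1-p)^2."""
    alpha = 2 * p * (1 - p) ** 2
    r0, r1, r2 = 1.0, 0.0, 0.0            # values at k = 0
    for t in range(1, k + 1):
        r2 = r1 * (alpha / 2) + r2        # uses previous-step r1, r2
        r1 = r0 * alpha + r1 * (1 - alpha / 2)
        r0 = (1 - alpha) ** t
    return r2

def r2_closed(p, k):
    """Closed form: R(2, k) = 1 - [2(1 - a/2)^k - (1 - a)^k]."""
    alpha = 2 * p * (1 - p) ** 2
    return 1 - (2 * (1 - alpha / 2) ** k - (1 - alpha) ** k)

def expected_steps_bound(n, p):
    """k = 1 + log_b(2n) with b = 2/(2 - alpha), part 2 of Theorem 1."""
    alpha = 2 * p * (1 - p) ** 2
    return 1 + math.log(2 * n, 2 / (2 - alpha))
```

For n = 100 and p = 1/3 the bound 1 + log_b(2n) comes out in the mid-thirties, consistent with the simulation behavior reported for the line topology.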


that whether a sensor node receives information from a neighbor depends not only on the random bits of the sensor node but also on the random bits of its neighbor nodes. As a result, the random variables β(i, k) are not independent. However, applying linearity of expectation we get

E[Σ_{i=1}^{n} β(i, k)] = Σ_{i=1}^{n} E[β(i, k)] = n R(2, k).

Therefore, the number of steps k needed to reach E[Σ_{i=1}^{n} β(i, k)] > n − 1 is given by the inequality n R(2, k) > n − 1. Substituting the bound on R(2, k), it suffices to satisfy the inequality 1 − [2(1 − α/2)^k − (1 − α)^k] > (n − 1)/n. Therefore it suffices to show that 2(1 − α/2)^k − (1 − α)^k < 1/n. For α ≥ 0, we have (1 − α/2)^k > (1 − α)^k. Hence, it suffices to satisfy (1 − α/2)^k < 1/(2n), or 2n < [2/(2 − α)]^k. This happens for k = 1 + log_b(2n), where b = 2/(2 − α). Observe that if α increases then k decreases.

2.2 Technical Lemmas

The following results and recurrence relation will help us establish the expected time needed to establish pair-wise communication between all adjacent sensors in a general bounded degree network.

Lemma 2. Let 0 < p < 1, let b ≥ 1 be an integer, and let α = bp(1 − p)^b. The maximum value of α is (b/(b+1))^(b+1) ≤ 1/2, and it occurs when p = 1/(b+1).

Proof. Differentiating the function bp(1 − p)^b with respect to p and setting the derivative equal to 0, we get zeros at p = 0, 1, and 1/(b+1). Differentiating again, it is not hard to verify that the maximum occurs when p = 1/(b+1) and the maximum value is (b/(b+1))^(b+1).

Lemma 3. Let 0 < p < 1, let b ≥ 2 be an integer, let α = bp(1 − p)^b < 1, and let i, k ≥ 0 be integers. Given the following recurrence relation and boundary conditions:

R(0, k) = (1 − α)^k                                                          (k ≥ 0)
R(i, k) = 0                                                                  (i > k)
R(i, k) = [((b+1) − i)/b] α R(i−1, k−1) + [1 − ((b−i)/b) α] R(i, k−1)        ((0 ≤ i ≤ b−1) ∧ (k ≥ i))

the following hypothesis I(i, k) is true for (0 ≤ i ≤ b−1) ∧ (k ≥ i), where C(k, i) denotes the binomial coefficient:

R(i, k) ≤ C(k, i) [(b−1)!/(b−i)!] [α^i / b^(i−1)] [1 − ((b−i)/b) α]^(k−i).

Proof. We prove this by double induction (i.e., on i and k), with the desired inequality as the inductive hypothesis.

The base case, i = 0, is easy to verify: C(k, 0) [(b−1)!/b!] [α^0/b^(−1)] (1 − α)^k = (1 − α)^k, and the condition R(0, k) ≤ (1 − α)^k is true since R(0, k) is defined to be equal to (1 − α)^k.

Assume that the hypothesis holds for all i with i ≤ x ≤ b − 1 and for all k ≥ i. We show that the hypothesis holds for i = x + 1 ≤ b − 1 and for all k ≥ i. This is again proved by induction, on k ≥ i = x + 1. The base case of this induction is k = i = x + 1, where the hypothesis reads

R(x+1, x+1) ≤ C(x+1, x+1) [(b−1)!/(b−(x+1))!] [α^(x+1)/b^x] [1 − ((b−(x+1))/b) α]^((x+1)−(x+1)) = [(b−1)!/(b−(x+1))!] α^(x+1)/b^x.

Now from the recurrence relation:

R(x+1, x+1) = [(b−x)/b] α R(x, x) + [1 − ((b−(x+1))/b) α] R(x+1, x).

Substituting R(x, x) ≤ [(b−1)!/(b−x)!] α^x/b^(x−1) and R(x+1, x) = 0, we get

R(x+1, x+1) ≤ [(b−x)/b] α [(b−1)!/(b−x)!] α^x/b^(x−1) = [(b−1)!/(b−(x+1))!] α^(x+1)/b^x.

Hence the base case is true. For the inductive step, we assume that the hypothesis I(i, k) holds for (i ≤ x and k ≥ i) or (i = x + 1 and x + 1 ≤ k ≤ y), and prove that I(x+1, y+1) is also true. Hypothesis I(x+1, y) states that

R(x+1, y) ≤ C(y, x+1) [(b−1)!/(b−(x+1))!] [α^(x+1)/b^x] [1 − ((b−(x+1))/b) α]^(y−(x+1)).

Now consider the recurrence relation with i = x + 1 and k = y + 1:

R(x+1, y+1) = [(b−x)/b] α R(x, y) + [1 − ((b−(x+1))/b) α] R(x+1, y).

Substituting for R(x+1, y) from hypothesis I(x+1, y), we have

R(x+1, y+1) ≤ [(b−x)/b] α R(x, y) + C(y, x+1) [(b−1)!/(b−(x+1))!] [α^(x+1)/b^x] [1 − ((b−(x+1))/b) α]^((y+1)−(x+1)).

Substituting for R(x, y) from hypothesis I(x, y), we have

R(x+1, y+1) ≤ [(b−x)/b] α C(y, x) [(b−1)!/(b−x)!] [α^x/b^(x−1)] [1 − ((b−x)/b) α]^(y−x) + C(y, x+1) [(b−1)!/(b−(x+1))!] [α^(x+1)/b^x] [1 − ((b−(x+1))/b) α]^(y−x)
= C(y, x) [(b−1)!/(b−(x+1))!] [α^(x+1)/b^x] [1 − ((b−x)/b) α]^(y−x) + C(y, x+1) [(b−1)!/(b−(x+1))!] [α^(x+1)/b^x] [1 − ((b−(x+1))/b) α]^(y−x).

Note that 1 − ((b−x)/b) α ≤ 1 − ((b−(x+1))/b) α; using this in the above expression,

R(x+1, y+1) ≤ [C(y, x) + C(y, x+1)] [(b−1)!/(b−(x+1))!] [α^(x+1)/b^x] [1 − ((b−(x+1))/b) α]^(y−x)
= C(y+1, x+1) [(b−1)!/(b−(x+1))!] [α^(x+1)/b^x] [1 − ((b−(x+1))/b) α]^((y+1)−(x+1)).

Hence the result.

Lemma 4. Let 0 < p < 1, let b ≥ 2 be an integer, and let α = bp(1 − p)^b. Given the recursive definition of R(i, k) for 0 ≤ i < b and k ≥ 0, let R(b, k) = 1 − Σ_{i=0}^{b−1} R(i, k). There exist constants c > 0 and 1 − α/b < ε < 1 such that for every integer k ≥ c we have R(b, k) ≥ 1 − ε^k. That is, lim_{k→∞} R(b, k) = 1, and the convergence rate is exponential in k. The constants ε and c depend on the constant b.

Proof. From Lemma 2 we have α ≤ 1/2. Applying Lemma 3, for (0 ≤ i ≤ b−1) ∧ (k ≥ i) we get

R(i, k) ≤ C(k, i) [(b−1)!/(b−i)!] [α^i/b^(i−1)] [1 − ((b−i)/b) α]^(k−i)
       ≤ α^i k^i (1 − α/b)^(k−i)
       ≤ k^(b−1) (1 − α/b)^k          (since α ≤ 1/2 and i ≤ b − 1)
       = (1 − α/b)^(k − (b−1) log k / (log b − log(b−α))).

Recall that R(0, k) = (1 − α)^k. Hence lim_{k→∞} R(i, k) = 0 for 0 ≤ i ≤ b − 1. Substituting the bound for R(i, k), we get R(b, k) ≥ 1 − b(1 − α/b)^(k − O(log k)), which converges to 1 exponentially. Observe that there exist constants 1 − α/b < ε < 1 and c > 0 such that for k ≥ c we have b(1 − α/b)^(k − O(log k)) ≤ ε^k. As a consequence, we get R(b, k) ≥ 1 − ε^k.
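The exponential convergence of Lemma 4 can be observed by iterating the recurrence of Lemma 3 directly; a numerical sketch (our own code, not from the paper):

```python
def receive_all_prob(b, p, k_max):
    """R(b, k_max) = 1 - sum_{i<b} R(i, k_max): probability that a node
    with b neighbors has heard from all of them within k_max rounds,
    computed from the recurrence of Lemma 3 with alpha = b*p*(1-p)^b."""
    alpha = b * p * (1 - p) ** b
    R = [1.0] + [0.0] * (b - 1)           # R(i, 0): R(0,0)=1, R(i,0)=0 for i>=1
    for k in range(1, k_max + 1):
        new = [(1 - alpha) ** k] + [0.0] * (b - 1)
        for i in range(1, b):
            new[i] = ((b + 1 - i) / b) * alpha * R[i - 1] \
                   + (1 - ((b - i) / b) * alpha) * R[i]
        R = new
    return 1.0 - sum(R)
```

With b = 4 and p = 1/(b+1) = 0.2 (the maximizer from Lemma 2) the probability climbs toward 1 exponentially in the number of rounds, as the lemma predicts.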

2.3 Arbitrary Bounded Degree Topology

Looking carefully at the protocol and the proof techniques of the previous section, it becomes clear that they extend to any arbitrary bounded degree sensor network.

Theorem 2. Let G be an arbitrary sensor network with n nodes in which each node has at most b neighbors, where b is a constant. Let p be a positive real in the range (0, 1/2]. Each node repeatedly runs TorL(p).
1. After c3 log n steps, the probability that there exists an unsuccessful sensor is at most 1/n^d, where d ≥ 1 is any fixed constant and c3 is a positive constant that depends on d.
2. The number of steps needed so that the expected number of unsuccessful sensors is less than 1 is c4 log n, where c4 is another constant.
3. The number of bits transmitted by a node is O(log^2 n), and the number of random bits used by a node is O(log n).


Proof. Suppose a node x has a neighbors, where 1 ≤ a ≤ b. Let R(i, k) for 0 ≤ i ≤ a be the probability that the sensor node has received information from exactly i neighbors at the end of k rounds. This probability obeys the following recurrence relation:

R(0, k) = (1 − α)^k                                                          (k ≥ 0)
R(i, k) = 0                                                                  (i > k)
R(i, k) = [((a+1) − i)/a] α R(i−1, k−1) + [1 − ((a−i)/a) α] R(i, k−1)        ((0 ≤ i ≤ a−1) ∧ (k ≥ i)).

Applying Lemma 4, we know R(a, k) ≥ 1 − ε^k, where ε is a positive constant less than 1. Therefore, the probability that the sensor is unsuccessful after k rounds is at most ε^k = 2^(−k log2(1/ε)). Substituting k = c3 log n, we get 2^(−k log2(1/ε)) = 1/n^(c3 log2(1/ε)). Choose c3 such that c3 log2(1/ε) ≥ d + 1. For this choice of c3, the probability that the node is unsuccessful is at most 1/n^(d+1). Since there are n nodes, the probability that any node is unsuccessful is at most 1/n^d.

Assume that we number the nodes i = 1 through n, and let a_i ≤ b be the number of neighbors of node i. In order to calculate the expected number of steps of TorL(p) needed to have fewer than one unsuccessful node, we define a random variable β_p(i, k) = 1 if node i received information from all of its neighbors on or before step k, and β_p(i, k) = 0 otherwise. The expected value of β_p(i, k), denoted by E[β_p(i, k)], is equal to R(a_i, k). Observe that R(a_i, k) ≥ R(b, k) for all a_i ≤ b. The expected number of nodes in the entire network that receive communication from all of their neighbors after k rounds is E[Σ_{i=1}^{n} β_p(i, k)]. Applying linearity of expectation, we get

E[Σ_{i=1}^{n} β_p(i, k)] = Σ_{i=1}^{n} E[β_p(i, k)] = Σ_{i=1}^{n} R(a_i, k).

Applying Lemma 4, the number of steps k needed to reach the bound Σ_{i=1}^{n} E[β_p(i, k)] > n − 1 is given by the inequality n(1 − ε^k) > n − 1. Simplifying, we get ε^k < 1/n, or 2^(k log2(1/ε)) > n. The result follows if we choose k = c4 log2 n with c4 ≥ 1/log2(1/ε).

Finally, observe that each node uses random bits to select its id and O(1) random bits per step of TorL(p). Since the id is O(log n) bits and the number of steps of TorL(p) is also O(log n), the total number of random bits is O(log n). It is easy to see that the number of transmissions per node is O(log n).

2.4 Simulation and Practical Bounds for Grid Network

We ran simulations to estimate the number of steps needed for a large sensor network in practice. For a network with one million sensors, we needed approximately 373 time slots to establish communication between every pair of adjacent


nodes. This is an average over 20 random runs. For three million nodes, the number of rounds is approximately 395 time slots. Based on our recurrence relation, our calculations for a network of size 10^12 show that communication will be established in 650 time slots or steps. The table below gives the average number of rounds needed to establish communication for networks of different sizes. The probability of transmission is set to p = 1/(B + 1) = 1/(8 + 1) = 1/9, where B = 8 is the number of neighbors of any node.

Table 2. Simulation On Grid Network

Network Size   360,000  640,000  1,000,000  1,440,000  1,960,000  2,560,000  3,240,000
Avg. # Steps   342      359      373        385        394        392        395

Table 3. Probability Bounds for Grid - Based on Recurrence Relation

Steps                        501       1001       1501      2001      2501      3001
Prob. of Failure of a Node   1.86e-09  4.541e-19  1.10e-28  2.69e-38  6.57e-48  1.601e-57

However, when we set the probability of transmission to p = 1/2, the number of rounds needed to establish communication exceeds 3000 even for a small network of one hundred nodes. So it is critical to set p close to 1/(B + 1).
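The grid experiments can be reproduced in miniature. The sketch below (our own, seeded) runs TorL(p) on a small grid with Moore-neighborhood adjacency, the B = 8 setting of Table 2:

```python
import random

def torl_grid(side, p, max_steps=5000, seed=3):
    """TorL(p) on a side x side grid with 8-neighbor adjacency; returns the
    first step at which every node has heard from all of its neighbors."""
    rng = random.Random(seed)
    nodes = [(x, y) for x in range(side) for y in range(side)]
    adj = {(x, y): [(x + dx, y + dy)
                    for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                    if (dx, dy) != (0, 0)
                    and 0 <= x + dx < side and 0 <= y + dy < side]
           for (x, y) in nodes}
    heard = {v: set() for v in nodes}
    for step in range(1, max_steps + 1):
        tx = {v: rng.random() < p for v in nodes}
        for v in nodes:
            if tx[v]:
                continue
            talking = [u for u in adj[v] if tx[u]]
            if len(talking) == 1:          # interference-free reception
                heard[v].add(talking[0])
        if all(len(heard[v]) == len(adj[v]) for v in nodes):
            return step
    return None
```

On a 15 × 15 grid with p = 1/9 this finishes in a few hundred steps, consistent in order of magnitude with Table 2; raising p toward 1/2 slows it dramatically, matching the remark above.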

3 Second Phase: Compression Protocol

Given any schedule that establishes pair-wise communication between neighbors (e.g., the schedule of length O(log n) from the last section), we show how to compress the schedule to a small constant length in which each node broadcasts once to communicate to its neighbors. After the first run of the protocol TorL(p) for c3 log n steps, each node has already communicated its id to its neighbors with high probability. However, a node does not know when it succeeded in its communication attempts. Let T(x) (resp. L(x)) be the set of transmission (resp. listening) steps of sensor x. Each sensor x runs another c3 log n steps to communicate to its neighbors. In this iteration no random bits are used, and each sensor x transmits only during steps in T(x). When it transmits, each sensor x sends the following two pieces of information:
1. The list of ids of its neighbors.
2. For each neighbor y of x, the pair (id of neighbor y, earliest time in L(x) at which x listens to a transmission from y).
At the end of this second iteration, each sensor knows its b neighbors and its neighbors' neighbors. Each sensor also knows at most b transmission times that it must use from now on to communicate to its neighbors during c3 log n steps.

Analysis of a Simple Randomized Protocol


It is important to observe that no more random bits are used to run the transmission schedule after the first round. During each round, each sensor must listen at most b times and transmit at most b times. This conserves power. However, the biggest drawback is that communication between neighbors takes place only once in c3 log n steps. Let us call such a long one-to-one communication schedule Long1to1. After the compression, each node transmits once and listens b times, and communication between neighbors takes place once in every O(1) steps.

Compressor: Protocol for a Node
1. Let b be the number of neighbors and ℓ the number of neighbors' neighbors.
2. Let T = {1, 2, 3, . . . , (b + ℓ + 1)}.
3. Maintain a set AV of available slots; initially AV = T.
4. Repeat the following until a slot is chosen for the node:
   (a) Choose a random number x in AV. Run one round of schedule Long1to1 to communicate (id, x) to the neighbors. Let N be the set of pairs (id, x) received from the b neighbors.
   (b) Run one round of schedule Long1to1 to communicate (id, N) to the neighbors. Let M be the set of all random numbers chosen by the neighbors or neighbors' neighbors during this iteration. If x is not in M, then x is set to be the chosen slot for the sensor.
   (c) Run one round of schedule Long1to1 to communicate to the neighbors: transmit (id, x) if x is the chosen slot and empty otherwise. Let C be the set of pairs (id, x) received from the neighbors.
   (d) Run one more round of schedule Long1to1 to communicate (id, C) to the neighbors. Let P be the set of slot numbers chosen during this round by a neighbor or neighbors' neighbor. Update AV = AV − P.
End Compressor Protocol

Theorem 3. Suppose there are n nodes in the network. For any d > 0, the probability that a node does not choose a slot after c5 loge n iterations of the loop at step 4 is at most 1/n^(d+1), where c5 = 2^(b+ℓ)(d + 1). With probability at least 1 − 1/n^d, every node in the network successfully chooses a slot.
Proof. Consider an arbitrary node z and one iteration of the loop at step 4. Without loss of generality, let 0 ≤ k ≤ b + ℓ be the number of neighbors or neighbors' neighbors of z without a chosen slot for communication. Observe that if there is only one available choice, then the node chooses the only remaining slot. Otherwise, each node has at least two choices, so the probability of choosing any particular number is at most 1/2. Hence, when node z chooses a slot, the probability that the k other nodes in the neighborhood do not choose the same number is at least (1 − 1/2)^k = 1/2^k. Therefore the probability that z succeeds in choosing a number in one iteration is at least 1/2^(b+ℓ), since k ≤ b + ℓ. The probability that a node z fails


B. Kalyanasundaram and M. Velauthapillai

to succeed in choosing a number after c5 loge n steps is at most (1 − 1/2^(b+ℓ))^(c5 loge n). Set c5 = 2^(b+ℓ)(d + 1). Observe that (1 − 1/2^(b+ℓ))^(2^(b+ℓ)) ≤ 1/e. Therefore, the probability that z fails to succeed in choosing a number after c5 loge n steps is at most 1/e^((d+1) loge n) = 1/n^(d+1). The result follows by a union bound, since there are at most n nodes.
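The slot-choosing loop can be checked with a small centralized simulation. The sketch below is our own illustration, not the paper's code: it abstracts away the Long1to1 rounds, so in each iteration every undecided node draws a slot from its available set, keeps it if no undecided node within graph distance two drew the same slot, and every node then removes newly fixed slots of its two-hop neighborhood from its available set. The palette size b + ℓ + 1 (where b is the number of neighbors and ℓ the number of neighbors' neighbors) guarantees the available set never empties.

```python
import random

def compress_schedule(adj, seed=7):
    """Centralized simulation of the Compressor protocol: each node must
    pick a slot distinct from every node within graph distance <= 2.
    adj: dict node -> set of neighbor nodes. Returns dict node -> slot."""
    random.seed(seed)
    # Two-hop neighborhood of v: its neighbors and neighbors' neighbors, minus v.
    two_hop = {v: set().union(*[adj[u] for u in adj[v]], adj[v]) - {v} for v in adj}
    # Palette of b + ℓ + 1 slots per node (one more than the two-hop degree).
    palette = {v: set(range(1, len(two_hop[v]) + 2)) for v in adj}
    chosen = {}
    while len(chosen) < len(adj):
        undecided = [v for v in adj if v not in chosen]
        draw = {v: random.choice(sorted(palette[v])) for v in undecided}
        for v in undecided:
            # v keeps its draw iff no undecided two-hop competitor drew the same slot.
            if all(draw[u] != draw[v] for u in two_hop[v] if u in draw):
                chosen[v] = draw[v]
        for v in adj:  # remove slots fixed within each node's two-hop neighborhood
            palette[v] -= {chosen[u] for u in two_hop[v] if u in chosen}
    return chosen

# Demo on an 8-cycle: every node ends with a slot distinct within two hops.
adj = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
print(compress_schedule(adj))
```

Because a node only removes slots that were fixed in its two-hop neighborhood, and there are at most b + ℓ such nodes, at least one slot always remains available, matching the progress argument in the proof above.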

4  Conclusion

This paper provides a tight analysis of a randomized protocol for establishing a single interference-free broadcast schedule for the nodes of any bounded-degree network. Our protocol is simple, and it reduces the number of random bits and the number of broadcasts from O(log^2 n) to O(log n). Experimental results show that the bounds predicted by the analysis are reasonably accurate.

References

1. Abadi, D.J., Madden, S., Lindner, W.: REED: Robust, efficient filtering and event detection in sensor networks. In: VLDB, pp. 769–780 (2005)
2. Bonnet, P., Gehrke, J., Seshadri, P.: Querying the physical world. IEEE Personal Communications Magazine, 10–15 (2000)
3. Gandhi, R., Parthasarathy, S.: Distributed algorithms for connected domination in wireless networks. Journal of Parallel and Distributed Computing 67(7), 848–862 (2007)
4. Juang, P., Oki, H., Wang, Y., Martonosi, M., Peh, L., Rubenstein, D.: Energy-efficient computing for wildlife tracking: Design tradeoffs and early experiences with ZebraNet (2002)
5. Kalyanasundaram, B., Velauthapillai, M.: Communication complexity of continuous pattern detection. Unpublished manuscript (January 2009)
6. Karl, H., Willig, A.: Protocols and Architectures for Wireless Sensor Networks. John Wiley & Sons, Chichester (2005)
7. Kim, S., Pakzad, S., Culler, D., Demmel, J., Fenves, G., Glaser, S., Turon, M.: Wireless sensor networks for structural health monitoring. In: SenSys 2006: Proceedings of the 4th International Conference on Embedded Networked Sensor Systems, pp. 427–428. ACM, New York (2006)
8. Mainwaring, A., Polastre, J., Szewczyk, R., Culler, D.: Wireless sensor networks for habitat monitoring. In: Proceedings of the 1st ACM International Workshop on Wireless Sensor Networks and Applications, pp. 88–97 (2002)
9. Paek, J., Chintalapudi, K., Govindan, R., Caffrey, J., Masri, S.: A wireless sensor network for structural health monitoring: Performance and experience. In: The Second IEEE Workshop on Embedded Networked Sensors, EmNetS-II, pp. 1–10 (May 2005)

Reliable Networks with Unreliable Sensors

Srikanth Sastry¹, Tsvetomira Radeva², Jianer Chen¹, and Jennifer L. Welch¹

¹ Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77840, USA
  {sastry,chen,welch}@cse.tamu.edu
² Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  [email protected]

Abstract. Wireless sensor networks (WSNs) deployed in hostile environments suffer from a high rate of node failure. We investigate the effect of such failure rates on network connectivity. We provide a formal analysis that establishes the relationship between node density, network size, failure probability, and network connectivity. We show that as network size and density increase, the probability of network partitioning becomes arbitrarily small, and that large networks can maintain connectivity despite a significantly high probability of node failure. We derive mathematical functions that provide lower bounds on network connectivity in WSNs, and we compute these functions for some realistic values of node reliability, area covered by the network, and node density, to show that, for instance, networks with over a million nodes can maintain connectivity with a probability exceeding 99% despite node failure probability exceeding 57%.

1  Introduction

Wireless Sensor Networks (WSNs) [2] are being used in a variety of applications ranging from volcanology [21] and habitat monitoring [18] to military surveillance [10]. Often, in such deployments, premature uncontrolled node crashes are common. The reasons for this include, but are not limited to, hostility of the environment (such as extreme temperature, humidity, or soil acidity), node fragility (especially if the nodes are deployed from the air onto the ground), and quality-control issues in the manufacturing of the sensors. Consequently, crash fault tolerance becomes a necessity (not just a desirable feature) in WSNs. Typically, a sufficiently dense node distribution with redundancy in connectivity and coverage provides the necessary fault tolerance. In this paper, we analyze the connectivity fault tolerance of such large-scale sensor networks and show how, despite high unreliability, flaky sensors can build robust networks. The results in this paper address the following questions: Given a static WSN deployment (of up to a few million nodes) where (a) the node density is D nodes

This work was supported in part by NSF grant 0964696 and Texas Higher Education Coordinating Board grant NHARP 000512-0130-2007.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 281–292, 2011. © Springer-Verlag Berlin Heidelberg 2011


per unit area, (b) the area of the region is Z units, and (c) each node can fail¹ with an independent and uniform probability ρ: what is the probability P that the network is connected (that is, not partitioned)? What is the relationship between P, ρ, D, and Z?

Motivation. The foregoing questions are of significant practical interest. A typical specification for designing a WSN is the area of coverage, an upper bound on the (financial) cost, and QoS guarantees on connectivity (and coverage). High-reliability sensor nodes offer better guarantees on connectivity but also increase the cost. An alternative is to reduce costs by using less reliable nodes, but the requisite guarantees on connectivity might then necessitate a greater node density (that is, a greater number of nodes per unit area), which again increases the cost. As a network designer, it is desirable to have a function that accepts, as input, the specifications of a WSN and outputs feasible and appropriate design choices. We derive the elements of such a function in Sect. 6 and demonstrate its use in Sect. 7.

Contribution. This paper has three main contributions. First, we formalize and prove the intuitive conjecture that as node reliability and/or node density of a WSN increases, the probability of connectivity also increases. We provide a probabilistic analysis of the relationship between node reliability (ρ), node density (D), area of the WSN region (Z), and the probability of network connectivity (P); we provide lower bounds for P as a function of ρ, D, and Z. Second, we provide concrete lower bounds on the expected connectivity probability for various reasonable values of ρ, D, and Z. Third, we use a new technique of hierarchical network analysis to derive the lower bounds on a non-hierarchical WSN. To our knowledge, we are the first to utilize this approach in wireless sensor networks. The approach, model, and proof techniques themselves may be of independent interest.
Organization. The rest of this paper is organized as follows. Related work is described in Section 2, and the system model assumptions are discussed in Section 3. Our methodology involves tiling the plane with regular hexagons; the analysis and results use a topological object called a level-z polyhex that is derived from a regular hexagon and is introduced in Section 4. Section 5 introduces the notion of level-z connectedness of an arbitrary WSN region, and Section 6 uses this notion to formally establish the relationship between P, ρ, D, and Z. Finally, Section 7 provides lower bounds on connectivity for various values of ρ, D, and Z.

2  Related Work

There is a significant body of work on static analysis of topological issues associated with WSNs [12]. These issues are discussed in the context of coverage [13], connectivity [19], and routing [1].

¹ A node is said to fail if it crashes prior to its intended lifetime. See Sect. 3 for details.


The results in [19] focus on characterizing the fault tolerance of sensor networks by establishing the k-connectivity of a WSN. However, such characterization results in a poor lower bound of k − 1 on the fault tolerance, corresponding to the worst-case behavior of faults; it fails to characterize the expected probability of network partitioning in practical deployments. In other related results, Bhandari et al. [5] focus on the optimal node density (or degree) for a WSN to be connected w.h.p., and Kim et al. [11] consider connectivity in randomly duty-cycled WSNs in which nodes take turns being active to conserve power. A variant of network connectivity, called partial connectivity, is explored in [6], which derives the relationship between node density and the percentage f of the network expected to be connected. Our research addresses a different, but related, question: given a fixed WSN region with a fixed initial node density (and hence, degree) and a fixed failure probability, what is the probability that the WSN will remain connected? The results in [16,4,22,20,3] establish and explore the relationship between coverage and connectivity. The results in [22] and [20] show that in large sensor networks, if the communication radius rc is at least twice the coverage radius rs, then coverage of a convex area implies connectivity among the non-faulty nodes. In [4], Bai et al. establish optimal coverage and connectivity in regular patterns, including square grids and hexagonal lattices where rc/rs < 2, by deploying additional sensors at specific locations. Results from [16] show that even if rc = rs, large networks in a square region can maintain connectivity despite high failure probability; however, connectivity does not imply coverage. Ammari et al. extend these results in [3] to show that if rc/rs = 1 in a k-covered WSN, then the network fault tolerance is given by 4rc(rc + rs)k/rs^2 − 1 for a sparse distribution of node crashes.
Another related result [17] shows that in a uniform random deployment of sensors in a WSN covering the entire region, the probability of maintaining connectivity approaches 1 as rc /rs approaches 2. Our work diﬀers from the works cited above in three aspects: (a) we focus exclusively on maintaining total connectivity, (b) while the results in [16,4,22,20] apply to speciﬁc deployment patterns or shape of a region, our results and methodology can be applied to any arbitrary region and any constant node density, and (c) our analysis is probabilistic insofar as node crashes are assumed to be independent random events, and we focus on the probability of network connectivity in the average case instead of the worst case. The tiling used in our model induces a hierarchical structure which can be used to decompose the connectivity property of a large network into connectivity properties of constituent smaller sub-networks of similar structure. This approach was ﬁrst introduced in [9], and subsequently used to analyze fault tolerance of hypercube networks [7] and mesh networks [8]. Our approach diﬀers from those in [7] and [8] as we construct higher order polyhex tiling using the underlying hexagons to derive a recursive function that establishes a lower bound on network connectivity as a function of ρ and D.

3  System Model

We make the following simplifying assumptions:
– Node. The WSN has a finite fixed set of n nodes. Each node has a communication radius R.
– Region and tiles. A WSN region is assumed to be a finite plane tiled by regular hexagons whose sides are of length l such that nodes located in a given hexagon can communicate reliably² with all the nodes in the same hexagon and in adjacent hexagons. We assume that each hexagon contains at least D nodes.
– Faults. A node can fail only by crashing before the end of its intended lifetime. Faults are independent, and each node has a constant probability ρ of failing.
– Empty tile. A hexagon is said to be empty if it contains only faulty nodes.

We say that two non-faulty nodes p and p′ are connected if either p and p′ are in the same or neighboring hexagons, or there exists some sequence of non-faulty nodes pi, pi+1, . . . , pj such that p (respectively, p′) and pi (respectively, pj) are in adjacent hexagons, and pk and pk+1 are in adjacent hexagons for i ≤ k < j. We say that a region is connected if every pair of non-faulty nodes p and p′ in the region is connected.
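Under this model, connectivity of the non-faulty nodes reduces to connectivity of the non-empty hexagons under hexagonal adjacency, which is easy to check mechanically. The sketch below is our own illustration (not part of the paper): hexagons are represented in axial coordinates, where each hexagon (q, r) has six neighbors, and a breadth-first search tests whether the non-empty hexagons form a single connected component.

```python
from collections import deque

# The six axial-coordinate offsets of a hexagon's neighbors.
HEX_DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]

def hexes_connected(nonempty):
    """Return True iff the given set of non-empty hexagons (axial coords)
    forms one connected component under hexagonal adjacency."""
    nonempty = set(nonempty)
    if len(nonempty) <= 1:
        return True
    start = next(iter(nonempty))
    seen, frontier = {start}, deque([start])
    while frontier:  # breadth-first search over non-empty hexagons only
        q, r = frontier.popleft()
        for dq, dr in HEX_DIRS:
            nxt = (q + dq, r + dr)
            if nxt in nonempty and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen == nonempty
```

This is the predicate that the brute-force enumeration of Nz,i in Sect. 6 must evaluate for each configuration of empty hexagons.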

4  Higher Level Tilings: Polyhexes

For the analysis of WSNs in an arbitrary region, we use the notion of higher level tilings, grouping sets of contiguous hexagons into 'super tiles' such that some specific properties (like the ability to tile the Euclidean plane) are preserved. Such 'super tiles' are called level-z polyhexes. Different values of z specify different level-z polyhexes. In this section we define a level-z polyhex and specify its properties. The following definitions are borrowed from [14]: A tiling of the Euclidean plane is a countable family of closed sets called tiles, such that the union of the sets is the entire plane and the interiors of the sets are pairwise disjoint. We are concerned only with monohedral tilings — tilings in which every tile is congruent to a single fixed tile called the prototile. In our case, a regular hexagon is the prototile. We say that the prototile admits the tiling. A patch is a finite collection of non-overlapping tiles such that their union is a closed topological disk³. A translational patch is a patch such that the tiling consists entirely of a lattice of translations of that patch.

² We assume that collision resolution techniques are always successful in ensuring reliable communication.
³ A closed topological disk is the image of a closed circular disk under a homeomorphism. Roughly speaking, a homeomorphism is a continuous stretching and bending of the object into a new shape (you are not allowed to tear or 'cut holes' into the object). Thus, any two-dimensional shape that has a closed boundary, finite area, and no 'holes' is a closed topological disk. This includes squares, circles, ellipses, hexagons, and polyhexes.

Fig. 1. Examples of Polyhexes: (a) the gray tiles form a level-2 polyhex; (b) a level-3 polyhex formed by 7 level-2 polyhexes A–F.

We now define a translational patch of regular hexagons called a level-z polyhex, for z ∈ N, as follows:
– A level-1 polyhex is a regular hexagon: a prototile.
– A level-z polyhex for z > 1 is a translational patch of seven level-(z − 1) polyhexes that admits a hexagonal tiling.
Note that each level-z polyhex is made of seven level-(z − 1) polyhexes. Therefore, the total number of tiles in a level-z polyhex is size(z) = 7^(z−1). Figure 1(a) illustrates the formation of a level-2 polyhex with seven regular hexagons, and Fig. 1(b) illustrates how seven level-2 polyhexes form a level-3 polyhex. A formal proof that such level-z polyhexes exist for arbitrary values of z (in an infinite plane tessellated by regular hexagons) is available at [15].

5  Level-z Polyhexes and Connectivity

The analysis in Section 6 is based on the notion of level-z connectedness, which we introduce here. First, we define a 'side' of a level-z polyhex. Second, we introduce the concepts of connected level-z polyhexes and level-z connectedness in a WSN region. Finally, we show how level-z connectedness implies that all non-faulty nodes in a level-z polyhex of a WSN are connected. We use this result and the definition of level-z connectedness to derive a lower bound on the probability of network connectivity in Section 6.

Side. The set of boundary hexagons of a level-z polyhex that are adjacent to a given neighboring level-z polyhex is said to be a 'side' of the level-z polyhex. Since a level-z polyhex can have 6 neighboring level-z polyhexes, every level-z polyhex has 6 'sides'. The number of hexagons along each 'side' (also called the 'length of the side') is given by sidelen(z) = 1 + Σ_{i=0}^{z−2} 3^i for z ≥ 2.⁴

⁴ The proof of this equation is a straightforward induction on z and is omitted.
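The two polyhex measures used throughout the analysis follow directly from the definitions size(z) = 7^(z−1) and sidelen(z) = 1 + Σ_{i=0}^{z−2} 3^i. The helper functions below are our own illustration; the closed form in the comment is obtained by summing the geometric series.

```python
def size(z: int) -> int:
    """Number of hexagons in a level-z polyhex: size(z) = 7^(z-1)."""
    return 7 ** (z - 1)

def sidelen(z: int) -> int:
    """Hexagons along one 'side' of a level-z polyhex (z >= 2):
    sidelen(z) = 1 + sum_{i=0}^{z-2} 3^i = 1 + (3^(z-1) - 1) / 2."""
    return 1 + sum(3 ** i for i in range(z - 1))

print([size(z) for z in range(1, 8)])     # [1, 7, 49, 343, 2401, 16807, 117649]
print([sidelen(z) for z in range(2, 6)])  # [2, 5, 14, 41]
```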


We now define what it means for a level-z polyhex to be connected. Intuitively, a level-z polyhex is connected if the network of nodes in the level-z polyhex is not partitioned.

Connected level-z polyhex. A level-z polyhex Tzi is said to be connected if, given the set Λ of all hexagons in Tzi that contain at least one non-faulty node, for every pair of hexagons p and q from Λ, there exists some (possibly empty) sequence of hexagons t1, t2, . . . , tj such that {t1, t2, . . . , tj} ⊆ Λ, t1 is a neighbor of p, every ti is a neighbor of ti+1, and tj is a neighbor of q. Note that if a level-z polyhex is connected, then all the non-faulty nodes in the level-z polyhex are connected as well.

We are now ready to define the notion of level-z connectedness in a WSN region.

Level-z connectedness. A WSN region W is said to be level-z connected if there exists some partitioning of W into disjoint level-z polyhexes such that each such level-z polyhex is connected, and for every pair of such level-z polyhexes Tzp and Tzq, there exists some (possibly empty) sequence of (connected) level-z polyhexes Tz1, Tz2, . . . , Tzj (from the partitioning of W) such that Tz1 is a neighbor of Tzp, every Tzi is a neighbor of Tz(i+1), and Tzj is a neighbor of Tzq. Additionally, each 'side' of each Tzi has at least sidelen(z)/2 non-empty hexagons.

We are now ready to prove the following theorem:

Theorem 1. Given a WSN region W, if W is level-z connected, then all non-faulty nodes in W are connected.

Proof. Suppose that the region W is level-z connected. It follows that there exists some partitioning Λ of W into disjoint level-z polyhexes such that each such level-z polyhex is connected, and for every pair of such level-z polyhexes Tzp and Tzq, there exists some (possibly empty) sequence of (connected) level-z polyhexes Tz1, Tz2, . . . , Tzj (from the partitioning of W) such that Tz1 is a neighbor of Tzp, every Tzi is a neighbor of Tz(i+1), and Tzj is a neighbor of Tzq.
Additionally, each 'side' of each Tzi has at least sidelen(z)/2 non-empty hexagons. To prove the theorem, it is sufficient to show that for any two non-faulty nodes in W, in hexagons p and q respectively, the hexagons p and q are connected. Let hexagon p lie in a level-z polyhex Tzp (∈ Λ), and let q lie in a level-z polyhex Tzq (∈ Λ). Note that since Λ is a partitioning of W, either Tzp = Tzq or Tzp and Tzq are disjoint. If Tzp = Tzq, then since Tzp is connected, it follows that p and q are connected. Hence, all non-faulty nodes in p are connected with all non-faulty nodes in q, and the theorem is satisfied. If Tzp and Tzq are disjoint, then it follows from the definition of level-z connectedness that there exists some sequence of connected level-z polyhexes Tz1, Tz2, . . . , Tzj such that Tz1 is a neighbor of Tzp, every Tzi is a neighbor of Tz(i+1), and Tzj is a neighbor of Tzq. Additionally, each 'side' of each Tzi has at least sidelen(z)/2 non-empty hexagons. Consider any two neighboring level-z polyhexes (Tzm, Tzn) ∈ Λ × Λ. Each 'side' of Tzm and Tzn has sidelen(z) hexagons. Therefore, Tzm and Tzn have


sidelen(z) boundary hexagons such that each such hexagon from Tzm (respectively, Tzn) is adjacent to two boundary hexagons in Tzn (respectively, Tzm), except for the two boundary hexagons on either end of the 'side' of Tzm (respectively, Tzn); these two hexagons are adjacent to just one hexagon in Tzn (respectively, Tzm). We know that at least sidelen(z)/2 of these boundary hexagons are non-empty. It follows that there exists at least one non-empty hexagon in Tzm that is adjacent to a non-empty hexagon in Tzn. Such a pair of non-empty hexagons (one in Tzm and the other in Tzn) forms a 'bridge' between Tzm and Tzn, allowing nodes in Tzm to communicate with nodes in Tzn. Since Tzm and Tzn are connected level-z polyhexes, it follows that nodes within Tzm and Tzn are connected as well. Additionally, we have established that there exist at least two hexagons, one in Tzm and one in Tzn, that are connected. It follows that nodes in Tzm and Tzn are connected with each other as well. Thus, Tzp and Tz1 are connected, every Tzi is connected with Tz(i+1), and Tzj is connected with Tzq. From the transitivity of connectedness, it follows that Tzp is connected with Tzq. That is, all non-faulty nodes in hexagon p are connected with all non-faulty nodes in q. Since p and q are arbitrary hexagons in W, it follows that all the nodes in W are connected.

Theorem 1 provides the following insight into connectivity analysis of a WSN: for appropriate values of z, a level-z polyhex has fewer nodes than the entire region W. In fact, a level-z polyhex could have orders of magnitude fewer nodes than W. Consequently, the analysis of connectedness of a level-z polyhex is simpler and easier than that of the entire region W. Using Theorem 1, we can leverage such an analysis of a level-z polyhex to derive a lower bound on the connectivity probability of W. This is explored next.

6  On Fault Tolerance of WSN Regions

We are now ready to derive a lower bound on the connectivity probability of an arbitrarily-shaped WSN region. Let W be a WSN region with a node density of D nodes per hexagon such that the region is approximated by a patch of x level-z polyhexes that constitute a set Λ. Let each node in the region fail independently with probability ρ. Let ConnW denote the event that all the non-faulty nodes in the region W are connected. Let Conn(T,z,side) denote the event that a level-z polyhex T is connected and each 'side' of T has at least sidelen(z)/2 non-empty hexagons. We know that if W is level-z connected, then all the non-faulty nodes in W are connected. Also, W is level-z connected if: ∀T ∈ Λ :: Conn(T,z,side). Therefore, the probability that W is connected is bounded by: Pr[ConnW] ≥ (Pr[Conn(T,z,side)])^x. Thus, in order to find a lower bound on Pr[ConnW], we have to find a lower bound on (Pr[Conn(T,z,side)])^x.

Lemma 2. In a level-z polyhex T with a node density of D nodes per hexagon, suppose each node fails independently with probability ρ. Then the probability


that T is connected and each 'side' of T has at least sidelen(z)/2 non-empty hexagons is given by Pr[Conn(T,z,side)] = Σ_{i=0}^{size(z)} Nz,i (1 − ρ^D)^(size(z)−i) ρ^(D×i), where Nz,i is the number of ways in which we can have i empty hexagons and size(z) − i non-empty hexagons in a level-z polyhex such that the level-z polyhex is connected and each 'side' of the level-z polyhex has at least sidelen(z)/2 non-empty hexagons.

Proof. Fix i hexagons in T to be empty such that T is connected and each 'side' of T has at least sidelen(z)/2 non-empty hexagons. Since nodes fail independently with probability ρ, and there are D nodes per hexagon, the probability that a hexagon is empty is ρ^D. Therefore, the probability that exactly these i hexagons are empty in T is (1 − ρ^D)^(size(z)−i) ρ^(D×i). By assumption, there are Nz,i ways to fix i hexagons to be empty. Therefore, the probability that T is connected and each 'side' of T has at least sidelen(z)/2 non-empty hexagons despite i empty hexagons is Nz,i (1 − ρ^D)^(size(z)−i) ρ^(D×i). However, note that i (the number of empty hexagons) can be anything from 0 to size(z). Therefore, Pr[Conn(T,z,side)] is given by Σ_{i=0}^{size(z)} Nz,i (1 − ρ^D)^(size(z)−i) ρ^(D×i).

Given the probability of Conn(T,z,side), we can now establish a lower bound on the probability that the region W is connected.

Theorem 3. Suppose each node in a WSN region W fails independently with probability ρ, W has a node density of D nodes per hexagon, and W is tiled by a patch of x level-z polyhexes. Then the probability that all non-faulty nodes in W are connected is at least (Pr[Conn(T,z,side)])^x.

Proof. There are x level-z polyhexes in W. Note that if W is level-z connected, then all non-faulty nodes in W are connected. However, observe that W is level-z connected if each such level-z polyhex is connected and each 'side' of each such level-z polyhex has at least sidelen(z)/2 non-empty hexagons.
Recall from Lemma 2 that the probability of such an event for each polyhex is given by Pr[Conn(T,z,side)]. Since there are x such level-z polyhexes, and the failure probability of nodes (and hence of disjoint level-z polyhexes) is independent, it follows that the probability of W being connected is at least (Pr[Conn(T,z,side)])^x.

Note that the lower bound we have established depends on the function Nz,i defined in Lemma 2. Unfortunately, to the best of our knowledge, there is no known algorithm that computes Nz,i in a reasonable amount of time. Since this is a potentially infeasible approach for large WSNs with millions of nodes, we provide an alternate lower bound for Pr[Conn(T,z,side)].

Lemma 4. The value of Pr[Conn(T,z,side)] from Lemma 2 is bounded below by: Pr[Conn(T,z,side)] ≥ (Pr[Conn(T,z−1,side)])^7 + (Pr[Conn(T,z−1,side)])^6 × ρ^(D×size(z−1)), where Pr[Conn(T,1,side)] = 1 − ρ^D.

Proof. Recall that a level-z polyhex consists of seven level-(z−1) polyhexes: one internal level-(z−1) polyhex and six outer level-(z−1) polyhexes. Observe


that a level-z polyhex satisfies Conn(T,z,side) if either (a) all seven level-(z−1) polyhexes satisfy Conn(T,z−1,side), or (b) the internal level-(z−1) polyhex is empty and the six outer level-(z−1) polyhexes satisfy Conn(T,z−1,side). From Lemma 2 we know that the probability of a level-(z−1) polyhex satisfying Conn(T,z−1,side) is Pr[Conn(T,z−1,side)], and the probability of a level-(z−1) polyhex being empty is ρ^(D×size(z−1)). For a level-1 polyhex (which is a regular hexagon tile), the probability that the hexagon is not empty is 1 − ρ^D. Therefore, for z > 1, the probability that case (a) or (b) is satisfied is (Pr[Conn(T,z−1,side)])^7 + (Pr[Conn(T,z−1,side)])^6 × ρ^(D×size(z−1)). Therefore, Pr[Conn(T,z,side)] ≥ (Pr[Conn(T,z−1,side)])^7 + (Pr[Conn(T,z−1,side)])^6 × ρ^(D×size(z−1)), where Pr[Conn(T,1,side)] = 1 − ρ^D.

Analyzing the connectivity probability of WSN regions that are level-z connected for large z can be simplified by invoking Lemma 4, reducing the computation to smaller values of z for which Pr[Conn(T,z,side)] can be computed (by brute force) fairly quickly.
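The combination of Lemma 2 (exact, for small z) and the recursion of Lemma 4 can be sketched as follows. This is our own illustration of the computation, not the authors' code: the N_{3,i} values are taken from Table 1, N_{3,0} = 1 is our assumption (the single fully non-empty configuration), and truncating the Lemma 2 sum at i = 9 (the last tabulated value) only lowers the lower bound, since all terms are non-negative.

```python
# N_{3,i} from Table 1 for i = 1..9; N_{3,0} = 1 is assumed (all hexagons non-empty).
N3 = [1, 49, 1176, 18346, 208372, 1830282, 12899198,
      74729943, 361856172, 1481515771]

def size(z):
    """Number of hexagons in a level-z polyhex."""
    return 7 ** (z - 1)

def p_conn(z, rho, D):
    """Lower bound on Pr[Conn_(T,z,side)]: exact Lemma 2 sum at z = 3
    (truncated at i = 9, which only lowers the bound), Lemma 4 recursion
    elsewhere. rho: node failure probability, D: nodes per hexagon."""
    q = rho ** D  # probability that a hexagon is empty
    if z == 1:
        return 1 - q
    if z == 3:
        return sum(N3[i] * (1 - q) ** (size(3) - i) * q ** i
                   for i in range(len(N3)))
    p = p_conn(z - 1, rho, D)
    return p ** 7 + p ** 6 * q ** size(z - 1)

# Bound for a single level-7 polyhex with D = 10 and 57% node failure,
# the regime of the last row of Table 2.
print(p_conn(7, 0.57, 10))
```

The bound degrades with each application of the recursion (roughly, the z = 3 deficit is raised to the power 7 per level), which is exactly the looseness discussed in Sect. 7.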

7  Discussion

Choosing the size of the hexagon. For the results from the previous section to be of practical use, it is important to choose the size of the hexagons in our system model carefully. On the one hand, choosing very large hexagons could violate the system model assumption that nodes can communicate with nodes in neighboring hexagons; on the other hand, choosing small hexagons could result in poor lower bounds and thus in over-engineered WSNs that incur high costs with incommensurate benefits. If we make no assumptions about the locations of nodes within hexagons, then the length l of the sides of a hexagon must be at most R/√13 to ensure connectivity between non-faulty nodes in neighboring hexagons. However, if the nodes are 'evenly' placed within each hexagon, then l can be as large as R/2 while still ensuring connectivity between neighboring hexagons. In both cases, the requirement is that the distance between two non-faulty nodes in neighboring hexagons is at most R.

Computing Nz,i from Lemma 2. The function Nz,i does not have a closed-form solution; it must be computed through exhaustive enumeration. We computed Nz,i for some useful values of z and i and included them in Table 1. Using these values, we applied Theorem 3 and Lemma 4 to sensor networks of different sizes, node densities, and node failure probabilities. The results are presented in Table 2. Next, we demonstrate how to interpret the entries in these tables through an illustrative example.

Practicality. Our results can be utilized in the following two practical scenarios. (1) Given an existing WSN with known node failure probability, node density, and area of coverage, we can estimate the probability of connectivity of the entire network. First, we decide on the size of a hexagon as discussed previously,

Table 1. Computed Values of Nz,i

  z  |  i  |  Nz,i
-----+-----+------------------
 k>2 |  1  | size(k) = 7^(k−1)
  3  |  2  | 1176
  3  |  3  | 18346
  3  |  4  | 208372
  3  |  5  | 1830282
  3  |  6  | 12899198
  3  |  7  | 74729943
  3  |  8  | 361856172
  3  |  9  | 1481515771
  4  |  2  | 58653
  4  |  3  | 6666849
  5  |  2  | 2881200

Table 2. Various values of node failure probability ρ, node density D, and level-z polyhex that yield network connectivity probability exceeding 99%

  z  | Node density D | No. of Nodes | Node failure prob. ρ
-----+----------------+--------------+----------------------
  2  |        3       |          21  |  35%
  2  |        5       |          35  |  53%
  2  |       10       |          70  |  70%
  3  |        3       |         147  |  37%
  3  |        5       |         245  |  50%
  3  |       10       |         490  |  70%
  4  |        3       |        1029  |  29%
  4  |        5       |        1715  |  47%
  4  |       10       |        3430  |  67%
  5  |        3       |        7203  |  24%
  5  |        5       |       12005  |  40%
  5  |       10       |       24010  |  63%
  6  |        3       |       50421  |  19%
  6  |        5       |       84035  |  36%
  6  |       10       |       24010  |  63%
  7  |        3       |      352947  |  15%
  7  |        5       |      588245  |  31%
  7  |       10       |     1176490  |  57%
and then we consider level-z polyhexes that cover the region. Next, we apply Theorem 3 and Lemma 4 to compute the probability of connectivity of the network for the given values of ρ, D and z, and the precomputed values of Nz,i in Table 1. (2) The results in this paper can be used to design a network with a speciﬁed probability of connectivity. In this case, we decide on a hexagon size that best suits the purposes of the sensor network and determine the level of the polyhex(es) needed to cover the desired area. As an example, consider a 200 sq. km region (approximately circular, so that there are no ‘bottle neck’ regions) that needs to be covered by a sensor network with a 99% connectivity probability. Let the communication radius of each sensor be 50 meters. The average-case value of the length l of the side of the hexagon is 25 meters, and the 200 sq. km region is tiled by a single level-7 polyhex. From Table 2, we see that if the network consists of 3 nodes per hexagon, then the region will require about 352947 nodes with a failure probability of 15% (85% reliability). However, if the node redundancy is increased to 5 nodes per hexagon, then the region will require about 588245 nodes with a failure probability of 31% (69% reliability). If the node density is


increased further to 10 nodes per hexagon, then the region will require about 1176490 nodes with a failure probability of 57% (43% reliability).

On the lower bounds. An important observation is that these values for node reliability are lower bounds, but they are definitely not tight bounds. This is largely because obtaining tighter lower bounds requires computing the probability of network connectivity from Theorem 3, which in turn requires computing the values of N_{z,i} for all values of i ranging from 1 to z; this is prohibitively expensive for z exceeding 3. Consequently, we are forced to use the recursive function in Lemma 4 to compute the network connectivity for larger networks, which reduces the accuracy of the lower bound significantly. A side effect of this error is that, in Table 2, for a given D, ρ decreases as z increases. If we were to invest the time and computing resources to compute N_{z,i} for higher values of z (5, 6, 7, and greater), then the computed values of ρ in Table 2 would be significantly larger.
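The hexagon-sizing and polyhex-level arithmetic in the worked example above can be sketched as follows. This is an illustrative helper, not from the paper: it assumes the side-length bounds l ≤ R/√13 (arbitrary placement) and l ≤ R/2 ("even" placement) stated earlier, and that a level-z polyhex contains 7^{z-1} hexagons (consistent with the node counts in Table 2):

```python
import math

def max_side(radius, even_placement=False):
    # Largest hexagon side l keeping neighboring hexagons within range R:
    # R / sqrt(13) for arbitrary node placement, R / 2 for "even" placement.
    return radius / 2 if even_placement else radius / math.sqrt(13)

def hexagons_in_level(z):
    # Assumed from Table 2's node counts: a level-z polyhex has 7**(z-1) hexagons.
    return 7 ** (z - 1)

def polyhex_area_km2(z, side_m):
    hex_area = 3 * math.sqrt(3) / 2 * side_m ** 2  # regular hexagon area, m^2
    return hexagons_in_level(z) * hex_area / 1e6

side = max_side(50, even_placement=True)    # R = 50 m gives l = 25 m
print(side)                                 # 25.0
print(round(polyhex_area_km2(7, side)))     # ~191 km^2, close to the 200 km^2 region
print(3 * hexagons_in_level(7))             # 352947 nodes at density D = 3
```

With these assumptions, a single level-7 polyhex of 25 m hexagons covers roughly the 200 sq. km region, and the node counts match the z = 7 rows of Table 2.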

References

1. Akkaya, K., Younis, M.: A survey on routing protocols for wireless sensor networks. Ad Hoc Networks 3(3), 325–349 (2005), http://dx.doi.org/10.1016/j.adhoc.2003.09.010
2. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: a survey. Computer Networks 38(4), 393–422 (2002), http://dx.doi.org/10.1016/S1389-1286(01)00302-4
3. Ammari, H.M., Das, S.K.: Fault tolerance measures for large-scale wireless sensor networks. ACM Transactions on Autonomous and Adaptive Systems 4(1), 1–28 (2009), http://doi.acm.org/10.1145/1462187.1462189
4. Bai, X., Kumar, S., Xuan, D., Yun, Z., Lai, T.H.: Deploying wireless sensors to achieve both coverage and connectivity. In: MobiHoc 2006: Proceedings of the 7th ACM International Symposium on Mobile Ad Hoc Networking and Computing, pp. 131–142. ACM, New York (2006), http://doi.acm.org/10.1145/1132905.1132921
5. Bhandari, V., Vaidya, N.H.: Reliable broadcast in wireless networks with probabilistic failures. In: Proceedings of the 26th IEEE International Conference on Computer Communications, pp. 715–723 (2007), http://dx.doi.org/10.1109/INFCOM.2007.89
6. Cai, H., Jia, X., Sha, M.: Critical sensor density for partial connectivity in large area wireless sensor networks. In: Proceedings of the 29th IEEE International Conference on Computer Communications, pp. 1–5 (2010), http://dx.doi.org/10.1109/INFCOM.2010.5462211
7. Chen, J., Kanj, I.A., Wang, G.: Hypercube network fault tolerance: A probabilistic approach. Journal of Interconnection Networks 6(1), 17–34 (2005), http://dx.doi.org/10.1142/S0219265905001290
8. Chen, J., Wang, G., Lin, C., Wang, T., Wang, G.: Probabilistic analysis on mesh network fault tolerance. Journal of Parallel and Distributed Computing 67, 100–110 (2007), http://dx.doi.org/10.1016/j.jpdc.2006.09.002
9. Chen, J., Wang, G., Chen, S.: Locally subcube-connected hypercube networks: theoretical analysis and experimental results. IEEE Transactions on Computers 51(5), 530–540 (2002), http://dx.doi.org/10.1109/TC.2002.1004592


10. Kikiras, P., Avaritsiotis, J.: Unattended ground sensor network for force protection. Journal of Battlefield Technology 7(3), 29–34 (2004)
11. Kim, D., Hsin, C.F., Liu, M.: Asymptotic connectivity of low duty-cycled wireless sensor networks. In: Military Communications Conference, pp. 2441–2447 (2005), http://dx.doi.org/10.1109/MILCOM.2005.1606034
12. Li, M., Yang, B.: A survey on topology issues in wireless sensor network. In: Proceedings of the 2006 International Conference on Wireless Networks, pp. 503–509 (2006)
13. Meguerdichian, S., Koushanfar, F., Potkonjak, M., Srivastava, M.: Coverage problems in wireless ad-hoc sensor networks. In: Proceedings of the Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies, pp. 1380–1387 (2001), http://dx.doi.org/10.1109/INFCOM.2001.916633
14. Rhoads, G.C.: Planar tilings by polyominoes, polyhexes, and polyiamonds. Journal of Computational and Applied Mathematics 174(2), 329–353 (2005), http://dx.doi.org/10.1016/j.cam.2004.05.002
15. Sastry, S., Radeva, T., Chen, J.: Reliable networks with unreliable sensors. Tech. Rep. TAMU-CSE-TR-2010-7-4, Texas A&M University (2010), http://www.cse.tamu.edu/academics/tr/2010-7-4
16. Shakkottai, S., Srikant, R., Shroff, N.B.: Unreliable sensor grids: coverage, connectivity and diameter. Ad Hoc Networks 3(6), 702–716 (2005), http://dx.doi.org/10.1016/j.adhoc.2004.02.001
17. Su, H., Wang, Y., Shen, Z.: The condition for probabilistic connectivity in wireless sensor networks. In: Proceedings of the Third International Conference on Pervasive Computing and Applications, pp. 78–82 (2008), http://dx.doi.org/10.1109/ICPCA.2008.4783653
18. Szewczyk, R., Polastre, J., Mainwaring, A., Culler, D.: Lessons from a sensor network expedition. In: Proceedings of the First European Workshop on Wireless Sensor Networks, pp. 307–322 (2004), http://dx.doi.org/10.1007/978-3-540-24606-0_21
19. Vincent, P., Tummala, M., McEachen, J.: Connectivity in sensor networks. In: Proceedings of the Fortieth Hawaii International Conference on System Sciences, p. 293c (2007), http://dx.doi.org/10.1109/HICSS.2007.145
20. Wang, X., Xing, G., Zhang, Y., Lu, C., Pless, R., Gill, C.: Integrated coverage and connectivity configuration in wireless sensor networks. In: SenSys 2003: Proceedings of the 1st International Conference on Embedded Networked Sensor Systems, pp. 28–39 (2003), http://doi.acm.org/10.1145/958491.958496
21. Werner-Allen, G., Lorincz, K., Welsh, M., Marcillo, O., Johnson, J., Ruiz, M., Lees, J.: Deploying a wireless sensor network on an active volcano. IEEE Internet Computing 10, 18–25 (2006), http://dx.doi.org/10.1109/MIC.2006.26
22. Zhang, H., Hou, J.: Maintaining sensing coverage and connectivity in large sensor networks. Ad Hoc & Sensor Wireless Networks 1(1-2) (2005), http://oldcitypublishing.com/AHSWN/AHSWNabstracts/AHSWN1.1-2abstracts/AHSWNv1n1-2p89-124Zhang.html

Energy Aware Fault Tolerant Routing in Two-Tiered Sensor Networks Ataul Bari, Arunita Jaekel, and Subir Bandyopadhyay School of Computer Science, University of Windsor 401 Sunset Avenue, Windsor, ON, N9B 3P4, Canada {bari1,arunita,subir}@uwindsor.ca

Abstract. Design of fault-tolerant sensor networks is receiving increasing attention in recent times. In this paper we point out that simply ensuring that a sensor network can tolerate fault(s) is not sufficient. It is also important to ensure that the network remains viable for the longest possible time, even if a fault occurs. We have focussed on the problem of designing 2-tier sensor networks using relay nodes as cluster heads. Our objective is to ensure that the network has a communication strategy that extends, as much as possible, the period for which the network remains operational when there is a single relay node failure. We have described an Integer Linear Program (ILP) formulation and have used this formulation to study the effect of single faults. We have compared our results to those obtained using standard routing protocols: the Minimum Transmission Energy Model (MTEM) and the Minimum Hop Routing Model (MHRM). We have shown that our routing algorithm performs significantly better compared to the MTEM and the MHRM.

1 Introduction

A wireless sensor network (WSN) is a network of battery-powered, multi-functional devices, known as sensor nodes. Each sensor node typically consists of a micro-controller, a limited amount of memory, sensing device(s), and wireless transceiver(s) [2]. A sensor network performs its tasks through the collaborative efforts of a large number of sensor nodes that are densely deployed within the sensing field [2], [3], [4]. Data from each node in a sensor network are gathered at a central entity, called the base station [2], [5]. Sensor nodes are powered by batteries, and recharging or replacing the batteries is usually not feasible due to economic reasons and/or environmental constraints [2]. Therefore, it is extremely important to design communication protocols and algorithms that are energy efficient, so that the duration of useful operation, often called the lifetime [6] of the network, can be extended as much as possible [3], [4], [5], [7], [24]. The lifetime of a sensor network is defined as the time interval from the inception of the operation of the network to the time when a number of critical nodes "die" [5], [6].

A. Jaekel and S. Bandyopadhyay have been supported by discovery grants from the Natural Sciences and Engineering Research Council of Canada.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 293–302, 2011. © Springer-Verlag Berlin Heidelberg 2011


Recently, some special nodes, called relay nodes, have been proposed for sensor networks [8]–[17]. Relay nodes, provisioned with higher power, have been proposed as cluster heads in two-tiered sensor networks [10], [12], [17], [18], [19], where each relay node is responsible for collecting data from the sensor nodes belonging to its own cluster and for forwarding the collected data to the base station. The model for transmission of data from a relay node to the base station may be categorized either as the single-hop data transmission model (SHDTM) or the multi-hop data transmission model (MHDTM) [15], [16], [17], [20]. In MHDTM, each relay node, in general, uses some intermediate relay node(s) to forward the data to the base station. The MHDTM is considered in this paper since it is particularly suitable for larger networks. In the non-flow-splitting model, a relay node is not allowed to split its traffic: it forwards all its data to a single relay node (or to the base station), so there is always a single path from each relay node to the base station. This is a more appropriate model for 2-tier networks, with important technological advantages [8], [15], and has been used in this paper. In the periodic data gathering model [8] considered in this paper, each period of data gathering (starting from sensing until all data reach the base station) is referred to as a round [20]. Although provisioned with higher power, the relay nodes are also battery operated and hence are power constrained [16], [17]. In 2-tier networks, the lifetime is primarily determined by the duration for which the relay nodes are operational [10], [19]. It is therefore very important to allocate the sensor nodes to the relay nodes appropriately, and to find an efficient communication scheme that minimizes the energy dissipation of the relay nodes.
We have measured the lifetime of a 2-tier network, following the N-of-N metric [6], by the number of rounds the network operates from the start until the first relay node depletes its energy completely. In a 2-tier network using the N-of-N metric, assuming equal initial energy provisioning in each relay node, the lifetime of the network is given by the ratio of the initial energy to the maximum energy dissipated by any relay node in a round. Thus, maximizing the lifetime is equivalent to minimizing the maximum energy dissipated by any relay node in a round [8], [19]. In the first-order radio model [5], [6] used here, energy is dissipated at a rate of α1/bit (α2/bit) for receiving (transmitting) data. The transmit amplifier also dissipates β units of energy to transmit one bit of data over unit distance. The energy dissipated to receive b bits (transmit b bits over a distance d) is given by E_Rx = α1·b (E_Tx(b, d) = α2·b + β·b·d^q), where q is the path loss exponent, 2 ≤ q ≤ 4, for free space using short- to medium-range radio communication. Due to the nature of the wireless medium, and based on the territory of the deployment, nodes in a sensor network are prone to faults [23]. A sensor network should ideally be resilient with respect to faults. In 2-tier networks, the failure of a single relay node may have a significant effect on the overall lifetime of the network [8]. In a fault-free environment, it is sufficient that each sensor node is able to send the data it collects to at least one relay node. To provide fault tolerance, we need a placement strategy that allows some redundancy of the relay nodes, so that, in the event of any failure(s) in relay node(s), each sensor


node belonging to the cluster of a failed relay node is able to send its data to another fault-free relay node, and data from all fault-free relay nodes can still reach the base station successfully. In [22], the authors have proposed an approximation algorithm to achieve single-connectivity and double-connectivity. In [25], the authors have presented a two-step approximation algorithm to obtain a 1-connected and a 2-connected network. In [17], a 2-tier architecture is considered and an optimal placement of relay nodes for each cell is computed to allow 2-connectivity. Even though a significant amount of work has focussed on extending the lifetime of a fault-free sensor network, including two-tier networks [5], [10], [15], [19], the primary objective of research on fault-tolerant sensor networks has been to ensure k-connectivity of the network, for some pre-specified k > 1. Such a design ensures that the network can handle k − 1 faults, since there exists at least one route of fault-free relay nodes from every sensor node to the base station. However, when a fault occurs, some relay nodes will be communicating more data compared to the fault-free case, and it is quite likely that the fault will significantly affect the lifetime of the network. To the best of our knowledge, no research has attempted to minimize the effect of faults on the lifetime of a sensor network. Our approach is different from other research in this area since our objective is to guarantee that the network will be operational for the maximum possible period of time, even in the presence of faults. We have confined our work to the most likely scenario of single relay node failure and have shown how to design the network to maximize the lifetime of the network when a single relay node becomes faulty.
Obviously, to handle single faults in any relay node, all approaches, including ours, must design a network which guarantees that each sensor node has a fault-free path to the base station avoiding the faulty relay node. In our approach, for any case of single relay node failure, we select the paths from all sensor nodes to the base station in such a way that a) we avoid the faulty relay node and b) the selected paths are such that the lifetime of the network is guaranteed to be as high as possible.
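The first-order radio model and the N-of-N lifetime metric described above can be sketched in a few lines. The constants below are the ones given later in Section 3; the per-round relay load in the example is made up for illustration:

```python
# First-order radio model, with the constants used in the experiments.
ALPHA1 = 50e-9    # J/bit, reception
ALPHA2 = 50e-9    # J/bit, transmission electronics
BETA = 100e-12    # J/bit/m^2, transmit amplifier
Q = 2             # path loss exponent (free space)

def e_rx(bits):
    """Energy to receive `bits` bits: alpha1 * b."""
    return ALPHA1 * bits

def e_tx(bits, dist_m):
    """Energy to transmit `bits` bits over `dist_m` meters: alpha2*b + beta*b*d^q."""
    return ALPHA2 * bits + BETA * bits * dist_m ** Q

def lifetime_rounds(initial_j, worst_round_j):
    """N-of-N lifetime: rounds until the most-loaded relay exhausts its battery."""
    return int(initial_j / worst_round_j)

# Hypothetical relay that receives and forwards 2000 bits over 100 m per round:
round_energy = e_rx(2000) + e_tx(2000, 100)
print(lifetime_rounds(5.0, round_energy))   # 2272 rounds with 5 J initial energy
```

Maximizing the lifetime then reduces, as stated above, to minimizing the per-round energy of the most heavily loaded relay node.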

2 Fault Tolerant Routing Design

2.1 Network Model

We consider a two-tiered wireless sensor network model with n relay nodes and a base station. All data from the network are collected at the base station. For convenience, we assign labels 1, 2, ..., n to the relay nodes and label n + 1 to the base station. If a sensor node i can send its data to relay node j, we say that j covers i. We assume that relay nodes are placed in such a way that each sensor node is covered by at least two relay nodes. This ensures that when a relay node fails, all sensor nodes in its cluster can be reassigned to other cluster(s), and the load (in terms of the number of bits) generated in the cluster of the failed node is redistributed among the neighboring relay nodes. A number of recent papers have addressed the issue of fault-tolerant placement of relay nodes to implement double (or multiple) coverage of each sensor node [17], [21], [22],


[25]. Such fault-tolerant placement schemes can also indicate the "backup" relay node for each sensor node, to be used when its original cluster head fails. We assume that the initial placement of the relay nodes has been done according to one of these existing approaches, so that the necessary level of coverage is achieved. Based on the average amount of data generated by each cluster and the location of the relay nodes, the Integer Linear Program (ILP) given below calculates the optimal routing schedule such that the worst-case lifetime for any single-fault scenario is maximized. In the worst-case situation, a relay node fails from the very beginning. We therefore consider all single relay node failures occurring when the network starts operating, and determine which failure has the worst effect on the lifetime, even if an optimal routing schedule is followed to handle the failure. This calculation is performed offline, so it is reasonable to use an ILP to compute the most energy-efficient routing schedule. The backup routing schedule for each possible fault can be stored either at the individual relay nodes or at the base station. In the second option, the base station, which is not power constrained, can transmit the updated schedule to the relay nodes when needed. For each relay node, the energy required for receiving the updated schedules is negligible compared to the energy required for data transmission, and hence is not expected to affect the overall lifetime significantly. In our model, applications are assumed to have long idle times and to be able to tolerate some latency [26], [27]. The nodes sleep during the idle time, and transmit/receive when they are awake. Hence, energy is dissipated by a node only while it is transmitting or receiving. We further assume that both sensor and relay nodes communicate through an ideal shared medium.
As in [12], [13], we assume that communication between nodes, including the sleep/wake scheduling and the underlying synchronization protocol, is handled by appropriate state-of-the-art MAC protocols, such as those proposed in [26], [27], [28], [29].

2.2 Notation Used

In our formulation we are given the following data as input:

• α1 (α2): Energy coefficient for reception (transmission).
• β: Energy coefficient for the amplifier.
• q: Path loss exponent.
• b_i: Number of bits generated per round by the sensor nodes belonging to cluster i, in the fault-free case.
• b_i^k: Number of bits per round, originally from cluster k, that are reassigned to cluster i when relay node k fails. Clearly, Σ_{i=1, i≠k}^{n} b_i^k = b_k.
• n: Total number of relay nodes.
• n + 1: Index of the base station.
• C: A large constant, greater than the total number of bits received by the base station in a round.
• d_max: Transmission range of each relay node.
• d_{i,j}: Euclidean distance from node i to node j.

We also define the following variables:

• X_{i,j}^k: A binary variable, with X_{i,j}^k = 1 if node i selects j to send its data when relay node k fails, and X_{i,j}^k = 0 otherwise.
• T_i^k: Number of bits transmitted by relay node i when relay node k fails.
• G_i^k: Amount of energy needed by the amplifier in relay node i to send its data to the next hop on its path to the base station when relay node k fails.
• R_i^k: Number of bits received by relay node i from other relay nodes when relay node k fails.
• f_{i,j}^k: Amount of flow from relay node i to relay node j when relay node k fails.
• F_max: The total energy spent per round by the relay node that is being depleted at the fastest rate, when any one relay node fails.

2.3 ILP Formulation for Fault Tolerant Routing (ILP-FTR)

$\text{Minimize } F_{\max}$  (1)

Subject to:

a) The range of transmission from a relay node is d_max:

$X_{i,j}^k \cdot d_{i,j} \le d_{\max} \quad \forall i,k,\ 1 \le i,k \le n;\ k \ne i,j;\ i \ne j;\ \forall j,\ 1 \le j \le n+1$  (2)

b) Ensure that the non-flow-splitting model is followed, so that all data from relay node i are forwarded to only one other node j:

$\sum_{j=1;\, j \ne i,k}^{n+1} X_{i,j}^k = 1 \quad \forall i,k,\ 1 \le i,k \le n;\ k \ne i$  (3)

c) Only one outgoing link from relay node i can have non-zero data flow:

$f_{i,j}^k \le C \cdot X_{i,j}^k \quad \forall i,k,\ 1 \le i,k \le n;\ k \ne i,j;\ i \ne j;\ \forall j,\ 1 \le j \le n+1;\ j \ne k$  (4)

d) Satisfy flow constraints:

$\sum_{j=1;\, j \ne i,k}^{n+1} f_{i,j}^k - \sum_{j=1;\, j \ne i,k}^{n} f_{j,i}^k = b_i + b_i^k \quad \forall i,k,\ 1 \le i,k \le n;\ k \ne i$  (5)

e) Calculate the total number of bits transmitted by relay node i:

$T_i^k = \sum_{j=1;\, j \ne i,k}^{n+1} f_{i,j}^k \quad \forall i,k,\ 1 \le i,k \le n;\ k \ne i$  (6)

f) Calculate the amplifier energy dissipated by relay node i to transmit to the next hop:

$G_i^k = \beta \sum_{j=1;\, j \ne i,k}^{n+1} f_{i,j}^k \cdot (d_{i,j})^q \quad \forall i,k,\ 1 \le i,k \le n;\ k \ne i$  (7)

g) Calculate the number of bits received by node i from other relay node(s):

$R_i^k = \sum_{j=1;\, j \ne i,k}^{n} f_{j,i}^k \quad \forall i,k,\ 1 \le i,k \le n;\ k \ne i$  (8)

h) The energy dissipated per round by relay node i, when node k has failed, must not exceed F_max:

$\alpha_1 (R_i^k + b_i^k) + \alpha_2 T_i^k + G_i^k \le F_{\max} \quad \forall i,k,\ 1 \le i,k \le n;\ k \ne i$  (9)

2.4 Justification of the ILP Equations

Equation (1) is the objective function for the formulation; it minimizes the maximum energy dissipated by any individual relay node in one round of data gathering, over all possible fault scenarios. Constraint (2) ensures that a relay node i cannot transmit to a node j if j is outside the transmission range of node i. Constraints (3) and (4) indicate that, for any given fault (e.g., a fault in node k), a non-faulty relay node i can only transmit data to exactly one other (non-faulty) node j. Constraint (5) is the standard flow constraint [1], used to find a route to the base station for the data originating in each cluster when node k fails. We note that the total data (number of bits) generated in cluster i when node k fails is given by the number of bits b_i originally generated in cluster i, plus the additional number of bits b_i^k reassigned from cluster k to cluster i due to the failure of relay node k. Constraint (6) specifies the total number of bits T_i^k transmitted by relay node i when node k has failed. Constraint (7) calculates G_i^k, the total amplifier energy needed at relay node i when node k fails, by directly applying the first-order radio model [5], [6]. Constraint (8) calculates the total number of bits R_i^k received at relay node i from other relay node(s) when node k fails. Finally, (9) gives the total energy dissipated by each relay node when node k fails. The total energy dissipated by a relay node, for any possible fault scenario (i.e., any value of k), cannot exceed F_max, which the formulation attempts to minimize.
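Because the non-flow-splitting constraints (3) and (4) reduce a routing for each fault scenario k to a choice of exactly one parent per surviving relay, the optimum that ILP-FTR computes can be checked by exhaustive enumeration on tiny instances. The sketch below does this for a made-up 3-relay topology (positions, loads, and the even split of the failed cluster's bits are all illustrative; the paper solves the ILP with CPLEX, and the range constraint (2) is inactive here since all distances are well under 200 m):

```python
import itertools
import math

A1 = A2 = 50e-9        # J/bit: reception / transmission electronics
BETA = 100e-12         # J/bit/m^2: amplifier
Q = 2                  # path loss exponent

pos = {1: (0, 0), 2: (60, 0), 3: (0, 60), 4: (60, 60)}  # node 4 = base station
b = {1: 1000, 2: 1000, 3: 1000}                         # bits/round per cluster
n, base = 3, 4

def dist(i, j):
    return math.dist(pos[i], pos[j])

def reaches_base(i, parent):
    # A parent map is a valid routing only if every relay reaches the base.
    for _ in range(len(parent) + 1):
        if i == base:
            return True
        i = parent[i]
    return False

def best_worst_energy(k):
    """Min over routings of the max per-relay round energy, node k faulty."""
    live = [i for i in range(1, n + 1) if i != k]
    reassigned = {i: b[k] / (n - 1) for i in live}   # b_i^k, split evenly
    load = {i: b[i] + reassigned[i] for i in live}   # b_i + b_i^k, as in (5)
    best = float("inf")
    choices = [[j for j in live + [base] if j != i] for i in live]
    for parents in itertools.product(*choices):
        parent = dict(zip(live, parents))
        if not all(reaches_base(i, parent) for i in live):
            continue                                 # skip cyclic assignments
        tx = {i: 0.0 for i in live}                  # T_i^k
        rx = {i: 0.0 for i in live}                  # R_i^k
        for src in live:                             # push src's load to base
            node = src
            while node != base:
                tx[node] += load[src]
                if parent[node] != base:
                    rx[parent[node]] += load[src]
                node = parent[node]
        worst = max(A1 * (rx[i] + reassigned[i]) + A2 * tx[i]
                    + BETA * tx[i] * dist(i, parent[i]) ** Q   # as in (9)
                    for i in live)
        best = min(best, worst)
    return best

f_max = max(best_worst_energy(k) for k in range(1, n + 1))
print(f_max)   # worst single-fault scenario under optimal per-scenario routing
```

Since the constraints decouple across fault scenarios, the optimal F_max equals the maximum over k of the per-scenario optimum, which is what the enumeration computes.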

3 Experimental Results

In this section, we present the simulation results for our fault-tolerant routing scheme. We have considered a 240 × 240 m network area with seventeen relay nodes, and with sensor nodes randomly distributed in the area. The results were obtained using the ILOG CPLEX 9.1 solver [30]. For each fault scenario (i.e., a specified relay node becomes faulty), we have measured the achieved lifetime of the network by the number of rounds until the first fault-free relay node runs out of battery power. The lifetime in the presence of a fault can vary widely, depending on the location and load of the faulty node. When reporting the lifetime achieved in the presence of a fault, we have taken the worst-case value, i.e., the lowest lifetime obtained after considering all possible single node failures, with the node failure occurring immediately after the network has been deployed. For experimental purposes, we have considered a number of different sensor node distributions, with the number of sensor nodes in the network varying from 136 to 255. We have assumed that:

1. the communication energy dissipation is based on the first-order radio model described in Section 1;
2. the values of the constants are the same as in [5], so that (a) α1 = α2 = 50 nJ/bit, (b) β = 100 pJ/bit/m², and (c) the path-loss exponent q = 2;
3. the range of each sensor node is 80 m;
4. the range of each relay node is 200 m, as in [17]; and
5. the initial energy of each relay node is 5 J, as in [17].

We also assumed that a separate node placement and clustering scheme (as in [17], [21]) is used to ensure that each sensor and relay node has a valid path to the base station for all single fault scenarios, and to pre-assign the sensor nodes to clusters. Under these assumptions, we have compared the performance of our scheme with two existing well-known schemes that are widely used in sensor networks.
i) Minimum transmission energy model (MTEM) [5], where each node i transmits to its nearest neighbor j, such that node j is closer to the base station than node i. ii) Minimum hop routing model (MHRM) [12], [14], where each node ﬁnds a path to the base station that minimizes the number of hops. Figure 1 compares the obtained network lifetime using ILP-FTR, MTEM and MHRM schemes. As shown in the ﬁgure, our formulation substantially outperforms both MTEM and MHRM approaches, under any single relay node failure. Furthermore, the ILP guarantees the “best” solution (with respect to the objective being optimized). The results show that, under any single relay node failure, our method can typically achieve an improvement of more than 2.7 times the

300

A. Bari, A. Jaekel, and S. Bandyopadhyay

[Figure 1: bar chart; x-axis: number of sensor nodes (136, 170, 255); y-axis: lifetime in rounds (0–2500); series: MTEM, MHRM, ILP-FTR]

Fig. 1. Comparison of the lifetimes in rounds, obtained using the ILP-FTR, MTEM and MHRM on networks with different numbers of sensor nodes

[Figure 2: bar chart; x-axis: index of relay nodes (1–17); y-axis: lifetime in rounds (0–2500); series: MTEM, MHRM, ILP-FTR]

Fig. 2. Variation of the lifetimes in rounds, under the fault of different relay nodes, obtained using the ILP-FTR, MTEM and MHRM on a network with 170 sensor nodes

network lifetime, compared to MTEM, and 2.3 times the network lifetime compared to MHRM. Figure 2 shows how the network lifetime varies with the failure of a relay node under ILP-FTR, MTEM and MHRM schemes on a network with 170 sensor nodes. As the ﬁgure shows, our approach provides substantial improvement over the other approaches, in terms of the network lifetime, considering all failure scenarios. Using our approach, it is also possible to identify the relay node(s) that is(are) most critical, and possibly, provide some additional protection (e.g., deployment of back-up node(s)) to guarantee the lifetime. Finally, we note that MTEM appears to be much more vulnerable to ﬂuctuations in lifetime, depending on the particular node that failed, compared to the other two schemes.
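The two baseline next-hop rules compared above can be sketched as follows (node coordinates, the base-station index, and the radio range in the example are illustrative, not from the paper's test networks):

```python
import math
from collections import deque

def mtem_next_hop(i, pos, base):
    """MTEM: nearest neighbor of i that is closer to the base than i is."""
    d_i = math.dist(pos[i], pos[base])
    closer = [j for j in pos
              if j != i and math.dist(pos[j], pos[base]) < d_i]
    return min(closer, key=lambda j: math.dist(pos[i], pos[j]))

def mhrm_next_hops(pos, base, radio_range):
    """MHRM: parents on minimum-hop paths to the base, via BFS from the base."""
    parent, seen, queue = {}, {base}, deque([base])
    while queue:
        u = queue.popleft()
        for v in pos:
            if v not in seen and math.dist(pos[u], pos[v]) <= radio_range:
                seen.add(v)
                parent[v] = u
                queue.append(v)
    return parent

pos = {1: (0, 0), 2: (90, 0), 3: (180, 0)}      # node 3 = base station
print(mtem_next_hop(1, pos, 3))                 # 2: nearer than the base itself
print(mhrm_next_hops(pos, 3, radio_range=100))  # {2: 3, 1: 2}
```

Neither rule looks at residual energy or at which node carries the reassigned load after a failure, which is why a node on the greedy path toward the base can be drained quickly; ILP-FTR instead balances the worst per-round energy across all single-fault scenarios.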

4 Conclusions

In this paper we have addressed the problem of maximizing the lifetime of a two-tier sensor network, in the presence of faults. Although many papers have considered energy-aware routing for the fault-free case, and also proposed deploying redundant relay nodes for meeting connectivity and coverage requirements, we believe this is the ﬁrst paper to investigate energy-aware routing for diﬀerent fault-scenarios. Our approach optimizes the network lifetime that can be achieved, and provides the corresponding routing scheme to be followed to achieve this goal, for any single node fault. The simulation results show that the proposed approach can signiﬁcantly improve network lifetime, compared to standard schemes such as MTEM and MHRM.

References

1. Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows: Theory, Algorithms, and Applications. Prentice Hall, Englewood Cliffs (1993)
2. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: a survey. Computer Networks 38, 393–422 (2002)
3. Akkaya, K., Younis, M.: A survey on routing protocols for wireless sensor networks. Ad Hoc Networks 3(3), 325–349 (2005)
4. Chong, C.-Y., Kumar, S.P.: Sensor Networks: Evolution, Opportunities, and Challenges. Proceedings of the IEEE 91(8), 1247–1256 (2003)
5. Heinzelman, W., Chandrakasan, A., Balakrishnan, H.: Energy efficient communication protocol for wireless micro-sensor networks. In: 33rd HICSS, pp. 3005–3014 (2000)
6. Pan, J., Hou, Y.T., Cai, L., Shi, Y., Shen, S.X.: Topology Control for Wireless Sensor Networks. In: Proceedings of the International Conference on Mobile Computing and Networking, pp. 286–299 (2003)
7. Duarte-Melo, E.J., Liu, M.: Analysis of energy consumption and lifetime of heterogeneous wireless sensor networks. In: Proceedings of the IEEE Global Telecommunications Conference, vol. 1, pp. 21–25 (2002)
8. Bari, A.: Energy Aware Design Strategies for Heterogeneous Sensor Networks. PhD thesis, University of Windsor (2010)
9. Bari, A., Jaekel, A., Bandyopadhyay, S.: Integrated Clustering and Routing Strategies for Large Scale Sensor Networks. In: Akyildiz, I.F., Sivakumar, R., Ekici, E., de Oliveira, J.C., McNair, J. (eds.) NETWORKING 2007. LNCS, vol. 4479, pp. 143–154. Springer, Heidelberg (2007)
10. Bari, A., Jaekel, A., Bandyopadhyay, S.: Optimal Placement and Routing Strategies for Resilient Two-Tiered Sensor Networks. Wireless Communications and Mobile Computing 9(7), 920–937 (2008), doi:10.1002/wcm.639
11. Cheng, X., Du, D.-Z., Wang, L., Xu, B.B.: Relay Sensor Placement in Wireless Sensor Networks. Wireless Networks 14(3), 347–355 (2008)
12. Gupta, G., Younis, M.: Load-balanced clustering of wireless sensor networks. In: IEEE International Conference on Communications, vol. 3, pp. 1848–1852 (2003)
13. Gupta, G., Younis, M.: Fault-tolerant clustering of wireless sensor networks. In: IEEE WCNC, pp. 1579–1584 (2003)


14. Gupta, G., Younis, M.: Performance evaluation of load-balanced clustering of wireless sensor networks. In: International Conference on Telecommunications, vol. 2, pp. 1577–1583 (2003)
15. Hou, Y.T., Shi, Y., Pan, J., Midkiff, S.F.: Maximizing the Lifetime of Wireless Sensor Networks through Optimal Single-Session Flow Routing. IEEE Transactions on Mobile Computing 5(9), 1255–1266 (2006)
16. Hou, Y.T., Shi, Y., Sherali, H.D., Midkiff, S.F.: On Energy Provisioning and Relay Node Placement for Wireless Sensor Networks. In: IEEE International Conference on Sensor and Ad Hoc Communications and Networks (SECON), vol. 32 (2005)
17. Tang, J., Hao, B., Sen, A.: Relay node placement in large scale wireless sensor networks. Computer Communications 29(4), 490–501 (2006)
18. Bari, A., Jaekel, A., Bandyopadhyay, S.: Clustering Strategies for Improving the Lifetime of Two-Tiered Sensor Networks. Computer Communications 31(14), 3451–3459 (2008)
19. Bari, A., Jaekel, A., Bandyopadhyay, S.: A Genetic Algorithm Based Approach for Energy Efficient Routing in Two-Tiered Sensor Networks. Ad Hoc Networks Journal, Special Issue: Bio-Inspired Computing 7(4), 665–676 (2009)
20. Kalpakis, K., Dasgupta, K., Namjoshi, P.: Efficient algorithms for maximum lifetime data gathering and aggregation in wireless sensor networks. Computer Networks 42(6), 697–716 (2003)
21. Bari, A., Wu, Y., Jaekel, A.: Integrated Placement and Routing of Relay Nodes for Fault-Tolerant Hierarchical Sensor Networks. In: IEEE ICCCN - SN, pp. 1–6 (2008)
22. Hao, B., Tang, J., Xue, G.: Fault-tolerant relay node placement in wireless sensor networks: formulation and approximation. In: Workshop on High Performance Switching and Routing (HPSR), pp. 246–250 (2004)
23. Alwan, H., Agarwal, A.: A Survey on Fault Tolerant Routing Techniques in Wireless Sensor Networks. In: SensorComm, pp. 366–371 (2009)
24. Wu, Y., Fahmy, S., Shroff, N.B.: On the Construction of a Maximum-Lifetime Data Gathering Tree in Sensor Networks: NP-Completeness and Approximation Algorithm. In: INFOCOM, pp. 356–360 (2008)
25. Liu, H., Wan, P.-J., Jia, X.: Fault-Tolerant Relay Node Placement in Wireless Sensor Networks. In: Wang, L. (ed.) COCOON 2005. LNCS, vol. 3595, pp. 230–239. Springer, Heidelberg (2005)
26. Ye, W., Heidemann, J., Estrin, D.: An Energy-Efficient MAC Protocol for Wireless Sensor Networks. In: IEEE INFOCOM, pp. 1567–1576 (2002)
27. Ye, W., Heidemann, J., Estrin, D.: Medium access control with coordinated adaptive sleeping for wireless sensor networks. IEEE/ACM Transactions on Networking 12(3), 493–506 (2004)
28. Wu, Y., Fahmy, S., Shroff, N.B.: Optimal Sleep/Wake Scheduling for Time-Synchronized Sensor Networks with QoS Guarantees. In: Proceedings of IEEE IWQoS, pp. 102–111 (2006)
29. Wu, Y., Fahmy, S., Shroff, N.B.: Energy Efficient Sleep/Wake Scheduling for Multi-Hop Sensor Networks: Non-Convexity and Approximation Algorithm. In: Proceedings of IEEE INFOCOM, pp. 1568–1576 (2007)
30. ILOG CPLEX 9.1 Documentation, http://www.columbia.edu/~dano/resources/cplex91_man/index.html

Scheduling Randomly-Deployed Heterogeneous Video Sensor Nodes for Reduced Intrusion Detection Time

Congduc Pham

University of Pau, LIUPPA Laboratory, Avenue de l'Université - BP 1155, 64013 Pau Cedex, France
[email protected]
http://web.univ-pau.fr/~cpham

Abstract. This paper proposes to use video sensor nodes to provide an efficient intrusion detection system. We use a scheduling mechanism that takes into account the criticality of the surveillance application, and we present a performance study of various cover set construction strategies that take into account cameras with heterogeneous angles of view and those with very small angles of view. We show by simulation how a dynamic criticality management scheme can provide fast event detection for mission-critical surveillance applications, increasing the network lifetime while providing a low stealth time for intrusions.

Keywords: Sensor networks, video surveillance, coverage, mission-critical applications.

1 Introduction

The monitoring capability of Wireless Sensor Networks (WSN) makes them very suitable for large-scale surveillance systems. Most of these applications have a high level of criticality and cannot be deployed with the current state of technology. This article focuses on Wireless Video Sensor Networks (WVSN), where sensor nodes are equipped with miniaturized video cameras. We consider WVSN for mission-critical surveillance applications where sensors can be deployed en masse when needed for intrusion detection or disaster relief applications. This article also focuses on taking into account cameras with heterogeneous angles of view and those with very small angles of view. Surveillance applications [1,2,3,4,5] have very specific needs due to their inherently critical nature associated with security. Early surveillance applications involving WSN have been applied to critical infrastructures such as production systems or oil/water pipeline systems [6,7]. There have also been some propositions for intrusion detection applications [8,9,10,11], but most of these studies focused on coverage and energy optimizations without explicitly having the application's criticality in the control loop, which is the main concern in our work.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 303–314, 2011. © Springer-Verlag Berlin Heidelberg 2011

For instance, with video sensors, the higher the capture rate, the better relevant events can be detected and identified. However, even for very mission-critical applications, it is not realistic to consider that video nodes should always capture at their maximum rate when in active mode. The notion of cover set has been introduced to define the redundancy level of sensor nodes that monitor the same region. In [12] we developed the idea that when a node has several cover sets, it can increase its frame capture rate because, if it runs out of energy, it can be replaced by one of its cover sets. Then, depending on the application's criticality, the frame capture rate of those nodes with a large number of cover sets can vary: a low criticality level indicates that the application does not require a high video frame capture rate, while a high criticality level does. According to the application's requirements, an r0 value that indicates the criticality level can be initialized in all sensor nodes prior to deployment. Based on the criticality model we developed previously in [12], this article makes two contributions. The first contribution is an enhanced model for determining a sensor's cover sets that takes into account cameras with heterogeneous angles of view and those with very small angles of view. The performance of this approach is evaluated through simulation. The second contribution is to show the performance of the multiple-cover-sets criticality-based scheduling method proposed in [12] for fast event detection in mission-critical applications. The paper is organized as follows: Section 2 presents the coverage model and our approach for quickly building multiple cover sets per sensor. In Section 3 we briefly present the dynamic criticality management model; the main contribution of this paper, fast event detection, is presented in Section 4. We conclude in Section 5.

2 Video Sensor Model

A video sensor node v is represented by the FoV of its camera. In our approach, we consider a commonly used 2-D model of a video sensor node where the FoV is defined as a triangle (pbc) denoted by a 4-tuple v(P, d, V⃗, α). Here P is the position of v, d is the distance pv (depth of view, DoV), V⃗ is the vector representing the line of sight of the camera's FoV, which determines the sensing direction, and α is the angle of the FoV on both sides of V⃗ (2α can be denoted as the angle of view, AoV). The left side of figure 1(a) illustrates the FoV of a video sensor node in our model. The AoV (2α) is 30° and distance bc is the linear FoV, which is usually expressed in ft/1000yd or in millimeters/meter. Using simple trigonometry we can link bc to pv with the relation bc = 2·(sin α / cos α)·pv = 2 tan α · pv.

Fig. 1. Coverage model: (a) coverage model; (b) heterogeneous AoV

We define a cover set Coi(v) of a video node v as a subset of video nodes such that the union of the FoV areas of the nodes in Coi(v) covers v's FoV area. Co(v) is defined as the set of all the cover sets Coi(v) of node v. One of the first embedded cameras on wireless sensor hardware is the Cyclops board designed for the Crossbow Mica2 sensor [13], which is advertised to have an AoV of 52°. More recently, the IMB400 multimedia board has been designed for the Intel iMote2 sensor and has an AoV of about 20°, which is rather small. Obviously, the linear FoV and the AoV are important criteria in video sensor networks deployed for mission-critical surveillance applications. The DoV is a more subjective parameter: technically, the DoV could be very large, but in practice it is limited by the fact that an observed object must be sufficiently big to be identified.

2.1 Determining Cover Sets

In the case of omnidirectional sensing, a node can simply determine which parts of its coverage disc are covered by its neighbors. For FoV coverage the task is more complex: determining whether a sensor's FoV is completely covered by a subset of neighbor sensors is time consuming and usually too resource-consuming for autonomous sensors. A simple approach, presented in [14], is to use significant points of a sensor's FoV to quickly determine cover sets that may not completely cover sensor v's FoV, but do cover a high percentage of it. First, sensor v can classify its neighbors into 3 categories of nodes: (i) those that cover point p, (ii) those that cover point b, and (iii) those that cover point c. Then, in order to avoid selecting neighbors that cover only a small portion of v's FoV, we add a fourth point, taken near the center of v's FoV, to construct a fourth set, and require that candidate neighbors cover at least one of the 3 vertices and the fourth point. It is possible to use pbc's center of gravity, noted point g, as depicted in figure 1(a)(right). In this case, a node v can practically compute Co(v) by finding the following sets, where N(v) represents the set of neighbors of node v:
– P/B/C/G = {v′ ∈ N(v) : v′ covers point p/b/c/g of the FoV}
– PG = P ∩ G, BG = B ∩ G, CG = C ∩ G
Then, Co(v) can be computed as the Cartesian product of the sets PG, BG and CG ({PG × BG × CG}). However, compared to the basic approach described in [14], point g may not be the best choice in the case of heterogeneous camera AoVs and very small AoVs, as will be explained in the next subsections.
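As a concrete illustration, the significant-point construction above can be sketched as follows. The geometry helpers, the dict-based node representation and the parameter names are illustrative assumptions of ours, not the paper's implementation.

```python
import itertools
import math

def covers(sensor, point):
    """True if `point` lies inside `sensor`'s triangular FoV: within the
    depth of view and within +/- alpha of the line of sight."""
    px, py = sensor["pos"]
    dx, dy = point[0] - px, point[1] - py
    dist = math.hypot(dx, dy)
    if dist > sensor["dov"]:
        return False
    if dist == 0:
        return True
    dev = math.atan2(dy, dx) - sensor["theta"]
    dev = (dev + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi]
    return abs(dev) <= sensor["alpha"]

def fov_points(sensor):
    """Significant points of the FoV: apex p, far corners b and c, and
    the centre of gravity g of triangle pbc."""
    px, py = sensor["pos"]
    d, th, a = sensor["dov"], sensor["theta"], sensor["alpha"]
    b = (px + d * math.cos(th - a), py + d * math.sin(th - a))
    c = (px + d * math.cos(th + a), py + d * math.sin(th + a))
    g = ((px + b[0] + c[0]) / 3.0, (py + b[1] + c[1]) / 3.0)
    return (px, py), b, c, g

def cover_sets(v, neighbours):
    """Co(v) as the Cartesian product {PG x BG x CG}: each candidate must
    cover one vertex of v's FoV plus the fourth point g."""
    p, b, c, g = fov_points(v)
    PG = [n for n in neighbours if covers(n, p) and covers(n, g)]
    BG = [n for n in neighbours if covers(n, b) and covers(n, g)]
    CG = [n for n in neighbours if covers(n, c) and covers(n, g)]
    return [frozenset(n["id"] for n in t)
            for t in itertools.product(PG, BG, CG)]
```

Note that the product can return the same node in several roles, so a cover set may have fewer than three distinct members; the frozenset representation collapses such duplicates.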

2.2 The Case of Heterogeneous AoV

It is quite possible that video sensors with different angles of view are randomly deployed. In this case, a wide-angle FoV could have to be covered by narrow-angle FoV sensors and vice versa. Figure 1(b) shows these cases, and the left part of the figure shows the most problematic case, where a wide FoV (2α = 60°) has to be covered by a narrow FoV (2α = 30°). As we can see, it becomes very difficult for a narrow-angle node to cover pbc's center of gravity g and one of the vertices at the same time.

Fig. 2. Using more alternate points: (a) heterogeneous AoV; (b) very small AoV

The solution we propose in this paper is to use alternate points gp, gb and gc, set in figure 2(a)(left) as the mid-points of segments [pg], [bg] and [cg] respectively. It is also possible to give them different weights, as shown in the right part of the figure. When using these additional points, it is possible to require that a sensor vx either covers both c and gc, or covers gc and g (and similarly for b and gb, and for p and gp), depending on whether the edges or the center of sensor v's FoV are privileged. Generalizing this method by using different weights to set gc, gb and gp closer to or farther from their respective vertices can be useful to express which parts of v's FoV have more priority, as depicted in figure 2(a)(right), where gc has moved closer to g, gb closer to b and gp closer to p.
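The weighted placement of the alternate points is simply linear interpolation along the segment towards g; the weight parameter w below is our notation for it, not the paper's.

```python
def alternate_point(vertex, g, w=0.5):
    """Point on segment [vertex, g]: w = 0.5 gives the mid-points g_p, g_b,
    g_c described above; w closer to 1 privileges the centre of the FoV,
    w closer to 0 privileges the edges."""
    return (vertex[0] + w * (g[0] - vertex[0]),
            vertex[1] + w * (g[1] - vertex[1]))
```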

2.3 The Case of Very Small AoV

On some hardware, the AoV can be very small. This is the case, for instance, with the IMB400 multimedia board on the iMote2, which has an AoV of 2α = 20°. Figure 2(b)(left) shows that in this case the most difficult scenario is to cover both point p and point gp if gp is set too far from p. As it is not interesting to move gp closer to p with such a small AoV, the solution we propose is to discard point p and only consider point gp, which can move along segment [pg] as previously. Therefore, in the scenario depicted in figure 2(b)(right), we have PG = {v3, v6}, BG = {v1, v2, v5} and CG = {v4}, resulting in Co(v) = {{v3, v1, v4}, {v3, v2, v4}, {v3, v5, v4}, {v6, v1, v4}, {v6, v2, v4}, {v6, v5, v4}}.
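The Co(v) enumeration in this example is just the Cartesian product of the three sets:

```python
from itertools import product

# Sets from the scenario of figure 2(b)(right)
PG = ["v3", "v6"]
BG = ["v1", "v2", "v5"]
CG = ["v4"]

# 2 x 3 x 1 = 6 cover sets
Co_v = [set(t) for t in product(PG, BG, CG)]
print(len(Co_v))  # 6
```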

2.4 Accuracy of the Proposed Method

Using specific points is of course an approximation, and a cover set can satisfy the specific-point coverage conditions without ensuring the coverage of the entire FoV. To evaluate the accuracy of our cover set construction technique, especially for very small AoV, we conducted a series of simulations based on the discrete event simulator OMNeT++ (http://www.omnetpp.org/). The results were obtained from iterations with various node populations on a 75m × 75m area. Nodes have a random position P, a random line of sight V⃗, equal communication ranges of 30m (which determine neighbor nodes), an equal DoV of 25m and an offset angle α. We test with 2α = 20° (α = π/18), 2α = 36° (α = π/10) and 2α = 60° (α = π/6). We ran each simulation 15 times to reduce the impact of randomness. The results (averaged over the 15 simulation runs) are summarized in table 1. We denote by COpbcG, COpbcApbc and CObcApbc the following respective strategies: (i) the triangle points are used with g, pbc's center of gravity, when determining eligible neighbors to be included in a sensor's cover sets; (ii) the alternate points gp, gb and gc are used together with the triangle points; and (iii) same as previously, except that point p is discarded. The "stddev of %coverage" column is the standard deviation over all the simulation runs. A small standard deviation means that the various cover sets have percentages of coverage of the initial FoV close to each other. When "stddev of %coverage" is 0, each simulation run gives only 1 node with 1 cover set; this is usually the case when the strategy used to construct cover sets is too restrictive. Table 1 is divided into 3 parts. The first part shows the COpbcG strategy with 2α = 60°, 2α = 36° and 2α = 20°. We can see that using point g gives a very high percentage of coverage, but with 2α = 36° very few nodes have cover sets compared to the case where 2α = 60°. With a very small AoV, the position of point g is not suitable, as no cover sets are found at all. The second part of table 1 shows the COpbcApbc strategy, where the alternate points gp, gb and gc are used along with the triangle vertices, with 2α = 36° and 2α = 20°. For 2α = 36°, this strategy succeeds in providing both a high percentage of coverage and a larger number of nodes with cover sets. When 2α = 20°, the percentage of coverage is over 70%, but once again very few nodes have cover sets. This second part also shows CObcApbc (point p is discarded) with 2α = 20°. This strategy is quite interesting, as the number of nodes with cover sets increases for a percentage of coverage very close to the previous case. In addition, the mean number of cover sets per node greatly increases, which is highly interesting, as nodes with a high number of cover sets can act as sentry nodes in the network. The last part of table 1 uses a mixed-AoV scenario where 80% of nodes have an AoV of 20° and 20% of nodes an AoV of 36°. It shows the performance of the 3 strategies, and we can see that CObcApbc presents the best tradeoff in terms of percentage of coverage, number of nodes with cover sets and mean number of cover sets per node when many nodes have a small AoV.
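The "% coverage" reported in table 1 can be estimated by sampling the initial FoV triangle. This Monte-Carlo sketch is our own illustration (with an injected coverage predicate), not the paper's measurement code.

```python
import math
import random

def sample_in_triangle(p, b, c, rng):
    """Uniform point inside triangle pbc (square-root interpolation trick)."""
    s = math.sqrt(rng.random())
    r = rng.random()
    return ((1 - s) * p[0] + s * (1 - r) * b[0] + s * r * c[0],
            (1 - s) * p[1] + s * (1 - r) * b[1] + s * r * c[1])

def coverage_pct(triangle, cover_set, covers, n=10000, seed=7):
    """Monte-Carlo estimate of the percentage of the FoV `triangle` covered
    by the union of the FoVs of the nodes in `cover_set`;
    `covers(node, point)` is the FoV membership predicate."""
    p, b, c = triangle
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        pt = sample_in_triangle(p, b, c, rng)
        if any(covers(node, pt) for node in cover_set):
            hits += 1
    return 100.0 * hits / n
```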

Table 1. Results for COpbcG, COpbcApbc and CObcApbc with 2α = 60°, 2α = 36°, 2α = 20° and mixed AoV. Columns: #nodes | %nodes with coverset | mean coverage | min,max %coverage/coverset | stddev of %coverage | min,max #coverset/node | mean #coverset/node.

COpbcG, 2α = 60°
75   | 4.89  | 94.04 | 90.16,98.15 | 3.67 | 1,5.66    | 2.10
100  | 7.13  | 94.63 | 86.99,98.49 | 4.40 | 1,6       | 2.99
125  | 11.73 | 95.06 | 85.10,99.52 | 4.12 | 1,13      | 3.53
150  | 17.11 | 95.44 | 84,99.82    | 3.98 | 1,16.13   | 4.15
175  | 26.19 | 94.64 | 83.57,99.89 | 4.01 | 1,35.66   | 6.40

COpbcG, 2α = 36°
75   | 0     | 0     | 0,0         | nan  | 0,0       | 0
100  | 1     | 92.03 | 89.78,98.64 | 0    | 1,1       | 1
125  | 1.87  | 91.45 | 88.83,93.15 | 2.97 | 1.13,2    | 1.56
150  | 1.78  | 95.06 | 91.47,98.19 | 4.06 | 1,3       | 1.94
175  | 3.43  | 94.42 | 87.60,99.03 | 4.40 | 1.13,2.66 | 1.92

COpbcG, 2α = 20°
all cases | 0 | 0 | 0,0 | nan | 0,0 | 0

COpbcApbc, 2α = 36°
75   | 12.44 | 77.48 | 56.46,91.81 | 13.13 | 1.13,9.13 | 3.62
100  | 20.13 | 79.62 | 53.65,98.98 | 12.05 | 1,10.66   | 3.94
125  | 30.67 | 76.89 | 50.53,97.92 | 11.58 | 1,34      | 5.40
150  | 35.11 | 78.47 | 52.07,96.09 | 10.60 | 1,31.13   | 6.90
175  | 48.57 | 77.76 | 49.97,98.10 | 10.54 | 1,50.13   | 11.57

COpbcApbc, 2α = 20°
75   | 1.13  | 70.61 | 57.60,91.54 | 0     | 1,1       | 1
100  | 2     | 73.89 | 69.45,79.80 | 9.50  | 1.13,2    | 1.58
125  | 2.67  | 71.78 | 58.67,84.98 | 12.45 | 1.13,2    | 1.75
150  | 4     | 71.67 | 54.18,92.19 | 14.10 | 1,3.66    | 1.91
175  | 7.43  | 75.50 | 54.69,94.01 | 12.87 | 1,8       | 2.74

CObcApbc, 2α = 20°
75   | 7.56  | 73.79 | 56.18,88.54 | 12.45 | 1,5       | 2.10
100  | 9.13  | 67.16 | 47.78,88.71 | 13.80 | 1,4.66    | 2.14
125  | 12.53 | 70.12 | 40.41,87.46 | 13.11 | 1,11.13   | 3.17
150  | 21.13 | 70.10 | 45.72,91.57 | 11.57 | 1,19.13   | 4.18
175  | 25.13 | 71.79 | 44.15,94.18 | 11.91 | 1,37      | 7.05

COpbcG, mixed AoV (80% at 20°, 20% at 36°)
75,100,125 | 0    | 0     | 0,0         | nan | 0,0 | 0
150        | 0.66 | 92.13 | 83.64,95.83 | 0   | 1,1 | 1
175        | 0.57 | 93.45 | 85.75,96.14 | 0   | 1,1 | 1

COpbcApbc, mixed AoV (80% at 20°, 20% at 36°)
75   | 3.11  | 81.89 | 78.13,89.02 | 8.15  | 1.13,2    | 1.58
100  | 3     | 69.83 | 65.50,74.55 | 8.18  | 1,3.66    | 1.89
125  | 4.80  | 78.58 | 69.52,90.92 | 8.03  | 1,3.13    | 1.56
150  | 8.67  | 78.12 | 56.41,97.59 | 13.71 | 1,5       | 1.95
175  | 10.19 | 76.60 | 50.4,95.47  | 13.48 | 1,8.66    | 2.62

CObcApbc, mixed AoV (80% at 20°, 20% at 36°)
75   | 9.13  | 81.48 | 69.18,93.72 | 9.72  | 1,5.66    | 2.06
100  | 6     | 80.10 | 62.82,90.16 | 11.81 | 1,3.66    | 1.94
125  | 10.93 | 73.15 | 47.14,92.14 | 14.43 | 1.13,9.13 | 3.65
150  | 20    | 72.12 | 45.53,95.94 | 12.19 | 1,16.66   | 4.83
175  | 20.95 | 75.15 | 43.01,97.57 | 12.59 | 1,18.13   | 5.15

3 Criticality-Based Scheduling of Randomly Deployed Nodes with Cover Sets

As said previously, the frame capture rate is an important parameter that defines the surveillance quality. In [12], we proposed to link a sensor's frame capture rate to the size of its cover set. In our approach we define two classes of applications: low- and high-criticality applications. The capture-rate curve can move from a concave to a convex shape with the criticality level, as illustrated in Figure 3, with the following interesting properties:
– Class 1, "low criticality", does not need a high frame capture rate. This characteristic can be represented by a concave curve (figure 3(a), box A): most projections of x values are gathered close to 0.
– Class 2, "high criticality", needs a high frame capture rate. This characteristic can be represented by a convex curve (figure 3(a), box B): most projections of x values are gathered close to the maximum frame capture rate.

Fig. 3. Modeling criticality: (a) application classes; (b) the behavior curve functions

[12] proposes to use a Bezier curve to model the two application classes. The advantage of Bezier curves is that with only three points we can easily define a ready-to-use convex (high criticality) or concave (low criticality) curve. In figure 3(b), P0(0, 0) is the origin point, P1(bx, by) is the behavior point and P2(hx, hy) is the threshold point, where hx is the highest cover-set cardinality and hy is the maximum frame capture rate determined by the sensor node hardware capabilities. As illustrated in figure 3(b), by moving the behavior point P1 inside the rectangle defined by P0 and P2, we are able to adjust the curvature of the Bezier curve, and therefore the risk level r0 introduced in the introduction of this paper. According to this level, we define the risk function Rk, which operates on the behavior point P1 to control the curvature of the behavior function. Depending on the position of point P1, the Bezier curve morphs between a convex and a concave form. The first and the last points delimit the curve frame: this frame is a rectangle defined by the source point P0(0, 0) and the threshold point P2(hx, hy). The middle point P1(bx, by) defines the risk level. We assume that this point moves along the second diagonal of the rectangle, i.e., bx = (−hx/hy)·by + hx. Table 2 shows the corresponding capture rate for some relevant values of r0; the cover-set cardinality |Co(v)| ∈ [1, 12] and the maximum frame capture rate is set to 3 fps.

Table 2. Capture rate in fps when P2 is at (12, 3)

 r0 \ |Co(v)|   1    2    3    4    5    6    7    8    9   10   11   12
 0            .01  .02  .05  0.1  .17  .16  .18  .54  .75  1.1  1.5    3
 .1           .07  .15  .15  .17  .51  .67  .86  1.1  1.4  1.7  2.1    3
 .4           .17  .15  .55  .75  .97  1.1  1.4  1.7  2.0  2.1  2.6    3
 .6           .16  .69  1.0  1.1  1.5  1.8  2.0  2.1  2.4  2.6  2.8    3
 .8           .75  1.1  1.6  1.9  2.1  2.1  2.5  2.6  2.7  2.8  2.9    3
 1            1.5  1.9  2.1  2.4  2.6  2.7  2.8  2.9  2.9  2.9   2     3
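A sketch of the resulting capture-rate computation: a quadratic Bezier with P0 = (0, 0), P2 = (hx, hy), and P1 placed on the second diagonal as a function of r0. The exact r0-to-P1 mapping below is our reading of the model, so the numbers only approximate table 2.

```python
import math

def capture_rate(n_covers, r0, hx=12.0, hy=3.0):
    """Frame capture rate (fps) for a node with `n_covers` valid cover sets.

    Assumed behaviour point on the second diagonal: P1 = ((1-r0)*hx, r0*hy),
    so r0 near 1 yields a convex (high-criticality) curve and r0 near 0
    a concave (low-criticality) one.
    """
    x = min(float(n_covers), hx)          # nodes with >= hx covers saturate
    bx, by = (1.0 - r0) * hx, r0 * hy
    # Invert x(t) = 2t(1-t)bx + t^2*hx for t in [0, 1]:
    # (hx - 2bx) t^2 + 2bx t - x = 0
    a, b = hx - 2.0 * bx, 2.0 * bx
    if abs(a) < 1e-12:                    # r0 = 0.5: the curve is a line
        t = x / hx
    else:
        t = (-b + math.sqrt(b * b + 4.0 * a * x)) / (2.0 * a)
    return 2.0 * t * (1.0 - t) * by + t * t * hy
```

For r0 = 0.5 the curve degenerates to the straight line y = (hy/hx)·x, and for any r0 the maximum rate hy is reached at cardinality hx.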

4 Fast Event Detection with Criticality Management

In this section we evaluate the performance of an intrusion detection system by investigating the stealth time of the system. For this set of simulations, 150 sensor nodes are randomly deployed in a 75m × 75m area. Unless otherwise specified, sensors have a 36° AoV and the COpbcApbc strategy is used to construct cover sets. Each sensor node captures at a given number of frames per second (between 0.01fps and 3fps) according to the model defined in figure 3(b). Nodes with 12 or more cover sets capture at the maximum speed. The simulation ends when there are no active nodes left.

4.1 Static Criticality-Based Scheduling

We ran simulations for four levels of criticality: r0 = 0.1, 0.4, 0.6 and 0.8. The corresponding capture rates are those shown in table 2. Nodes with a high capture rate use more battery power until they run out of battery (the initial battery level is 100 units and 1 captured image consumes 1 unit) but, according to the scheduling model, nodes with a high capture rate are also those with a large number of cover sets. Note that it is the number of valid cover sets that defines the capture rate, and not the number of cover sets found at the beginning of the cover set construction procedure. In order to show the benefit of the adaptive behavior, we computed the mean capture rate for each criticality level and then used that value as a fixed capture rate for all the sensor nodes in the simulation model: r0 = 0.1 gives a mean capture rate of 0.12fps, r0 = 0.4 gives 0.56fps, r0 = 0.6 gives 0.83fps and r0 = 0.8 gives 1.18fps.

Table 3. Network lifetime

 r0 = 0.1: 2900s   fixed 0.12 fps: 620s
 r0 = 0.4: 1160s   fixed 0.56 fps: 360s
 r0 = 0.6: 560s    fixed 0.83 fps: 240s
 r0 = 0.8: 270s    fixed 1.18 fps: 170s

Table 3 shows the network lifetime for the various criticality and frame capture rate values. Using the adaptive frame rate is very efficient: the network lifetime is 2900s for r0 = 0.1, while the 0.12fps fixed capture rate lasts only 620s. In order to evaluate further the quality of surveillance, we show in figure 4(top) the mean stealth time when r0 = 0.1, fps = 0.12, r0 = 0.4 and fps = 0.56, and in figure 4(bottom) the case when r0 = 0.6, fps = 0.83, r0 = 0.8 and fps = 1.18. The stealth time is the time during which an intruder can travel in the field without being seen. The first intrusion starts at time 10s at a random position in the field. The scan-line mobility model is then used with a constant velocity of 5m/s to make the intruder move toward the right side of the field. When the intruder is seen for the first time by a sensor, the stealth time is recorded and the mean stealth time is computed. Then a new intrusion appears at another random position. This process is repeated until the simulation ends.
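A minimal stealth-time simulation in the spirit of this setup can be written as follows. All nodes are kept active and there is no energy or scheduling model; the topology is randomized but seeded. These are all simplifying assumptions of ours.

```python
import math
import random

def detected(sensors, point):
    """True if any sensor's triangular FoV contains `point`."""
    for px, py, theta, alpha, dov in sensors:
        dx, dy = point[0] - px, point[1] - py
        if math.hypot(dx, dy) > dov:
            continue
        dev = (math.atan2(dy, dx) - theta + math.pi) % (2 * math.pi) - math.pi
        if abs(dev) <= alpha:
            return True
    return False

def mean_stealth_time(n_nodes=150, side=75.0, speed=5.0, alpha=math.pi / 10,
                      dov=25.0, n_intrusions=200, dt=0.1, seed=1):
    """Average time an intruder crossing the field left-to-right at `speed`
    (scan-line mobility) stays unseen; a new intrusion starts at a random
    position once the previous one is detected or leaves the field."""
    rng = random.Random(seed)
    sensors = [(rng.uniform(0, side), rng.uniform(0, side),
                rng.uniform(0, 2 * math.pi), alpha, dov)
               for _ in range(n_nodes)]
    samples = []
    for _ in range(n_intrusions):
        x, y = rng.uniform(0, side), rng.uniform(0, side)
        t = 0.0
        while x <= side and not detected(sensors, (x, y)):
            x += speed * dt
            t += dt
        samples.append(t)
    return sum(samples) / len(samples)
```

The paper's figures additionally account for node duty-cycling and the criticality-driven capture rate, which this sketch omits.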


Fig. 4. Mean stealth time. Top: r0 = 0.1, fps = 0.12 and r0 = 0.4, fps = 0.56. Bottom: r0 = 0.6, fps = 0.83 and r0 = 0.8, fps = 1.18.

Figure 5(left) shows, for a criticality level r0 = 0.6, the special case of small-AoV sensor nodes. When 2α = 20°, we compare the stealth time under the COpbcGpbc and CObcGpbc strategies. Discarding point p in the cover set construction procedure gives a larger number of nodes with a larger number of cover sets, as shown previously in table 1. In figure 5(left) we can see that the stealth time is very close to the COpbcGpbc case, while the network lifetime almost doubles, reaching 420s instead of 212s. The explanation is as follows: as more nodes have cover sets, they can act as sentry nodes, allowing the other nodes to be in sleep mode while ensuring a high responsiveness of the network.


Fig. 5. Left: stealth time, sliding winavg with a 20-sample batch, r0 = 0.6, AoV = 20°, COpbcGpbc and CObcGpbc. Right: rectangle with 8 significant points, the initial sensor v and 2 different cover sets.

In addition, for the particular case of disambiguation, we introduce an 8m × 4m rectangle at random positions in the field. COpbcGpbc is used and 2α = 36°. The rectangle has 8 significant points, as depicted in figure 5(right), and moves at a velocity of 5m/s in a scan-line mobility model (left to right). Each time a sensor node covers at least 1 significant point, or when the rectangle reaches the right boundary of the field, the rectangle reappears at another random position. This process starts at time t = 10s and is repeated until the simulation ends. The purpose is to determine how many significant points are covered by the initial sensor v and how many can be covered by using one of v's cover sets. For instance, figure 5(right) shows a scenario where v's FoV covers 3 points, the left cover set ({v3, v1, v4}) covers 5 points, while the right cover set ({v3, v2, v4}) covers 6 points. In the simulations, each time a sensor v covers at least 1 significant point of the intrusion rectangle, it determines how many significant points are covered by each of its cover sets. The minimum and the maximum number of significant points covered by v's cover sets are recorded, along with the number of significant points v was able to cover initially. Figure 6 shows these results using a sliding-window averaging filter with a batch window of 10 samples. We can see that a node's cover sets always succeed in identifying more significant points. Figure 7 shows that with the rectangle intrusion (which could represent a group of intruders rather than a single intruder) the stealth time can be further reduced.
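Counting the significant points of the rectangle that a node set covers can be sketched as below; the 8-point layout (4 corners plus 4 edge mid-points) and the tuple-based sensor representation are our assumptions.

```python
import math

def rect_points(x, y, w=8.0, h=4.0):
    """The 8 significant points of a w x h rectangle anchored at (x, y):
    4 corners plus the mid-points of the 4 edges."""
    return [(x, y), (x + w / 2, y), (x + w, y),
            (x + w, y + h / 2), (x + w, y + h),
            (x + w / 2, y + h), (x, y + h), (x, y + h / 2)]

def covers(sensor, pt):
    """FoV membership for a sensor tuple (px, py, theta, alpha, dov)."""
    px, py, theta, alpha, dov = sensor
    dx, dy = pt[0] - px, pt[1] - py
    if math.hypot(dx, dy) > dov:
        return False
    dev = (math.atan2(dy, dx) - theta + math.pi) % (2 * math.pi) - math.pi
    return abs(dev) <= alpha

def covered_points(node_set, rect):
    """How many significant points of `rect` at least one node covers."""
    return sum(1 for pt in rect if any(covers(s, pt) for s in node_set))
```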


Fig. 6. Number of covered points of an intrusion rectangle. Sliding winavg of 10.


Fig. 7. Stealth time, winavg with a 10-sample batch, r0 = 0.8, fps = 1.18 and r0 = 0.8 with rectangle intrusion

4.2 Dynamic Criticality-Based Scheduling

In this section we present preliminary results on dynamically varying the criticality level during the network lifetime. The purpose is to set the surveillance network in an alerted mode (high criticality value) only when needed, i.e., on intrusions. With the same network topology as in the previous simulations, we set the initial criticality level of all the sensor nodes to r0 = 0.1. As shown in the previous simulations, some nodes with a large number of cover sets will act as sentries in the surveillance network. When a sensor node detects an intrusion, it sends an alert message to its neighbors and increases its criticality level to r0 = 0.8. Alerted nodes then also increase their criticality level to r0 = 0.8. Both the node that detects the intrusion and the alerted nodes run at the high criticality level for an alerted period, noted Ta, before going back to r0 = 0.1. Nodes may be alerted several times, but an already-alerted node does not increase its Ta value any further in this simple scenario. As said previously, we do not attempt here to optimize the Ta value, nor to use several levels of criticality values. Figure 8 shows the mean stealth time with this dynamic behavior, with Ta varied from 5s to 60s. We can see that this simple dynamic scenario already succeeds in reducing the mean stealth time while increasing the network lifetime, when compared to a static scenario that provides the same level of service.
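The alert logic above can be sketched as a small state machine; the dict-based node state and function names are our own illustration of the described behavior.

```python
def on_intrusion_detected(node, neighbours, now, Ta=20.0, r_alert=0.8):
    """The detecting node alerts its one-hop neighbours; all of them switch
    to the high criticality level until now + Ta.  An already-alerted node
    does not extend its alerted period (simple scenario of section 4.2)."""
    for n in [node] + neighbours:
        if not n.get("alerted", False):
            n["alerted"] = True
            n["r0"] = r_alert
            n["alert_expiry"] = now + Ta

def tick(nodes, now, r_low=0.1):
    """Return nodes whose alerted period has elapsed to low criticality."""
    for n in nodes:
        if n.get("alerted", False) and now >= n["alert_expiry"]:
            n["alerted"] = False
            n["r0"] = r_low
```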


Fig. 8. Mean stealth time with dynamic criticality management

5 Conclusions

This paper presented the performance of cover set construction strategies and of a dynamic criticality scheduling scheme that enable fast event detection for mission-critical surveillance with video sensors. We focused on taking into account cameras with heterogeneous angles of view and those with very small angles of view. We showed that our approach improves the network lifetime while providing a low stealth time for intrusion detection systems. Preliminary results with dynamic criticality management also show that the network lifetime can be further increased. These results show that, besides providing a model for translating a subjective criticality level into a quantitative parameter, our approach for video sensor nodes also optimizes resource usage by dynamically adjusting the provided service level.

Acknowledgment. This work is partially supported by the FEDER POCTEFA EFA35/08 PIREGRID project, the Aquitaine-Aragon OMNI-DATA project and by the PHC Tassili project 09MDU784.

References

1. Collins, R.T., Lipton, A.J., Kanade, T.: Algorithms for cooperative multisensor surveillance. Proceedings of the IEEE 89(10) (2001)
2. Yan, T., He, T., Stankovic, J.A.: Differentiated surveillance for sensor networks. In: ACM SenSys (2003)
3. He, T., et al.: Energy-efficient surveillance system using wireless sensor networks. In: ACM MobiSys (2004)
4. Oh, S., Chen, P., Manzo, M., Sastry, S.: Instrumenting wireless sensor networks for real-time surveillance. In: International Conference on Robotics and Automation (2006)
5. Cucchiara, R., et al.: Using a wireless sensor network to enhance video surveillance. J. Ubiquitous Computing and Intelligence 1(2) (2007)
6. Stoianov, I., Nachman, L., Madden, S.: PIPENET: A wireless sensor network for pipeline monitoring. In: ACM IPSN (2007)
7. Albano, M., Di Pietro, R.: A model with applications for data survivability in critical infrastructures. J. of Information Assurance and Security 4 (2009)
8. Dousse, O., Tavoularis, C., Thiran, P.: Delay of intrusion detection in wireless sensor networks. In: ACM MobiHoc (2006)
9. Zhu, Y., Ni, L.M.: Probabilistic approach to provisioning guaranteed QoS for distributed event detection. In: IEEE INFOCOM (2008)
10. Freitas, E., et al.: Evaluation of coordination strategies for heterogeneous sensor networks aiming at surveillance applications. In: IEEE Sensors (2009)
11. Keally, M., Zhou, G., Xing, G.: Watchdog: Confident event detection in heterogeneous sensor networks. In: IEEE Real-Time and Embedded Technology and Applications Symposium (2010)
12. Makhoul, A., Saadi, R., Pham, C.: Risk management in intrusion detection applications with wireless video sensor networks. In: IEEE WCNC (2010)
13. Rahimi, M., et al.: Cyclops: In situ image sensing and interpretation in wireless sensor networks. In: ACM SenSys (2005)
14. Makhoul, A., Pham, C.: Dynamic scheduling of cover-sets in randomly deployed wireless video sensor networks for surveillance applications. In: IFIP Wireless Days (2009)

An Integrated Routing and Medium Access Control Framework for Surveillance Networks of Mobile Devices

Nicholas Martin 1, Yamin Al-Mousa 1, and Nirmala Shenoy 2

1 College of Computing and Information Science, Networking Security and Systems Administration Dept., Rochester Institute of Technology, 1 Lomb Dr, Rochester, NY 14623, USA
[email protected], {ysa49,nxsvks}@rit.edu
2

Abstract. In this paper we present an integrated solution that combines routing, clustering and medium access control operations, basing them on a common meshed tree algorithm. The aim is to achieve an efficient airborne surveillance network of unmanned aerial vehicles in which any loss of captured data is kept to a minimum while maintaining low latency in packet and data delivery. Surveillance networks of varying sizes were evaluated with varying numbers of senders, while the physical layer was kept invariant.

Keywords: meshed trees, burst forwarding medium access control, surveillance.

1 Introduction

Mobile Ad Hoc Networks (MANETs) of unmanned aerial vehicles (UAVs) face severe challenges in delivering surveillance data to specific aggregation nodes without loss of information. Depending on the time sensitivity of the captured data, the end-to-end packet and file delivery latency can also be critical metrics. From a networking perspective, the two major protocols that impact lossless and timely delivery are the medium access control (MAC) and routing protocols. Physical layer and transport layer protocols certainly play a major role as well; however, we limit the scope of this work to MAC and routing protocols. Such surveillance networks require several UAVs to cover a wide area, with the UAVs typically travelling at speeds of 300 to 400 km/h. These features pose significant additional challenges to the design of MANET routing and MAC protocols, which must now be both scalable and resilient: able to handle the frequent route breaks caused by node mobility.

The predominant traffic pattern in surveillance networks is converge-cast, where data travels from several nodes to an aggregation node. We leverage this property in the proposed solution. We also integrate routing and MAC functions into a single protocol layer, where both routing and MAC operations are achieved with a single address. The routing protocol uses the inherent path information contained in the addresses, while the MAC uses the same addresses for hop-by-hop packet forwarding.

Data aggregation or converge-cast traffic is best handled through multi-hop clustering, wherein a cluster head (CH) is a special node that aggregates the data and manages the cluster. The solution we propose uses one such clustering scheme based on a 'meshed tree' principle [1], where the root of the meshed tree is the CH. As the cluster is a tree, the branches connecting the cluster clients (CCs) to the CH provide a path to send the data from the CCs to the CH. Thus, a clustering mechanism is integrated into the routing and MAC framework. The 'meshing' of the tree branches allows one node to reside in multiple tree branches that originate from the root, namely the CH. The duration of residency on a branch depends on the movement patterns and speeds of the nodes. Thus, as nodes move, they may leave one or more branches and connect to new branches. Most importantly, even if a node loses one path to the CH, it likely remains connected to the CH via another branch and thus has an alternate path.

The clustering scheme also allows for the creation of several overlapped multi-hop clusters, leading to the notion of multi meshed trees (MMT). The overlap is achieved by allowing the branches of one meshed tree to further mesh with the branches in neighboring clusters. This provides connectivity to cluster clients moving across clusters. It also helps extend the coverage area of the surveillance network to address scalability.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 315–327, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Related Work

The topic areas of major contribution in this article are routing protocols, clustering algorithms, and MAC protocols for mobile ad hoc networks. The significance of this framework lies in the closely integrated operation of routing, clustering, and MAC. To the best of our knowledge, no solution published in the literature thus far targets such an approach. Cross-layered approaches, which break down the limitations on inter-layer communication to facilitate more effective integration and coordination between protocol layers, have similar goals. However, our solution is not a cross-layered approach. We felt that in a dedicated and critical MANET application, such as a surveillance network, one should not be constrained by the protocol layers or stacks, but should achieve the operations through efficient integration of the required functions.

For the above reasons, it is difficult to cite related work with an approach similar to ours. However, as we use a multi-hop clustering scheme, we highlight multi-hop clustering algorithms discussed in the literature. Our solution includes a routing scheme, so we also discuss some proactive, reactive and hybrid routing algorithms to highlight the differences in the proposed routing scheme, and we cite some framework solutions that combine clustering and routing to explain the difference in approaches. The MAC adopted in this work is based on CSMA/CA, but uses the addresses adopted in our solution to achieve efficient data aggregation by sending several packets in a burst, i.e., a sequence of packets. Several survey articles on MANET routing and clustering schemes, written from different perspectives, indicate the continuing challenges in this topic area [2, 3].
Proactive routing protocols require periodic dissemination of link information so that a node can use standard algorithms, such as Dijkstra's, to compute routes to all other nodes in the network or in a given zone [4]. Link information dissemination requires flooding of messages that contain such information. In large networks such control messages can consume significant amounts of the bandwidth,


making the proactive routing approach not scalable. Several proactive routing protocols thus target mechanisms to reduce this control overhead, i.e., the bandwidth used for control messages. Fisheye State Routing (FSR) [8], Fuzzy Sighted Link State, Hazy Sighted Link State [10], Optimized Link State Routing [6], and Topology Broadcast based on Reverse Path Forwarding [9] are some such approaches. Reactive routing protocols avoid periodic link information dissemination and allow a node to discover routes to a destination node only when it has data to send to that destination. The reactive route discovery process can result in the source node receiving several route responses, which it may cache. As mobility increases, route caching may become ineffective, as pre-discovered routes may become stale and unusable. Dynamic Source Routing (DSR) [5], Ad Hoc On-demand Distance Vector (AODV) [4], Temporally Ordered Routing Algorithm [3], and Light-Weight Mobile Routing [13] are some of the more popular reactive routing approaches.

Partitioning a MANET physically or logically and introducing hierarchy has been used to limit message flooding and also to address scalability. Mobile Backbone Networks (MBNs) [14] use hierarchy to form a higher-level backbone network, utilizing special backbone nodes with low mobility that carry an additional, more powerful radio to establish wireless links amongst themselves. LANMAR [13], the Sharp Hybrid Adaptive Routing Protocol (SHARP) [15], Hybrid Routing for Path Optimality [11], and the Zone Routing Protocol (ZRP) [12] are protocols in this category. Nodes physically close to each other form clusters, with a CH communicating to other nodes on behalf of the cluster. Different routing strategies can be used inside and outside a cluster. Several cluster-based routing protocols address the scalability issues faced in MANETs. Clusterhead Gateway Switch Routing (CGSR) [16] and Hierarchical State Routing (HSR) [17] are two popular cluster-based routing schemes.

3 Meshed Tree Clustering

It is important to understand the cluster formation in the clustering scheme under consideration and the routing capabilities within the cluster for data aggregation at the CH. The multi-hop clustering scheme and the cluster formation based on the 'meshed tree' algorithm are described with the aid of Figure 1. The dotted lines connect nodes that are in communication range of one another at the physical layer. The data aggregation node or cluster head is labeled 'CH'. Nodes A through G are the CCs.

Fig. 1. Cluster Formation Based on Meshed Trees

At each node several 'values' have been noted. These are the virtual IDs (VIDs) assigned to the node when it joins the cluster. In Figure 1, each arrow from the CH is a branch of connection to the CCs. Each branch is a sequence of VIDs assigned to the CCs connecting at different points of the branch. The branch denoted by VIDs 14, 142 and 1421 connects nodes C (via VID 14), F (via VID 142) and E (via VID 1421) to the CH. Assuming that the CH has the VID '1', the CCs in this cluster will have '1' as the first prefix in their VIDs. Any CC that attaches to a branch is assigned a VID that inherits its prefix from its parent node, followed by an integer indicating the child number under that parent. This pattern of inheriting the parent's VID will be clear if the reader follows the branches identified in Figure 1 by the arrows.

The meshed tree cluster is formed in a distributed manner: a node listens to its neighbor nodes advertising their VIDs, and decides to join any or all of the branches noted in the advertised VIDs. A VID contains information about the number of hops from the CH; this is inherent in the VID length, which a node can use to decide which branch it would like to join if shortest hop count is a criterion. Once a node decides to join a branch, it has to inform the CH. The CH then registers the node as its CC, confirms its admittance to the cluster, and accordingly updates a VID table of its CCs. A CH can restrict admittance to nodes within a certain number of hops, or stop admitting new nodes to keep the number of CCs in the cluster under a certain value. This is useful to contain the data collection zone of a cluster.

Routing in the Cluster: The branches of the meshed tree provide the routes to send and receive data and control packets between the CCs and the CH. As an example, consider packet routing where the CH has a packet to send to node E. The CH may decide to use the path given by VID 1421 to E.
The CH will include its VID '1' as the source address and E's VID 1421 as the destination address, and broadcast the packet. The nodes that perform the hop-by-hop forwarding are C and F. From the source VID and destination VID, C knows that it is the next hop en route, because it has VID 14, the packet came from VID '1', and the packet is destined to 1421; i.e., the scheme uses a path vector concept. When C subsequently broadcasts the packet, F will receive it and in turn forward it to E. The VID of a node thus provides a virtual path vector from the CH to itself. Note that the CH could also have used VIDs 143 or 131 for node E, in which case the path taken by the packet would have been CH-C-E or CH-D-E respectively. Thus between the CH and node E there are multiple routes, identified by the multiple VIDs. This support for multiple routes through multiple VIDs allows for robust and dynamic route adaptability to topology changes in the cluster.
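The VID mechanics described above can be made concrete with a short sketch. This is illustrative only: we assume string VIDs with single-digit child numbers, so one character of a VID corresponds to one hop, and the function names are ours, not the paper's.

```python
def assign_child_vid(parent_vid: str, child_number: int) -> str:
    """A joining node inherits its parent's VID as a prefix, followed by
    an integer indicating its child number under that parent."""
    return parent_vid + str(child_number)

def hops_to_ch(vid: str) -> int:
    """The hop count to the CH is inherent in the VID length
    (the CH itself holds the one-character VID '1')."""
    return len(vid) - 1

def is_next_hop(my_vid: str, src_vid: str, dst_vid: str) -> bool:
    """An overhearing node forwards a packet travelling down a branch if its
    own VID extends the sender's VID by one hop and prefixes the destination."""
    return (dst_vid.startswith(my_vid)
            and my_vid.startswith(src_vid)
            and len(my_vid) == len(src_vid) + 1)

# The CH (VID '1') sends to node E (VID '1421'): C (VID '14') and then
# F (VID '142') recognise themselves as the hop-by-hop forwarders,
# while D (VID '13') stays silent.
assert is_next_hop("14", "1", "1421")
assert is_next_hop("142", "14", "1421")
assert not is_next_hop("13", "1", "1421")
```

The same prefix test works at every hop, which is why no routing table beyond the node's own VIDs is needed inside a cluster.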

Fig. 2. Overlapped Cluster Formation Based on Meshed Trees

Route failures: Capturing all data without loss is very important in surveillance networks used in tactical applications. Loss of data can be caused by route failures or by collisions at the MAC. There are two cases of route failures that can occur, yet be swiftly rectified, in the proposed solution. In the first case, a node may be in the process of sending data, and may even have sent part of the data using a particular VID, only to discover that said VID or path is no longer valid. In the second case, a node may be forwarding data for another node, but after collecting and forwarding a few data packets, this forwarding node also loses the VID that was being used.

Case I: Source node loses a route: For example, node B in Figure 2 is sending a 1 MB file to the CH using its shortest VID '11'. Assume that node B was able to send ½ MB, at which time, due to its mobility, it lost its VID '11' but still held VID '121'; it can then send the remaining ½ MB of data using VID '121'.

Case II: Intermediate node loses a route: Let us continue the above example. Node A is forwarding the data from node B on its VID 12 (the data comes from node B via its VID 121). Assume that after forwarding ¼ MB, node A moves in the direction of node D, loses its VID 12, but gains a new VID '131' as it joins the branch under node D. Node A can continue sending the rest of the file using its new VID 131. As the knowledge about the destination node is consistent (i.e., it is the CH with VID '1'), any node is able to forward the collected data towards the CH.

Disconnects: In a disconnect situation, a missing VID link may first be noticed by the parent or the child of a node with whom the link is shared. In such cases, the parent node will inform the CH of the missing child VID, so that the CH will not send any messages to it.
Meanwhile the child node, which is downstream on the branch, will notify its children about their lost VIDs (VIDs derived from the missing VID) so that they invalidate those VIDs and no longer use them to send data to the CH.

Inter-cluster Overlap and Scalability: As a surveillance network can have several tens of nodes, the proposed solution must be scalable. We assume that several data aggregation nodes (i.e., CHs) are uniformly distributed among the non-data-aggregation nodes during deployment of the surveillance network. Meshed tree clusters can be formed around each of the data aggregation nodes, taking them to be the CHs. Nodes bordering two or more clusters are allowed to join branches originating from different CHs, and accordingly inform their respective CHs about their multiple VIDs under the different clusters. When a node moves away from one cluster, it can still be connected to other clusters, and the surveillance data collected by that node will not be lost. Also, by allowing nodes to belong to multiple clusters, single meshed tree cluster-based data collection can be extended to multiple overlapping meshed tree clusters that collect data from nodes deployed over a wider area with a very low probability of losing the captured data. Figure 2 shows two overlapped clusters and some border nodes that hold VIDs in both clusters; the concept is extendable to several neighboring clusters. Nodes F and G have VIDs 142 and 132 under CH1, and VIDs 251 and 252 under CH2, respectively. Note that a node is aware of the cluster under which it has a VID, as this information is inherent in the VIDs it acquires; thus a node has some intelligence to decide which VIDs it would like to acquire, i.e., it can decide to hold several VIDs under one cluster, or acquire VIDs that span several clusters, and so on.
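Because every VID derived from a missing VID shares it as a prefix, the invalidation step described under "Disconnects" reduces to a prefix test. A minimal sketch, with illustrative names:

```python
def invalidate_derived(vid_table: set, missing_vid: str) -> set:
    """Drop every VID that descends from (i.e., is prefixed by) the missing
    VID; the surviving VIDs are alternate branches that still reach the CH."""
    return {v for v in vid_table if not v.startswith(missing_vid)}

# Node E of Figure 1 holds VIDs 1421, 143 and 131. If the branch breaks at
# F (VID 142), only the VID under that branch is lost; E stays connected
# to the CH through its two remaining VIDs.
remaining = invalidate_derived({"1421", "143", "131"}, "142")
assert remaining == {"143", "131"}
```

In the protocol, the child node propagates this notification downstream, so grandchildren invalidate their derived VIDs by the same prefix rule.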


Significance of the Approach: From the meshed tree-based clustering and routing scheme described thus far, it should be clear that our scheme adopts a proactive routing approach, where the proactive routes between the CCs and the CH in a cluster are established as the meshed trees, or clusters, are formed around each CH. Thus, using a single algorithm, a node automatically acquires routes to the CH during the cluster joining process. There is flexibility in dimensioning the cluster, in terms of the number of CCs in a cluster and the maximum hops a CC is allowed from the CH. The tree formation differs from the spanning trees discussed in the literature in that a node is allowed to reside in several branches simultaneously, allowing dynamic adaptation to route changes as nodes move. This also enhances the robustness of connectivity to the CH. The approach is ideal for data aggregation from the CCs to the CH, and is very suitable for MANETs with highly mobile nodes.

4 Burst Forwarding Medium Access Control Protocol

The Burst Forwarding Medium Access Control (BF-MAC) protocol primarily focuses on reducing collisions while providing the capability of MAC forwarding of multiple data packets from one node to another in the same cluster. Additionally, the MAC allows for sequential 'node' forwarding, where all intermediate nodes forward a burst of packets one after another in sequence between a source and destination node across multiple hops. These capabilities are created through careful creation of MAC data sessions, which encompass the time necessary to burst multiple packets across multiple hops. For non-data control packets, such as those from the routing and cluster formation process, the MAC uses a system based on Carrier Sense Multiple Access/Collision Avoidance (CSMA/CA).

Fig. 3. Illustration of traffic forwarding along a single tree branch

The above type of MAC forwarding is possible due to the VIDs, which carry information about a node's CH and about the intermediate nodes, and which the MAC makes use of. A node's data will physically travel up a VID branch to the CH of that tree. Therefore, by knowing which VID was used by a node to send a data packet, and that packet's intended destination (the CH), an overhearing node can determine the next VID in the path. This process is used by all overhearing nodes to forward, each in turn, a packet all the way to the CH. This is illustrated in Figure 3: when the node with VID 121 has data to send to CH1, the intermediate node with VID 12 will pick up the packet and forward it to the CH.

The MAC process at a node that has data to send creates a MAC data session. A Request to Send (RTS) packet is sent by the node and is forwarded by the intermediate nodes until it reaches the CH. When a recipient node (i.e., a forwarding


node) along the path receives the RTS, it becomes part of the data session. A set of data packets may then be sent to the intended destination, in this case the CH, along the same path as the RTS packet. The final node in the path, the CH, will send an explicit acknowledgement (eACK) packet to the previous node as a reliability check. eACKs are not forwarded back to the initial sender. Nodes in the path of the data session, except for the penultimate node, instead listen for the packet they just sent to the next node: the same packet being forwarded onward by the next node in the data session path (be it an RTS or a data packet). Receiving this packet serves as an implicit acknowledgment (iACK), as the next node must have received the sent packet if it is now attempting to forward it. Note that the iACK is simply the forwarded RTS or data packet. Not receiving any type of acknowledgment causes a node to invoke the MAC retry model, discussed below.

During a data session, collisions from neighboring nodes are prevented in the same way as by the collision avoidance mechanism in CSMA/CA: nodes that hear a session in progress keep silent. When a node overhears an RTS, eACK or data packet for which it is not the destination or the next node in line to forward, it switches to a Not Clear to Send (NCTS) mode. This prevents the node from sending any control packets or joining a data session. If a node is already part of a separate data session, it will continue with that data session. The NCTS mode lasts for a duration specified as the Session on Wait (SOW) time, noted in the packets being transmitted during the session. The SOW time is calculated by the initial sender of a data session, and marks the amount of time left for that data session. At each hop, it is decremented by the transmission time of the current packet plus a guard time to account for propagation delay, as shown in Figure 4.
When the SOW time has elapsed, the data session is over and all nodes return to a Clear to Send (CTS) mode. A node in CTS mode may start a new data session, join a data session via forwarding, or send control packets.
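The SOW arithmetic described above can be sketched in a few lines. The parameter names (tx_time, guard) and the exact form of the initial-SOW computation are our assumptions; the paper specifies only the per-hop decrement.

```python
def initial_sow(packets_remaining: int, tx_time: float, guard: float) -> float:
    """The initial sender marks the time left for the session: here taken as
    the time to transmit every remaining packet, each with its guard time."""
    return packets_remaining * (tx_time + guard)

def sow_at_next_hop(sow: float, tx_time: float, guard: float) -> float:
    """At each hop the advertised SOW is decremented by the transmission
    time of the current packet plus a guard time for propagation delay."""
    return sow - (tx_time + guard)

# Three packets left, 10 ms per transmission, 1 ms guard: neighbours that
# overhear the session stay in NCTS mode for the advertised SOW, which
# shrinks hop by hop exactly as in Figure 4 (p - n, p - 2n, ...).
sow = initial_sow(3, 0.010, 0.001)             # p
sow_next = sow_at_next_hop(sow, 0.010, 0.001)  # p - n
```

Each forwarder restamps the outgoing packet with the decremented value, so overhearing nodes always see the time remaining from their own vantage point.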


Fig. 4. Illustration of the dissemination of SOW timings. 'p' represents the time necessary to send all remaining packets, while 'n' represents the time to transmit a single packet plus propagation delay.

Control packets from the routing and clustering process are queued and sent using CSMA/CA whenever a node is in CTS mode. To take further advantage of the MAC's data sessions in preventing possible collisions, nodes are also allowed to send control packets within a data session by extending the SOW time by a fixed amount.

Retry Model: The MAC stores any RTS or data packet it sends in a retry queue. Until an eACK or iACK is heard for that packet, the packet will be retried up to three times within a single data session. Nodes continue to receive data and issue eACKs for


data packets while retrying the other packet. At the end of the data session, nodes move any outstanding packets into their own data queues and send them subsequently as if they were the initial sender. If a packet fails to be sent in two separate data sessions, an error report is sent to the routing and clustering process for further action.

The MAC brings the added capability of any node taking over and forwarding packets to the destination (the CH), and it uses the VIDs to burst-forward packets from the CCs to the CH. This is the uniqueness of the proposed solution and the primary reason for integrating the different operations: the natural dependency of all three schemes upon one algorithm. Separating them into different layers would have resulted in suboptimal performance of the framework, which would not be an efficient solution for critical applications such as surveillance networks.
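The retry model above can be sketched as follows. The class and function names are illustrative; the two thresholds (three retries within a session, two failed sessions before an error report) are the ones stated in the text.

```python
from collections import deque

MAX_RETRIES_PER_SESSION = 3   # per-packet retries within one data session
MAX_FAILED_SESSIONS = 2       # sessions before escalating to routing/clustering

class RetryEntry:
    """Bookkeeping for one unacknowledged RTS or data packet."""
    def __init__(self, packet):
        self.packet = packet
        self.retries = 0           # retries used in the current session
        self.failed_sessions = 0   # sessions in which this packet failed

def on_missing_ack(entry, data_queue, error_reports):
    """Invoked when neither an eACK nor an iACK is heard for a sent packet."""
    if entry.retries < MAX_RETRIES_PER_SESSION:
        entry.retries += 1
        return "retransmit"
    # End of the data session: the packet moves to this node's own data
    # queue, to be sent later as if this node were the initial sender.
    entry.failed_sessions += 1
    entry.retries = 0
    if entry.failed_sessions >= MAX_FAILED_SESSIONS:
        error_reports.append(entry.packet)  # report to routing/clustering
        return "reported"
    data_queue.append(entry.packet)
    return "requeued"
```

A packet thus gets three retransmissions per session, is re-queued once under a new session, and is escalated to the routing and clustering process only after failing in two sessions.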

5 Simulation and Performance

While there are numerous routing and cluster-based routing algorithms proposed in the literature, they have not been evaluated for the type of surveillance applications considered in this article, nor are their published performance metrics the same as ours. Hence the results published for these algorithms cannot be compared directly with our results, nor would it be reasonable for us to model every solution we judged suitable for a comparative study. We therefore conducted our comparison with two well-known routing protocols, OLSR and AODV; the first is a proactive routing protocol and the second a reactive one. We use the proactive protocol OLSR to evaluate and compare the performance of our solution on small networks of around 20 nodes. To make the studies comparable, we designated certain nodes as data collection nodes and as the destination for data-sending nodes in their vicinity, thereby sidestepping the cluster formation problem. We used the reactive protocol AODV to evaluate and compare performance from the control overhead perspective in networks of 50 and 100 nodes; in this case also, the collection nodes were designated as the destination for nodes in their vicinity. For completeness, we evaluated OLSR, AODV and MMT for all of the 20, 50 and 100 node scenarios, with varying numbers of senders.

This work was conducted as part of an ONR-funded project, in which we were expected to use the ns-2 simulation tool; we used ns-2.34 for the MMT solution. The OLSR and AODV models available in ns-2 were not designed to operate in network scenarios such as those outlined above, hence for OLSR and AODV we used the custom-developed models, together with the 802.11 CSMA/CA models, available in Opnet.
These Opnet models provide flexibility in selecting optimal parameters, and thus optimal operating conditions, through proper setting of the retry times and of the intervals for sending 'hello', 'topology control' and other control messages for OLSR and AODV. The scenario setup for the MMT solution in ns-2, however, faced constraints due to the random placement and selection of sending nodes, as compared with selecting the nodes closest to the designated destination (alias the CH), as in Opnet. We therefore recorded the average number of hops between source and destination nodes in all our test scenarios, to serve as a baseline for comparison.


Simulation parameters: The transmission range was maintained at approximately 10 km. The data rate was set to 11 Mbps, one of the standard 802.11 data rates. No error correction was used for the transmitted packets, and any packet with a single bit error was dropped. Circular trajectories with radii of 10 km were used. The reason for using circular trajectories was to introduce more stress into the test scenarios, as these trajectories result in more route breaks than the elliptical trajectories that would normally be used. Some of the trajectories used clockwise movement, while others used anti-clockwise movement; this again introduced stress into the test scenarios. The UAV speeds varied between 300 and 400 km/h. The hello interval was maintained at 10 seconds. The above scenario parameters were kept consistent for all test scenarios. The performance metrics targeted were:

• Success rate, calculated as the percentage of packets successfully delivered to the destination node.
• Average end-to-end packet delivery latency, calculated in seconds.
• Overhead, calculated as the ratio of control bits to the sum of control and data bits during data delivery, for compatibility of comparisons with reactive routing.

All the above performance metrics were recorded, along with the average number of hops between sender and receiver nodes, for 20, 50 and 100 nodes, where the number of sending nodes was varied depending on the test scenario. The files used for the data sessions were each 1 MB, and the packet size was 2 KB. In a session, all senders would start sending their 1 MB file simultaneously towards the CH. We provide an in-depth explanation of the 20-node graphs; the graphs for 50 and 100 nodes show similar trends, hence we do not repeat the explanations.
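The three metrics can be computed directly from per-run counters. A minimal sketch; the counter names are ours, not from the simulation tools:

```python
def success_rate(delivered: int, sent: int) -> float:
    """Percentage of packets successfully delivered to the destination."""
    return 100.0 * delivered / sent

def avg_latency(per_packet_latencies) -> float:
    """Average end-to-end packet delivery latency in seconds."""
    return sum(per_packet_latencies) / len(per_packet_latencies)

def overhead(control_bits: int, data_bits: int) -> float:
    """Ratio of control bits to total control-plus-data bits; expressing
    overhead this way keeps proactive and reactive protocols comparable."""
    return control_bits / (control_bits + data_bits)
```

Note that the overhead definition divides by control plus data bits, not by data bits alone, which is what makes the proactive (OLSR), reactive (AODV) and MMT numbers directly comparable.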

Fig. 5. Performance Graphs for 20 Node Scenario


Analysis of results for the 20-node test scenario: Figure 5 shows the four performance graphs based on the results collected under the 20-node scenario. The number of senders was varied from 5 to 10 to 16; in the last case, as there were 4 data aggregation nodes, all other nodes (i.e., all CCs) were sending data to their respective CHs. The first graph plots the success rate versus the number of sending nodes. In the MMT-based framework, the success rate remained 100% as the number of sending nodes was increased from 5 to 10 to 16. For AODV and OLSR the success rate was high with 5 senders but decreased with an increase in the number of senders. While the success rate for AODV drops to 82%, for OLSR it dropped only to 87%. The success rate for OLSR with 10 senders is lower than with 16 senders. This discrepancy becomes clear if we look at the average number of hops between sending and receiving nodes: with 5 senders the average number of hops recorded was 1.38, with 10 senders it was 1.32, and with 16 senders it dropped to 1.22. This happened because the 5 senders selected first were farthest from the designated destination node. In the case of 10 senders, the 5 added senders were closer to the destination node, and the last 6 senders were closer still, bringing down the average hop count and thus increasing the success rate of packet delivery. Between 5 and 10 senders, the average hop count dropped by only 0.06, yet due to the increased traffic in the network the success rate still decreased. A similar explanation holds for the MMT framework, where the average hop count with 10 senders is lower than with 16 senders; however, this did not affect the success rate, and all packets were delivered successfully. MMT and AODV show very low latency compared to OLSR.
Due to the reduced success rate in the case of AODV, fewer packets were delivered, and thus there is a dip in the average latency with 10 sending nodes: the data traffic in the network is lighter, and the packets that were taking longer never made it to the destination. OLSR shows a higher latency due to the control traffic, which delays the data traffic. The MMT solution has very low overhead compared to OLSR and AODV in all three cases of 5, 10 and 16 senders. The reason for this can be attributed to MMT's local recovery from link failures, as compared with OLSR, which requires resending the updated link information, and AODV, which has to rediscover routes when cached routes are stale. A second reason could be the reduced collisions and better throughput due to the BF-MAC. A point worth noting is that although MMT adopts a proactive routing approach, its overhead is much lower than that of the reactive routing used in AODV, even with a small number of sending nodes, i.e., 5 senders.

Validation of the Comparison Process: It may seem to the reader that there are several improved variations of OLSR and AODV that may have performed better than plain OLSR and AODV. However, it should be noted that the proposed framework outperforms OLSR and AODV significantly in all performance aspects, especially for the type of surveillance applications considered in this work. This is despite the fact that the average number of hops between the sending and receiving nodes in the MMT framework is significantly higher than for OLSR in all three cases of 5, 10 and 16 senders, and comparable with AODV for 10 and 16 senders but higher in the case of 5 senders.


Fig. 6. Performance Graphs for 50 Node Scenario

Analysis of results for the 50-node test scenario: Figure 6 shows the four graphs for the 50-node scenario. The MMT-based solution continues to maintain a success rate very close to 100% as the number of senders increases to 40, where all CCs send to their respective CHs. OLSR and AODV show a decrease in the success rate, with AODV's drop being larger than OLSR's with 40 senders; this can be attributed to the increased number of senders, a well-known phenomenon with reactive routing protocols. The average end-to-end packet delivery latency for OLSR is higher than for AODV, because of the higher average hop count with 20 senders and the higher number of packets successfully transmitted with 40 senders. The end-to-end packet delivery latency for MMT is still quite low and comparable to that achieved with AODV, in which 15 to 35% of the packets were not delivered. The overhead with MMT is now at 10%, compared with around 20% for OLSR and over 30% for AODV.

Analysis of results for the 100-node test scenario: Figure 7 shows the four graphs for the 100-node scenario. MMT consistently exhibits performance similar to that seen for 20 and 50 nodes, with a slight increase in overhead and latency as the number of senders increases, while its average hop count is still greater than for AODV and OLSR. OLSR shows a further drop in the success rate compared to the 50-node scenario, due to the limitations faced when flooding the topology control messages. The AODV success rate starts at 75% and drops to 68% for 40 senders and 47.5% for 80 senders, as expected. The overhead for AODV is higher than in the 50-node scenario, as there are more discovery messages, while OLSR maintains an overhead between 20% and 30%.


N. Martin, Y. Al-Mousa, and N. Shenoy

Fig. 7. Performance Graphs for 100 Node Scenario

6 Conclusion

In this paper, we presented an integrated routing, clustering, and MAC framework based on a meshed tree principle, where all three operations use features of the meshed tree algorithm. The framework was designed especially for airborne surveillance networks, to collect surveillance data with minimal loss and in a timely manner. We evaluated the framework and compared it with two standard protocols, OLSR and AODV, using comparable network settings in each case. The performance of the proposed solution indicates its high suitability for such surveillance applications.

References

1. Shenoy, N., Pan, Y., Narayan, D., Ross, D., Lutzer, C.: Route Robustness of a Multi-meshed Tree Routing Scheme for Internet MANETs. In: Proceedings of IEEE Globecom 2005, St. Louis, November 28-December 2 (2005)
2. Abolhasan, M., Wysocki, T., Dutkiewicz, E.: A Review of Routing Protocols for Mobile Ad Hoc Networks. Journal of Ad Hoc Networks (2004)
3. Royer, E.M., Toh, C.-K.: A Review of Current Routing Protocols for Ad Hoc Mobile Wireless Networks. IEEE Personal Communications Magazine, 46–55 (April 1999)
4. Perkins, C.E., Royer, E.M., Das, S.R.: Ad Hoc On-Demand Distance Vector (AODV) Routing. IETF MANET Working Group, RFC 3561
5. Johnson, D.B., Maltz, D.A., Hu, Y.-C.: Dynamic Source Routing Protocol for Mobile Ad Hoc Networks. IETF MANET Working Group, Internet Draft (February 24, 2003)


6. Clausen, T., Jacquet, P.: Optimized Link State Routing Protocol (OLSR). IETF Network Working Group, RFC 3626
7. Das, S., Castaneda, R., Yan, J.: Simulation-Based Performance Evaluation of Routing Protocols for MANETs. Mobile Networks and Applications 5, 179–189 (2000)
8. Pei, G., Gerla, M., Chen, T.-W.: Fisheye State Routing: A Routing Scheme for Ad Hoc Wireless Networks. In: IEEE ICC, vol. 1, pp. 70–74 (2000)
9. Bellur, B., Ogier, R.G.: A Reliable, Efficient Topology Broadcast Protocol for Dynamic Networks. In: Proc. IEEE INFOCOM 1999, New York (March 1999)
10. Santivanez, C., Ramanathan, R., Stavrakakis, I.: Making Link-State Routing Scale for Ad Hoc Networks. In: Proceedings of MobiHoc 2001, Long Beach, California (October 2001)
11. Pei, G., Gerla, M., Hong, X., Chiang, C.-C.: A Wireless Hierarchical Routing Protocol with Group Mobility. In: Proceedings of IEEE WCNC 1999, New Orleans, LA (September 1999)
12. Haas, Z.J., Pearlman, M.R.: Performance of Query Control Schemes for the Zone Routing Protocol. ACM/IEEE Transactions on Networking 9(4), 427–438 (2001)
13. Pei, G., Gerla, M., Hong, X.: LANMAR: Landmark Routing for Large Scale Wireless Ad Hoc Networks with Group Mobility. In: Proceedings of IEEE/ACM MobiHoc 2000, Boston, MA, pp. 11–18 (August 2000)
14. Xu, K., Hong, X., Gerla, M.: An Ad Hoc Network with Mobile Backbone. In: Proceedings of ICC 2002, April 28-May 2, vol. 5, pp. 3138–3143 (2002)
15. Ramasubramanian, V., Haas, Z.J., Sirer, E.G.: SHARP: A Hybrid Adaptive Routing Protocol for Mobile Ad Hoc Networks
16. Lin, C.R., Gerla, M.: Adaptive Clustering for Mobile Wireless Networks. IEEE Journal on Selected Areas in Communications 15(7), 1265–1275 (1997)
17. Basagni, S.: Distributed and Mobility-Adaptive Clustering for Multimedia Support in Multi-hop Wireless Networks. In: IEEE VTS 50th Vehicular Technology Conference, VTC 1999 Fall, vol. 2, pp. 889–893 (1999)

Security in the Cache and Forward Architecture for the Next Generation Internet

G.C. Hadjichristofi 1, C.N. Hadjicostis 1, and D. Raychaudhuri 2

1 University of Cyprus, Cyprus
2 WINLAB, Rutgers University, USA
[email protected], [email protected], [email protected]

Abstract. The future Internet architecture will be composed predominantly of wireless devices. It is evident at this stage that the TCP/IP protocol, developed decades ago, will not properly support the required network functionalities, since contemporary communication profiles tend to be data-driven rather than host-based. To address this paradigm shift in data propagation, a next-generation architecture has been proposed: the Cache and Forward (CNF) architecture. This research investigates security aspects of this new Internet architecture. More specifically, we discuss content privacy, secure routing, key management, and trust management. We identify security weaknesses of this architecture that need to be addressed and derive security requirements that should guide future research directions. Aspects of this research can be adopted as a stepping stone as we build the future Internet.

Keywords: wireless networks, security, cache and forward, key management, trust management, next generation Internet.

1 Introduction

The number of wireless devices has increased exponentially in the last few years, indicating that wireless will be the key driver of future communication paradigms. This explosion of wireless devices has shifted the Internet architecture from one based mainly on wired communication to a hybrid of wired and wireless communication. Wireless devices are no longer merely the edge devices of the Internet; they are also shifting into the role of mobile routers that transmit data over multiple hops to other wireless devices. In the current Internet, TCP/IP was designed as the network protocol for transmitting information and has served the Internet well for several decades. However, wireless connections are characterized by intermittent, error-prone, and low-bandwidth connectivity, which causes TCP to fail [1]. The nature of the networking problem is therefore now different, requiring a drastic shift in the solution space and, with that, a new Internet architecture. Next-generation Internet architectures aim to shift away from TCP/IP-based communication, which assumes stable connectivity between end hosts, and instead move to a paradigm where communication is content-driven.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 328–339, 2011. © Springer-Verlag Berlin Heidelberg 2011

Security in the CNF Architecture for the Next Generation Internet


A recently proposed next-generation Internet architecture is the Cache and Forward (CNF) architecture. The objective of this architecture is to move files or packages from source to destination over both wired and wireless hops as connectivity becomes available, i.e., to use opportunistic transport. The architecture is built on a subset of existing Internet routers and leverages the decreasing cost of memory storage. Because this architecture operates differently, security aspects (such as key management) need to be revisited and augmented accordingly. This research is an investigation of security aspects of the future Internet architecture. We investigate ways in which the CNF architecture can provide the required security with regard to data privacy, secure routing of files at higher OSI layers (i.e., at the CNF layers), key management, and trust management. The aim of this paper is not to present complete solutions for these security areas, but rather to present the security strengths and weaknesses of the CNF architecture and to discuss possible solution scenarios, as a means to point out security vulnerabilities and to motivate and direct future research. Based on this discussion, we extract key challenges that need to be addressed to provide a more complete system security solution. It is important to ensure that security is built into systems to allow the secure and dynamic access of information. To the best of our knowledge, this is the first investigation of security issues in this architecture. Section 2 describes the CNF architecture. Section 3 provides the security analysis of the CNF architecture and extracts the security requirements for this new architecture; topics covered are content privacy, secure routing, key management, and trust management. Section 4 concludes the paper.

2 Cache and Forward Architecture

Existing and, even more so, future Internet routers will have higher processing power and storage. In the CNF architecture, it is envisioned that the wired core network consists of such high-capacity routers [2]. Not all nodes in the network need to have high-capacity storage; we use the term CNF routers to signify routers with these higher capabilities. In addition to CNF routers, the future Internet will have edge networks with access points called post offices (POs), and multi-hop wireless forwarding nodes called Cache and Carry (CNC) routers. POs are CNF routers that link mobile nodes to the wired backbone and act as post offices by holding files for mobile nodes that are disconnected or unavailable. CNC routers are mobile wireless routers with relatively smaller capacity than CNF routers. The storage cache of nodes in the CNF architecture is used to store packets in transit, as well as to offer in-network caching of popular content. The unit of transportation in the CNF architecture is a package. A package may represent an entire file, or a portion of a file when the file is very large (e.g., a couple of gigabytes); it is therefore expected that fragmentation of files will be performed by the CNF architecture. Fragmentation allows more flexibility in terms of routing and Quality of Service (QoS), and makes data propagation more robust over single CNF hops, especially between wireless devices. In this paper, we use the terms package and file interchangeably to denote the unit of transportation within the CNF architecture.
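The fragmentation and reassembly described above can be sketched as follows. This is a minimal illustration only; the paper does not specify a package format, so the package size, field names, and dictionary layout are assumptions.

```python
# Sketch of CNF-style file fragmentation (illustrative only; the paper
# does not define a package format -- sizes and field names are assumed).

PACKAGE_SIZE = 1 * 1024 * 1024  # assumed 1 MB payload per package


def fragment(content_id: str, data: bytes, size: int = PACKAGE_SIZE):
    """Split a file into packages that can be routed and cached independently."""
    total = (len(data) + size - 1) // size
    return [
        {"content_id": content_id, "seq": i, "total": total,
         "payload": data[i * size:(i + 1) * size]}
        for i in range(total)
    ]


def reassemble(packages):
    """Reorder packages by sequence number and concatenate payloads."""
    ordered = sorted(packages, key=lambda p: p["seq"])
    assert len(ordered) == ordered[0]["total"], "missing packages"
    return b"".join(p["payload"] for p in ordered)


blob = b"x" * (2 * PACKAGE_SIZE + 100)
pkgs = fragment("video:42", blob)
assert len(pkgs) == 3
assert reassemble(pkgs) == blob
```

Because each package carries its own sequence number and total, it can be cached or re-routed independently of its siblings, which is what gives the architecture its routing and QoS flexibility.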

330

G.C. Hadjichristofi, C.N. Hadjicostis, and D. Raychaudhuri

The main service offered by this architecture is content delivery that overcomes the pitfalls of TCP/IP in the face of the intermittent connectivity that characterizes wireless networks. Files are transferred hop-to-hop in either "push" or "pull" mode, i.e., a mobile end user may request a specific piece of content, or the content provider may push the content to one (unicast) or more (multicast) end users. A performance evaluation of the CNF architecture is out of the scope of this paper and is offered in [3]-[7]. Fig. 1 shows the concept of the CNF network. At the edge of the wired core network, POs such as CNF1 and CNF4 serve as holding and forwarding points for content (or pointers to content) intended for mobiles, which may be disconnected at times. The sender, Mobile Node 1 (MN1), forwards the file, or portions of the file, to the receiver's PO (CNF4) using conventional point-to-point routing. CNF4 holds the file or pointer until it contacts MN3 to arrange delivery. Delivery from CNF4 can be by direct transmission if the mobile is in range, or by a series of wireless hops as determined by the routing protocol; in the latter case a CNC node, such as MN2, is used. A GUI-based demonstration of the operation of the CNF architecture, as opposed to traditional TCP-type communication, has been developed and can be viewed at [8].
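The hold-and-forward behavior of a PO described above can be sketched in a toy model. The class and method names are assumptions for illustration; the paper does not prescribe an implementation.

```python
# Toy model of a post office (PO) holding packages for a disconnected
# mobile node until it comes in range; structure and names are assumed,
# not taken from the paper.

class PostOffice:
    def __init__(self):
        self.held = {}           # destination -> list of held packages
        self.connected = set()   # destinations currently in range

    def receive(self, dest, package):
        if dest in self.connected:
            return ("delivered", package)       # direct transmission
        self.held.setdefault(dest, []).append(package)
        return ("held", package)                # wait for opportunistic contact

    def contact(self, dest):
        """Mobile node comes in range: flush everything held for it."""
        self.connected.add(dest)
        return self.held.pop(dest, [])


po = PostOffice()
po.receive("MN3", "pkg-1")
po.receive("MN3", "pkg-2")
assert po.contact("MN3") == ["pkg-1", "pkg-2"]       # opportunistic delivery
assert po.receive("MN3", "pkg-3") == ("delivered", "pkg-3")
```

The point of the sketch is the state machine: delivery is deferred, not failed, when the destination is unreachable, which is the key difference from an end-to-end TCP connection.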

Fig. 1. The Cache and Forward Architecture

The CNF architecture operates above the IP layer, and its operation is supported by a series of protocols that handle the propagation of packages (see Fig. 2). The roles of the various CNF layers are similar to those of the OSI stack, but their scope is different, as they focus on the handling of packages. The CNF Transport Protocol (CNF TP) is responsible for sending content queries and receiving content; content fragmentation and reassembly are implemented here, along with content error checking and maintenance of a content cache. The CNF Network Protocol (CNF NP) is responsible for content discovery and for routing content towards the destination after it has been located in the network. The CNF Link Protocol (CNF LP) is designed to reliably deliver the IP packets of a package to the next hop. The control plane of the CNF architecture is supported by three protocols. The routing protocol is responsible for establishing routing paths across CNF routers. The Content Name Resolution Service (CNRS) provides an indexing mechanism to map a content ID to multiple locations of the content; the location closest to the client can be chosen for content retrieval. The Cache Management Protocol (CMP) facilitates content discovery by maintaining and updating a summary cache containing all the contents cached within an autonomous system (AS). Nodes within an AS update the summary cache, and adjacent AS gateways exchange summary cache information.

Fig. 2. The Cache and Forward protocol stack
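The CNRS lookup described above, mapping a content ID to multiple cached locations and picking the one nearest the client, can be sketched as follows. The registry layout, router names, and hop-count metric are assumptions for illustration.

```python
# Sketch of a Content Name Resolution Service (CNRS) lookup: map a
# content ID to its cached locations and pick the one nearest the client.
# The index layout and the hop-count distance metric are assumed.

cnrs_index = {
    "doc:report.pdf": ["CNF2", "CNF7", "CNF9"],  # routers caching this content
}

# Hypothetical hop counts from clients to CNF routers.
hops = {("MN1", "CNF2"): 5, ("MN1", "CNF7"): 2, ("MN1", "CNF9"): 8}


def resolve(client, content_id):
    """Return the cached location closest to the client, or None."""
    locations = cnrs_index.get(content_id, [])
    if not locations:
        return None
    return min(locations, key=lambda loc: hops.get((client, loc), float("inf")))


assert resolve("MN1", "doc:report.pdf") == "CNF7"   # 2 hops beats 5 and 8
assert resolve("MN1", "doc:missing.pdf") is None
```

In the real architecture the CMP keeps this index current per AS; the sketch only shows the selection step.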

3 Security Analysis

In this section, we look at security in the CNF architecture. Before proceeding with the security analysis, we describe some key structural aspects of the CNF architecture that are considered in our analysis. The objective of the CNF architecture is to overcome intermittent connectivity over wireless links and facilitate communication among a multitude of wireless devices. Thus, we envision the existing wired Internet as a sphere surrounded by an increasing number of wireless devices, as shown in Fig. 3. Outer layers represent wireless devices at different numbers of hops away from the wired Internet. For a specific flow, a specific number of outer layers or hops can represent the number of wireless routers required to provide service to nodes (see Fig. 3). This figure emphasizes that communication may come in different forms depending on its type over this spherical representation of the future Internet. We classify the communication patterns of the CNF architecture into three generic variations: (1) communication strictly within the wireless Internet or strictly within the wired Internet; (2) communication from the wireless Internet to a node in the wired Internet, and vice versa; and (3) communication that links two wireless parts of the Internet through the existing wired infrastructure. The third pattern poses more of a challenge in this architecture than the first two, as it is more dynamic due to changes in connectivity. Furthermore, wireless nodes may move and connect to the Internet through different CNC routers that may belong to different POs that are not even collocated. Therefore, the communication patterns change dynamically as connectivity to POs and other wireless devices varies.
This architecture uses content caching, which introduces more complexity, as the communication patterns for acquiring a specific content can vary with time due to the dynamic and changing number of CNF routers holding that specific file. Caching is vital in this architecture, as it can decrease bandwidth utilization in the Internet by increasing content availability. During content propagation over the Internet, CNF routers may save a copy of a file prior to transmission. Nodes may then obtain data from a cache that is closer than the original source. Even though an investigation of cache optimization over the CNF architecture is offered in [3], security aspects were not taken into account.

Fig. 3. Layering in the CNF architecture. Multiple layers representing intermediate wireless routers can exist.

In this research, we assume that the CNF architecture provides a methodology to name content. The development of supporting mechanisms for such a service has not been addressed, and no issues of security have been taken into account.

3.1 Privacy and File Propagation

One of the key aspects of data propagation is that entire files can be located on CNF routers. This functionality enables the caching of content at specific CNF locations while in transit. In terms of security, this provides the benefit of stopping malicious activity early in its transmission stage by having the CNF router check the content and validate that it is not a virus or spam. Furthermore, it can counteract attacks whose aim is to overload the network and consume its bandwidth, e.g., by checking for repeated transmissions of the same content. However, it concurrently enables access to sensitive content, since entire files reside on routers; it thereby breaches the privacy policies for the specific file. Caching over this architecture further complicates privacy, as it increases the exposure of sensitive content. A CNF router can be dynamically configured, via a set of security policies, to execute selective caching and stop caching sensitive content. However, there is a need to verify that such security policies have been correctly executed, i.e., that a file has not been cached. Such verification is complicated, as a control mechanism needs to be in place across CNF routers to check for possible propagation of specific content that should have been deleted by the routers. Furthermore, such a selective mechanism provides privacy from the perspective of limiting the disclosure of information by limiting the number of cached copies of sensitive files. However, it does not really prevent CNF routers from viewing the content prior to transmission. Therefore, there still exists the issue of how privacy can be guaranteed while harnessing the advantages of spam control or virus detection, which is an inherent security strength of the CNF architecture. To promote privacy, cryptographic methods need to be utilized such that propagation of entire files from one CNF router to the next minimizes the disclosure of information. Typically, cryptography utilized at the user level, i.e., between two end users, can hinder the disclosure of information and provide privacy. However, Internet users do not tend to utilize cryptographic methods, for a variety of reasons: lack of knowledge regarding security mechanisms, the implicit belief that the Internet is safe, or carelessness in handling their company's data. Regardless, at the Internet architecture level, end-to-end cryptography is not desirable because it removes the benefits obtainable through this architecture in terms of spam or virus control: CNF routers can no longer analyze the content of encrypted files. A simple way around this is to check the content for spam or viruses prior to transmission and then encrypt it. More specifically, the CNF architecture protocol that assigns content identifiers to files could also verify the content and sign it with a public-key signature certifying that it is spam- and virus-free. The assigned content identifier can also be checked by CNF routers for replay attacks that aim to consume network bandwidth. This approach can be a good preliminary solution, but it does not account for the vast amount of data generated every second in the Internet.
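The "scan, then sign" idea above can be sketched as follows. An HMAC stands in for the public-key signature the text describes, and the scanner is a trivial placeholder; the key, function names, and EICAR-style marker are all assumptions, not part of the CNF design.

```python
import hashlib
import hmac

# Sketch of scan-then-sign content certification: the naming service
# checks a file, derives its content ID from a hash, and attaches a tag
# vouching that it was scanned. An HMAC stands in for the public-key
# signature described in the text; the scanner is a placeholder.

SERVICE_KEY = b"naming-service-secret"      # assumed verification key


def looks_clean(data: bytes) -> bool:
    return b"EICAR" not in data             # placeholder spam/virus check


def certify(data: bytes):
    if not looks_clean(data):
        raise ValueError("content rejected by scanner")
    content_id = hashlib.sha256(data).hexdigest()
    tag = hmac.new(SERVICE_KEY, content_id.encode(), hashlib.sha256).hexdigest()
    return content_id, tag


def verify(content_id: str, tag: str) -> bool:
    expected = hmac.new(SERVICE_KEY, content_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)


cid, tag = certify(b"harmless payload")
assert verify(cid, tag)

# Repeated content identifiers hint at replay traffic:
seen = set()


def replay_suspicious(content_id):
    if content_id in seen:
        return True
    seen.add(content_id)
    return False


assert replay_suspicious(cid) is False
assert replay_suspicious(cid) is True
```

A router that trusts the signing service can forward certified packages without re-scanning them, which is what makes the approach attractive despite the data-volume problem discussed next.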
It is virtually impossible to have a service verify the validity of content prior to assigning a content identifier, because vast amounts of data are generated from all possible locations in the world and from all types of nodes and networks. A 2008 International Data Corporation analysis estimated that 281 exabytes of data were produced in 2007, equivalent to 281 trillion digitized novels [9]. Therefore, the challenge is to have the distributed service that provides globally certifiable content identifiers check the content and guarantee virus- or spam-free content. A variation of the above solution would be to assign the responsibility of checking content to the CNF routers: CNF routers close to the source (at the initial steps of data propagation) can check the data for spam and then carry out encryption to provide privacy. This methodology of handling content provides spam- or virus-free content, as intended by this architecture. However, relating these mechanisms back to the original scope of this architecture, encrypting files to provide confidentiality across CNF routers hinders caching. To provide privacy, content is encrypted at the first CNF router close to the source and decrypted by the CNF router closest to the destination. When symmetric key cryptography is used between these two CNF routers, an encrypted file cached at any intermediate CNF router cannot be accessed: the file can only be decrypted by the two CNF routers that hold the corresponding key, i.e., the key originally used to facilitate the communication. Even though privacy is provided, caching becomes infeasible. Caching on intermediate CNF routers along the routing path cannot work, as an intermediate router would need the symmetric key used by the two edge CNF routers in order to redistribute the encrypted cached file. Utilizing public keys does not address this issue either.
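The caching conflict can be seen in a toy demonstration: if only the edge CNF routers share the symmetric key, an intermediate router caches an opaque blob it cannot decrypt or re-serve in the clear. The keystream construction below is a toy, not a real encryption scheme, and all names are assumed.

```python
import hashlib

# Toy demonstration of the privacy/caching conflict: edge CNF routers
# share a symmetric key, so an intermediate router caches ciphertext it
# cannot decrypt. The SHA-256 keystream cipher is a toy, not a real scheme.


def keystream(key: bytes, n: int) -> bytes:
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]


def encrypt(key: bytes, data: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


decrypt = encrypt  # XOR with the same keystream is its own inverse

edge_key = b"shared-by-CNF1-and-CNF4"     # hypothetical edge routers
package = encrypt(edge_key, b"sensitive file contents")

# An intermediate router caches the package but holds no key:
cache = {"doc:1": package}
assert cache["doc:1"] != b"sensitive file contents"   # plaintext not exposed
assert decrypt(edge_key, cache["doc:1"]) == b"sensitive file contents"
```

Only a holder of `edge_key` can turn the cached blob back into content, which is exactly why the text argues for group keys shared among dynamically selected sets of CNF routers.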
A package encrypted with the public key of a CNF router provides privacy of the content, as only the CNF router with the corresponding private key can decrypt it. Thus, a file en route cannot be cached and redistributed as needed. It is evident that more complex content distribution schemes are needed among CNF routers to balance the requirements of privacy and caching, while allowing the inherent security features of this architecture (such as spam or virus control) to exist. These solutions should strategically cache files on intermediary CNF routers while concurrently allowing them to decrypt and redistribute files to multiple parties. To achieve this, symmetric keys could be shared among dynamically selected groups of CNF routers. These group selections need to be integrated with caching algorithms that optimize content availability. In addition, the selection of groups of CNF routers to handle specific content needs to be in accordance with the privacy requirements of the content. This is a topic of future work for this architecture. Summarizing, content in the CNF architecture must be handled at its initial stages of transmission to provide spam- or virus-free content. This can be done by checking the content during the provision of a content identifier or at the first CNF router close to the source. Issues of privacy need to be revisited with caching in mind, such that strategically chosen CNF routers can increase availability while complying with the content security requirements. The key management system should provide the key distribution methodologies necessary to support dynamic group keys for dynamically formed groups of CNF routers based on traffic patterns.

3.2 Secure Routing over the CNF Architecture

Our focus in terms of secure routing is to examine how the selection of secure paths for the propagation of packages can be guided by the internal components of the CNF architecture.
Even though packetization of data in the future Internet may still be facilitated by the IP protocol, the CNF architecture operates at higher layers, as shown in Fig. 2 by the CNF TP, CNF NP, and CNF LP. The CNF architecture deals with the propagation of data files or packages between CNF routers. Thus, the CNF routers create an overlay network whose topology differs from the actual connectivity of the underlying routers. The aspect that needs to be investigated in terms of secure routing at the CNF layers is the security requirements of the content. The CNF architecture needs to facilitate the creation of content-based security classifications. More specifically, information may have restrictions in terms of how, when, and where it is propagated. Some content may need to traverse the Internet through specific routes: certain locations in the world may have more malicious activity than others, or specific data should not be disclosed to certain areas. In addition, data may need to propagate within specific time boundaries, or within specific periods after which it has to be deleted. Moreover, data exposure requirements in terms of visibility may differ based on the content. These content security requirements create a complex problem, as they need to be taken into account while optimizing caching and addressing other functional aspects of the CNF architecture, such as QoS. Another aspect of secure routing is trust management. Trust can be used as a means to guide the selection of routing paths for packages. Over the years, several methods have been developed that enable the dynamic assessment of the trustworthiness of nodes. This assessment is done by looking at certain functionalities, such as packet forwarding, and allowing routers to check and balance one another. CNF routers can be graded for trust at the IP layer by counting IP packet forwarding (or, generally, packet forwarding below the CNF layers). However, not every router on the Internet will have CNF capabilities, which implies that non-CNF routers within one-hop connectivity of CNF routers would need to evaluate them. This mechanism requires communication between the overlay of CNF routers and the non-CNF routers, since reports need to be communicated to, and made believable by, the CNF routers. If such integration is not feasible, then reporting will need to be executed among CNF routers only. In this case, reputation in the CNF overlay will provide granularity at the CNF layer, meaning that assessments have to be made in terms of packages and not IP packets (or lower-layer packets); the control mechanism will need to exist among CNF routers. The checks executed to evaluate trust can extend beyond forwarded packages: they can cover the correctness of the applied cryptographic algorithms, the correctness of the secure routing operation at the CNF layers, and compliance with the security policies regarding content requirements. If there is an overwhelmingly large amount of content to be handled, the checks and balances may be confined to randomly chosen packages or to packages with highly sensitive content. Summarizing, there is a need to classify content based on its security and functional requirements, such as QoS, to effectively execute secure path selection over the CNF overlay. Trustworthiness information derived from internal characteristics of the CNF architecture's operation can further guide routing decisions among CNF routers.

3.3 Key and Trust Management

A Key Management System (KMS) creates, distributes, and manages the identification credentials used during authentication.
The role of key management is important, as it provides a means of user authentication. Authentication is needed for two main reasons: accountability and continuity of communication. Knowing a peer's identity during a communication provides an entity that can be held accountable for malicious activity or for untrustworthy information shared. In addition, it provides a link through which to continue communication in the future. This continuity enables the communicating parties to assess and build trust towards one another based on their shared experiences of collaboration. Thus, the verification of identity is linked with authentication and provides accountability, whereas behavior grading is linked with continuity of communication. These aspects have traditionally been treated separately, through key management and trust management, respectively. The authors in [10] argue that, in social networks, as trust develops it takes the form of an identity (i.e., identity-based trust), since two peers know each other well and know what to expect from one another. Thus, it is important for the future Internet that emphasis is placed on the link between the two areas. Verifying an identity's credentials does not and should not imply that a peer is trustworthy, as trust is a quantity that typically varies dynamically over time. Authentication implies some initial level of trust, but individual behavior should dynamically adjust that initial trust level. Therefore, trust management needs to be taken into account so as to assess the reliability of authenticated individuals.


In the CNF architecture, global user authentication is required for the multitude of wireless nodes that dynamically connect to the Internet via POs. Thus, in our description of trust and key management we focus on the wireless Internet that the CNF architecture aims to accommodate. More specifically, wireless nodes need to have an identity that is believable across multiple POs, enabling secure communication between wireless nodes that may not be in the same wireless network. POs have a vital role in terms of key management because they are the last link on the wired Internet and are responsible for package delivery to MNs. Their location is advantageous, as it enables them to link wireless nodes to the rest of the Internet. POs deliver data to wireless nodes by transmitting files using opportunistic connectivity. Thus, they can act as delegated certificate authorities in a public key infrastructure, verifying the identities of nodes and generally managing the certificates of wireless nodes. Their location can also facilitate integration with trust management. Until now, there has been no global methodology for quantifying trust; such a global metric demands an Internet architecture that enables this information to be extracted dynamically at a global level. One of the main differences between the existing Internet architecture and the CNF architecture is the communication paradigm: in the CNF architecture, the emphasis is placed on content. Another difference is the use of POs at the edges of the wired Internet. Using these structural characteristics, the CNF architecture can provide a base on top of which to address trust and key management. More specifically, the coupling of identification and trustworthiness can be achieved by utilizing the two aforementioned characteristics of the CNF architecture: (1) the content of the data, and (2) the location of POs.
Since the CNF architecture is content driven, the content can provide some form of characterization of an identity so as to better understand the trust that can be placed on a user. For example, an entity that downloads movies can be placed on a different trust level compared to an entity that downloads documents regarding explosives. Utilizing such a functionality to provide trust for this architecture requires the careful marking of content to indicate certain categories of data classifications that can characterize trust. In addition to this classification, the location of POs within the CNF architecture can further assist in assessing trust. Nowadays certain areas in the world suffer from higher crime tendencies compared to others. Therefore, there is an increasing possibility to utilize the Internet to reflect that behavior. Based on this conjecture, one can approximate that if sensitive data are acquired by POs at specific locations, the level of trust must be adjusted taking into account the locations of the POs as well. Trust metrics based on PO location can be carefully selected to reflect malicious activity in those locations. Those values would have to be monitored and dynamically adjusted over time. (Note that the IP address assignment is handled by IANA and therefore the location of POs can be obtained with some level of accuracy.) Utilizing the above architectural characteristics allows the POs to manage certificates for all the wireless nodes that it services. Based on specific data that nodes acquire POs can introduce some form of trust marking on certificates that they publish that characterizes the behavior of wireless nodes. That information can guide the decision of SA establishment during authentication. Some other criteria that the POs can record are the type of node (e.g., visiting, duration of visit), and activity in

Security in the CNF Architecture for the Next Generation Internet

337

bytes downloaded. In addition, those decisions of trust marking could be further guided by metrics of behavior obtained from local reputation mechanisms within a wireless network (e.g., whether wireless nodes collaborate with their peers by forwarding packets). Overall, this trustworthiness information can guide future interactions among nodes. An overview of previous work that deals with extracting trust information in mobile wireless environments is offered in [11]. At the global scale, there is also a need to grade and monitor the trustworthiness of POs that act on behalf of the wireless network. As with grading nodes in the wireless network, the type of content and the location of the POs can be considered. However, to grade a PO, the paths of data content to/from that PO can also be considered. A distributed mechanism can be introduced in the CNF overlay that uses the CNF architecture to mark the flows for trustworthiness grading of the POs in the Internet. Fig. 4 demonstrates the various flows that may exist. The POs form a circle at the edge of the wired Internet and data paths flow in multiple directions.

[Figure 4: MNs connect through POs arranged around the edge of the wired Internet; data flows travel in multiple directions between MNs, POs, and the wired core.]

Fig. 4. Directions of flows in the CNF architecture

Based on the integration with trust information, a KMS can provide trust criteria for authenticated users at a global scale. One aspect that needs to be addressed is whether the future behavior of nodes can be predicted based on their existing trust level. This need for behavior prediction opens up the question of whether trust can be modeled at a global scale and how accurately it can be assessed dynamically. Another issue is the notion of positive grading for trust. For example, downloading information about helping other human beings or helping the environment does not necessarily indicate a trustworthy destination. If that aspect is used to improve the trust of a destination, then the issue that needs to be resolved is that certain data may be requested simply to trick the grading mechanism into improving one's trustworthiness in the Internet environment. Another aspect that needs to be considered is the frequency with which certain data types are directed to specific POs. Acquiring one file about making explosives may not be the same as acquiring a hundred documents. In terms of the type of information, it is very important that the categories that characterize trust are carefully selected. If a person acquires medical documents about

338

G.C. Hadjichristofi, C.N. Hadjicostis, and D. Raychaudhuri

diseases, and someone else requests physics-related information, that does not necessarily translate to a specific level of trust. There is a need to investigate how the type of data or content can represent the trust placed on nodes and, indirectly, on users. In addition, there is a need to link different types of content that may indicate certain trends of behavior. Such a classification is a complex issue, as it is rooted in the complexity of human societal behavioral patterns. Summarizing, POs in the CNF architecture can provide identification criteria to the multitude of wireless devices and link them to the Internet. Identity verification coupled with behavior at the local and global scale can guide the trust placed on interactions among peers. Further research is required to come up with meaningful ways of assessing trust based on the content-driven paradigm of communication of the CNF architecture.
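The grading discussed above can be made concrete with a small sketch. The category weights, the location risk values, and the logarithmic scaling with download frequency below are all illustrative assumptions, not values proposed by the architecture; selecting meaningful ones is precisely the open problem the text identifies.

```python
import math

# Illustrative risk tables (assumptions for the example, not CNF-defined values).
CATEGORY_RISK = {"movies": 0.1, "documents": 0.0, "explosives": 0.9}
LOCATION_RISK = {"low-crime-region": 0.0, "high-crime-region": 0.3}

def trust_score(downloads, po_location):
    """Return a trust value in [0, 1]; lower means less trustworthy.

    `downloads` maps a content category to the number of files acquired,
    so that a hundred documents about explosives weigh more than one,
    while the sub-linear log scaling keeps the penalty from exploding.
    """
    penalty = 0.0
    for category, count in downloads.items():
        penalty += CATEGORY_RISK.get(category, 0.0) * math.log1p(count)
    # The PO's location adjusts the score, as conjectured in the text.
    penalty += LOCATION_RISK.get(po_location, 0.0)
    return max(0.0, 1.0 - penalty)
```

For instance, a node that only fetches plain documents through a low-risk PO keeps a score near 1.0, while repeated acquisitions of high-risk content through a PO in a high-risk location drive the score toward 0.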

4 Conclusions In this research, we have looked at security aspects of the CNF architecture, identified strengths and weaknesses, and derived security requirements that need to be taken into account. We have discussed possible security solutions for data privacy, secure routing, key management, and trust management, while concurrently identifying future issues that need to be addressed. Even though the CNF architecture has functional benefits in terms of overcoming the limitation of intermittent connectivity in the now predominantly wireless Internet, it also introduces security challenges. A balance needs to be struck between data content privacy, caching, and secure routing. We need to ensure content privacy while taking advantage of security features that naturally emerge from the CNF architecture, such as control of spam, viruses, and other related attacks. In addition, since the architecture is content driven, there is a need to define content security requirements and to route based on those requirements. However, for secure routing to exist there is also a need to assess the trustworthiness of CNF routers, which requires additional mechanisms to verify their correct operation. The need for trustworthiness implies the existence of authentication, as it provides the base on which to build trust. Authentication of the multitude of wireless devices in the CNF architecture may be facilitated by a key management system operated by the POs. Their location enables the integration of key management with trustworthiness information. Since the CNF architecture is content-driven, trust can be extracted by examining the content that flows through POs. This aspect brings the challenge of coming up with meaningful ways of evaluating trust at a global scale to match the requirements of users or applications. These issues need to be carefully assessed in the future, keeping other aspects, such as QoS, in mind as well.
Overall, this analysis has brought to light the security requirements and open research issues that exist in the CNF architecture. Aspects of this investigation can serve as guidance for the design of secure future Internet architectures. Acknowledgments. George Hadjichristofi was supported by the Cyprus Research Promotion Foundation under grant agreement (ΤΠΕ/ΠΛΗΡΟ/0308(ΒΕ)/10). Christoforos Hadjicostis would also like to acknowledge funding from the European Commission's Seventh Framework Programme (FP7/2007-2013) under grant agreements INFSO-ICT-223844 and PIRG02-GA-2007-224877. Any opinions, f


ISSN 0302-9743; ISBN-10 3-642-17678-X; ISBN-13 978-3-642-17678-4; Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2011 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180

Message from the General Chairs

On behalf of the Conference Committee for ICDCN 2011, it is our pleasure to welcome you to Bangalore, India for the 12th International Conference on Distributed Computing and Networking. ICDCN is a premier international forum for distributed computing and networking researchers, vendors, practitioners, application developers, and users, organized every year with support from industry and academic sponsors. Since the first conference on distributed computing held in 2000, ICDCN has become a leading forum for researchers and practitioners to exchange ideas and share best practices in the field of distributed computing and networking. In addition, ICDCN serves as a forum for PhD students to share their research ideas and get quality feedback from renowned experts in the field. This reputation rests on the quality of the work submitted, the standard of the tutorials and workshops organized, the dedication and sincerity of the Technical Program Committee members, the quality of the keynote speakers, the ability of the Steering Committee to react to change, and the policy of being friendly to students by sponsoring a large number of travel grants and keeping the registration cost among the lowest of international conferences anywhere in the world. This 12th ICDCN illustrates the intense productivity and cutting-edge research of the members of the distributed computing and networking community across the globe. This is the first time ICDCN has been hosted in Bangalore, the Silicon Valley of India. It is jointly hosted by the leading global information technology company Infosys Technologies, headquartered in Bangalore, and the renowned research and academic institution the International Institute of Information Technology-Bangalore (IIIT-B). Bangalore is the hub of information technology companies and the seat of innovation in the high-tech industry in India.
The richness of its culture and history blended with the modern lifestyle and the vibrancy of its young professional population, together with its position at the heart of southern India, make Bangalore one of the major Indian tourist destinations. We are grateful for the generous support of our numerous sponsors: Infosys, Google, Microsoft Research, HP, IBM, Alcatel-Lucent, NetApp, and NIIT University. Their sponsorship is critical to the success of this conference. The success of the conference depended on the help of many other people too, and our thanks go to each and every one of them: the Steering Committee, which helped us in all stages of the conference, the Technical Program Committee, which meticulously evaluated each and every paper submitted to the conference, the Workshop and Tutorial Committee, which put together top-notch and topical workshops and tutorials, the Local Arrangements and Finance Committee, who worked day in and day out to make sure that each and every attendee of the conference feels


at home before and during the conference, and the other Chairs who toiled hard to maintain the high standards of the conference, making it a great success. Welcome and enjoy ICDCN 2011, Bangalore and India. January 2011

Sanjoy Paul Lorenzo Alvisi

Message from the Technical Program Chairs

The 12th International Conference on Distributed Computing and Networking (ICDCN 2011) continues to grow as a leading forum for disseminating the latest research results in distributed computing and networking. It is our greatest pleasure to present the proceedings of the technical program of ICDCN 2011. This year we received 140 submissions from all over the world, including Austria, Canada, China, Finland, France, Germany, India, Iran, Israel, The Netherlands, Portugal, Singapore, Spain, Sri Lanka, Switzerland, and USA. These submissions were carefully reviewed and evaluated by the Program Committee, which consisted of 36 members for the Distributed Computing track and 48 members for the Networking track. For some submissions, the Program Committee further solicited additional help from external reviewers. The Program Committee eventually selected 31 regular papers and 3 short papers for inclusion in the proceedings and presentation at the conference. It is our distinct honor to recognize the paper “Generating Fast Indulgent Algorithms” by Dan Alistarh, Seth Gilbert, Rachid Guerraoui, and Corentin Travers as the Best Paper in the Distributed Computing track, and the paper “GoDisco: Selective Gossip-Based Dissemination of Information in Social Community-Based Overlays” by Anwitaman Datta and Rajesh Sharma as the Best Paper in the Networking track. In both the reviewing and the best paper selection processes, PC members and PC Chairs who had a conflict of interest with any given paper were excluded from the decision-making process related to that paper. Besides the core technical program, ICDCN 2011 offers a number of other stimulating events. Before the main conference program, we have a full day of tutorials. During the main conference, we are fortunate to have several distinguished scientists as keynote speakers. The main conference is further followed by several other exciting events including the PhD forum.
We thank all authors who submitted a paper to ICDCN 2011, which allowed us to select a strong technical program. We thank the Program Committee members and external reviewers for their diligence and commitment, both during the reviewing process and during the online discussion phase. We thank the conference General Chairs and other Organizing Committee members for working with us to make ICDCN 2011 a success. January 2011

Marcos K. Aguilera Romit Roy Choudhury Vikram Srinivasan Nitin Vaidya Haifeng Yu

Organization

General Chairs
Lorenzo Alvisi, University of Texas at Austin, USA (Distributed Computing Track)
Sanjoy Paul, Infosys Technologies, Bangalore, India (Networking Track)

Program Chairs
Networking Track
Vikram Srinivasan (Co-chair), Alcatel-Lucent, India
Nitin Vaidya (Co-chair), University of Illinois at Urbana-Champaign, USA
Romit Roy Choudhury (Vice Chair), Duke University, USA

Distributed Computing Track
Marcos K. Aguilera (Co-chair), Microsoft Research Silicon Valley, USA
Haifeng Yu (Co-chair), National University of Singapore, Singapore

Keynote Chairs
Sajal Das, University of Texas at Arlington and NSF, USA
Prasad Jayanti, Dartmouth College, USA

Tutorial Chairs
Vijay Garg, University of Texas at Austin, USA
Samir Das, Stony Brook University, USA

Publication Chairs
Marcos K. Aguilera, Microsoft Research Silicon Valley, USA
Haifeng Yu, National University of Singapore, Singapore
Vikram Srinivasan, Alcatel-Lucent, India


Publicity Chairs
Luciano Bononi, University of Bologna, Italy
Dipanjan Chakraborty, IBM Research Lab, India
Anwitaman Datta, NTU, Singapore
Rui Fan, Microsoft, USA

Industry Chair
Ajay Bakre, Intel, India

Finance Chair
Santonu Sarkar, Infosys Technologies, India

PhD Forum Chairs
Mainak Chatterjee, University of Central Florida, USA
Sriram Pemmaraju, University of Iowa, Iowa City, USA

Local Arrangements Chairs
Srinivas Padmanabhuni, Infosys Technologies, India
Amitabha Das, Infosys Technologies, India
Debabrata Das, International Institute of Information Technology, Bangalore, India

International Advisory Committee
Prith Banerjee, HP Labs, USA
Prasad Jayanti, Dartmouth College, USA
Krishna Kant, Intel and NSF, USA
Dipankar Raychaudhuri, Rutgers University, USA
S. Sadagopan, IIIT Bangalore, India
Rajeev Shorey, NIIT University, India
Nitin Vaidya, University of Illinois at Urbana-Champaign, USA
Roger Wattenhofer, ETH Zurich, Switzerland

Program Committee: Networking Track Arup Acharya Habib M. Ammari Vartika Bhandari

IBM Research, USA Hofstra University, USA Google, USA


Bharat Bhargava Saad Biaz Luciano Bononi Mainak Chatterjee Mun Choon Chan Carla-Fabiana Chiasserini Romit Roy Choudhury Marco Conti Amitabha Das Samir Das Roy Friedman Marco Gruteser Katherine H. Guo Mahbub Hassan Gavin Holland Sanjay Jha Andreas Kassler Salil Kanhere Jai-Hoon Kim Myungchul Kim Young-Bae Ko Jerzy Konorski Bhaskar Krishnamachari Mohan Kumar Joy Kuri Baochun Li Xiangyang Li Ben Liang Anutosh Maitra Archan Misra Mehul Motani Asis Nasipuri Srihari Nelakuditi Sotiris Nikoletseas Kumar Padmanabh Chiara Petrioli Bhaskaran Raman Catherine Rosenberg Rajashri Roy Bahareh Sadeghi Moushumi Sen Srinivas Shakkottai Wang Wei Xue Yang Yanyong Zhang


Purdue University, USA Auburn University, USA University of Bologna, Italy University of Central Florida, USA National University of Singapore, Singapore Politecnico Di Torino,Italy Duke University, USA University of Bologna, Italy Infosys, India Stony Brook University, USA Technion, Israel Rutgers University, USA Bell Labs, USA University of New South Wales, Australia HRL Laboratories, USA University of New South Wales, Australia Karlstad University, Sweden University of New South Wales, Australia Ajou University, South Korea Information and Communication University, South Korea Ajou University, South Korea Gdansk University of Technology, Poland University of Southern California, USA University of Texas -Arlington, USA IISc, Bangalore, India University of Toronto, Canada Illinois Institute of Technology, USA University of Toronto, Canada Infosys, India Telcordia Lab, USA National University of Singapore, Singapore University of North Carolina at Charlotte, USA University of South Carolina, USA Patras University, Greece Infosys, India University of Rome La Sapienza, Italy IIT Bombay, India University of Waterloo, Canada IIT Kharagpur, India Intel, USA Motorola, India Texas A&M University, USA ZTE, China Intel, USA Rutgers University, USA


Program Committee: Distributed Computing Track Mustaque Ahamad Hagit Attiya Rida A. Bazzi Ken Birman Pei Cao Haowen Chan Wei Chen Gregory Chockler Jeremy Elson Rui Fan Christof Fetzer Pierre Fraigniaud Seth Gilbert Rachid Guerraoui Tim Harris Maurice Herlihy Prasad Jayanti Chip Killian Arvind Krishnamurthy Fabian Kuhn Zvi Lotker Victor Luchangco Petros Maniatis Alessia Milani Yoram Moses Gopal Pandurangan Sergio Rajsbaum C. Pandu Rangan Andre Schiper Stefan Schmid Neeraj Suri Srikanta Tirthapura Sam Toueg Mark Tuttle Krishnamurthy Vidyasankar Hakim Weatherspoon

Georgia Institute of Technology, USA Technion, Israel Arizona State University, USA Cornell University, USA Stanford University, USA Carnegie Mellon University, USA Microsoft Research Asia, China IBM Research Haifa Labs, Israel Microsoft Research, USA Technion, Israel Dresden University of Technology, Germany CNRS and University of Paris Diderot, France National University of Singapore, Singapore EPFL, Switzerland Microsoft Research, UK Brown University, USA Dartmouth College, USA Purdue University, USA University of Washington, USA University of Lugano, Switzerland Ben-Gurion University of the Negev, Israel Sun Labs, Oracle, USA Intel Labs Berkeley, USA Universite Pierre & Marie Curie, France Technion, Israel Brown University and Nanyang Technological University, Singapore Universidad Nacional Autonoma de Mexico, Mexico Indian Institute of Technology Madras, India EPFL, Switzerland T-Labs/TU Berlin, Germany TU Darmstadt, Germany Iowa State University, USA University of Toronto, Canada Intel Corporation, USA Memorial University of Newfoundland, Canada Cornell University, USA


Additional Referees: Networking Track Rik Sarkar Kangseok Kim Maheswaran Sathiamoorthy Karim El Defrawy Sangho Oh Michele Nati Sung-Hwa Lim Yi Gai Tam Vu Young-June Choi Jaehyun Kim Amitabha Ghosh

Giordano Fusco Ge Zhang Sanjoy Paul Aditya Vashistha Bo Yu Sung-Hwa Lim Vijayaraghavan Varadharajan Ying Chen Francesco Malandrino Majed Alresaini Pralhad Deshpande

Additional Referees: Distributed Computing Track John Augustine Ioannis Avramopoulos Binbin Chen Atish Das Sarma Carole Delporte-Gallet Michael Elkin Hugues Fauconnier Danny Hendler Damien Imbs

Maleq Khan Huijia Lin Danupon Nanongkai Noam Rinetzky Nuno Santos Andreas Tielmann Amitabh Trehan Maysam Yabandeh


Table of Contents

The Inherent Complexity of Transactional Memory and What to Do about It (Invited Talk) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hagit Attiya

1

Sustainable Ecosystems: Enabled by Supply and Demand Management (Invited Talk) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chandrakant D. Patel

12

Unclouded Vision (Invited Talk) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jon Crowcroft, Anil Madhavapeddy, Malte Schwarzkopf, Theodore Hong, and Richard Mortier

29

Generating Fast Indulgent Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dan Alistarh, Seth Gilbert, Rachid Guerraoui, and Corentin Travers

41

An Efficient Decentralized Algorithm for the Distributed Trigger Counting Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Venkatesan T. Chakaravarthy, Anamitra R. Choudhury, Vijay K. Garg, and Yogish Sabharwal Deterministic Dominating Set Construction in Networks with Bounded Degree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Roy Friedman and Alex Kogan PathFinder: Efficient Lookups and Efficient Search in Peer-to-Peer Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dirk Bradler, Lachezar Krumov, Max Mühlhäuser, and Jussi Kangasharju

53

65

77

Single-Version STMs Can Be Multi-version Permissive (Extended Abstract) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hagit Attiya and Eshcar Hillel

83

Correctness of Concurrent Executions of Closed Nested Transactions in Transactional Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sathya Peri and Krishnamurthy Vidyasankar

95

Locality-Conscious Lock-Free Linked Lists . . . . . . . . . . . . . . . . . . . . . . . . . . Anastasia Braginsky and Erez Petrank Specification and Constant RMR Algorithm for Phase-Fair Reader-Writer Lock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vibhor Bhatt and Prasad Jayanti

107

119


On the Performance of Distributed Lock-Based Synchronization . . . . . . . Yuval Lubowich and Gadi Taubenfeld

131

Distributed Generalized Dynamic Barrier Synchronization . . . . . . . . . . . . . Shivali Agarwal, Saurabh Joshi, and Rudrapatna K. Shyamasundar

143

A High-Level Framework for Distributed Processing of Large-Scale Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Elzbieta Krepska, Thilo Kielmann, Wan Fokkink, and Henri Bal Affinity Driven Distributed Scheduling Algorithm for Parallel Computations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ankur Narang, Abhinav Srivastava, Naga Praveen Kumar, and Rudrapatna K. Shyamasundar Temporal Specifications for Services with Unboundedly Many Passive Clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shamimuddin Sheerazuddin Relating L-Resilience and Wait-Freedom via Hitting Sets . . . . . . . . . . . . . . Eli Gafni and Petr Kuznetsov

155

167

179 191

Load Balanced Scalable Byzantine Agreement through Quorum Building, with Full Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Valerie King, Steven Lonargan, Jared Saia, and Amitabh Trehan

203

A Necessary and Sufficient Synchrony Condition for Solving Byzantine Consensus in Symmetric Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Olivier Baldellon, Achour Mostéfaoui, and Michel Raynal

215

GoDisco: Selective Gossip Based Dissemination of Information in Social Community Based Overlays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anwitaman Datta and Rajesh Sharma

227

Mining Frequent Subgraphs to Extract Communication Patterns in Data-Centres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maitreya Natu, Vaishali Sadaphal, Sangameshwar Patil, and Ankit Mehrotra On the Hardness of Topology Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H.B. Acharya and M.G. Gouda

239

251

An Algorithm for Traffic Grooming in WDM Mesh Networks Using Dynamic Path Selection Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sukanta Bhattacharya, Tanmay De, and Ajit Pal

263

Analysis of a Simple Randomized Protocol to Establish Communication in Bounded Degree Sensor Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bala Kalyanasundaram and Mahendran Velauthapillai

269


Reliable Networks with Unreliable Sensors . . . . . . . . . . . . . . . . . . . . . . . . . . Srikanth Sastry, Tsvetomira Radeva, Jianer Chen, and Jennifer L. Welch


281

Energy Aware Fault Tolerant Routing in Two-Tiered Sensor Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ataul Bari, Arunita Jaekel, and Subir Bandyopadhyay

293

Scheduling Randomly-Deployed Heterogeneous Video Sensor Nodes for Reduced Intrusion Detection Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Congduc Pham

303

An Integrated Routing and Medium Access Control Framework for Surveillance Networks of Mobile Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nicholas Martin, Yamin Al-Mousa, and Nirmala Shenoy

315

Security in the Cache and Forward Architecture for the Next Generation Internet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G.C. Hadjichristofi, C.N. Hadjicostis, and D. Raychaudhuri

328

Characterization of Asymmetry in Low-Power Wireless Links: An Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Prasant Misra, Nadeem Ahmed, Diethelm Ostry, and Sanjay Jha

340

Model Based Bandwidth Scavenging for Device Coexistence in Wireless LANs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anthony Plummer Jr., Mahmoud Taghizadeh, and Subir Biswas

352

Minimal Time Broadcasting in Cognitive Radio Networks . . . . . . . . . . . . . Chanaka J. Liyana Arachchige, S. Venkatesan, R. Chandrasekaran, and Neeraj Mittal

364

Traffic Congestion Estimation in VANETs and Its Application to Information Dissemination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rayman Preet Singh and Arobinda Gupta

376

A Tiered Addressing Scheme Based on a Floating Cloud Internetworking Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yoshihiro Nozaki, Hasan Tuncer, and Nirmala Shenoy

382

DHCP Origin Traceback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Saugat Majumdar, Dhananjay Kulkarni, and Chinya V. Ravishankar A Realistic Framework for Delay-Tolerant Network Routing in Open Terrains with Continuous Churn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Veeramani Mahendran, Sivaraman K. Anirudh, and C. Siva Ram Murthy Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

394

407

419

Invited Paper: The Inherent Complexity of Transactional Memory and What to Do about It Hagit Attiya Department of Computer Science, Technion [email protected]

Abstract. This paper overviews some of the lower bounds on the complexity of implementing software transactional memory, and explains their underlying assumptions. It discusses how these lower bounds align with experimental results and design choices made in existing implementations, indicating that the transactional approach to concurrent programming must compromise either programming simplicity or scalability. Several contemporary research avenues address the challenge of concurrent programming, for example, optimizing coarse-grained techniques, and concurrent programming with mini-transactions—simple atomic operations on a small number of locations.

1 The TM Approach to Concurrent Programming As anyone with a laptop or an Internet connection (that is, everyone) knows, the multicore revolution is here. Almost any computing appliance contains several processing cores, and the number of cores in servers is in the low teens. With the improved hardware comes the need to harness the power of concurrency, since the processing power of individual cores does not increase. Applications must be restructured in order to reap the benefits of multiple processing units, without paying a costly price for coordination among them. It has been argued that writing concurrent applications is significantly more challenging than writing sequential ones. Surely, there is a longer history of creating and analyzing sequential code, and this is reflected in undergraduate education. Many programmers are mystified by the intricacies of interaction between multiple processes or threads, and the need to coordinate and synchronize them. Transactional memory (TM) has been suggested as a way to deal with the alleged difficulty of writing concurrent applications. In its simplest form, the programmer need only wrap code with operations denoting the beginning and end of a transaction. The transactional memory will take care of synchronizing the shared memory accesses so that each transaction seems to execute sequentially and in isolation. Originally suggested as a hardware platform by Herlihy and Moss [29], TM resurfaced as a software mechanism a couple of years later. The first software implementation of transactional memory [43] provided, in essence, support for multi-word synchronization operations on a static set of data items, in terms of a unary operation (LL/SC), somewhat optimized over prior implementations, e.g., [46, 9]. Shavit and Touitou coined the term software transactional memory (STM) to describe this implementation. M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 1–11, 2011. © Springer-Verlag Berlin Heidelberg 2011


Only when the termination condition was relaxed to obstruction freedom (see Section 2.2) was the first STM handling a dynamic set of data items presented [28]. Work by Rajwar et al., e.g., [37, 42], helped to popularize the TM approach in the programming languages and hardware communities.

2 Formalizing TM

This section outlines how transactional memory can be formally captured. A comprehensive, in-depth treatment can be found in [33]. The model encompasses at least two levels of abstraction: the high level has transactions, each of which is a sequence of operations accessing data items; at the low level, the operations are translated into executions in which a sequence of events apply primitive operations to base objects, containing the data and the meta-data needed for the implementation.

A transaction is a sequence of operations executed by a single process on a set of data items, shared with other transactions. Data items are accessed by read and write operations; some systems also support other operations. The interface also includes try-commit and try-abort operations, with which a transaction requests to commit or abort, respectively. Any of these operations, not just try-abort, may cause the transaction to abort; in this case, we say that the transaction is forcibly aborted. The collection of data items accessed by a transaction is its data set; the items written by the transaction are its write set, and the other items are its read set.

A software implementation of transactional memory (abbreviated STM) provides a data representation for transactions and data items using base objects, and algorithms, specified as primitive operations (abbreviated primitives) on the base objects. These procedures are followed by asynchronous processes in order to execute the operations of transactions. The primitives can be simple reads and writes, but also more sophisticated ones, like CAS or DCAS, typically applied to memory locations, which are the base objects of the implementation. When processes invoke these procedures in an interleaved manner, we obtain executions, in the standard sense of asynchronous distributed computing (cf. [8]).
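The transactional interface just described might be sketched as follows. This is a toy, sequential illustration (the class and method names are mine, not from any real STM): writes are buffered in the write set, and try-commit forcibly aborts the transaction if any item in its read set has meanwhile changed.

```python
class Aborted(Exception):
    """Raised when the TM forcibly aborts a transaction."""

class ToySTM:
    # One version number per data item; commit validates the read set.
    def __init__(self):
        self.values, self.versions = {}, {}

    def begin(self):
        return Txn(self)

class Txn:
    def __init__(self, stm):
        self.stm = stm
        self.read_set = {}    # item -> version observed at first read
        self.write_set = {}   # item -> new value, buffered until commit

    def read(self, item):
        if item in self.write_set:          # read-your-own-writes
            return self.write_set[item]
        self.read_set.setdefault(item, self.stm.versions.get(item, 0))
        return self.stm.values.get(item)

    def write(self, item, value):
        self.write_set[item] = value

    def try_commit(self):
        # Forcible abort if any item read was updated in the meantime.
        for item, ver in self.read_set.items():
            if self.stm.versions.get(item, 0) != ver:
                raise Aborted(item)
        for item, value in self.write_set.items():
            self.stm.values[item] = value
            self.stm.versions[item] = self.stm.versions.get(item, 0) + 1

stm = ToySTM()
t1 = stm.begin()
t1.read("x")                 # x enters t1's read set at version 0
t2 = stm.begin()
t2.write("x", 1)
t2.try_commit()              # commits and bumps x's version
t1.write("x", 2)
try:
    t1.try_commit()          # forcibly aborted: t1's read of x is stale
    aborted = False
except Aborted:
    aborted = True
assert aborted and stm.values["x"] == 1
```

The example shows the essential asymmetry of the interface: a transaction may request to commit, but the TM decides whether the request succeeds.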
Executions consist of configurations, describing a complete state of the system, and events, describing a single step by an individual process, including an application of a single primitive to base objects (possibly several objects, e.g., in the case of DCAS).

The interval of a transaction T is the execution interval that starts at the first event of T and ends at the last event of T. If T does not have a last event in the execution, then the interval of T is the (possibly infinite) execution interval starting at the first event of T. Two transactions overlap if their intervals overlap.

2.1 Safety: Consistency Properties of TM

An STM is serializable if transactions appear to execute sequentially, one after the other [39]. An STM is strictly serializable if the serialization order preserves the order

The Inherent Complexity of Transactional Memory and What to Do about It


of non-overlapping transactions [39]. This notion is called order-preserving serializability in [47], and is the analogue of linearizability [31] for transactions.¹ Opacity [23] further demands that even partially executed transactions, which may later abort, must be serializable (in an order-preserving manner). Opacity also accommodates operations beyond read and write.

While opacity is a stronger condition than serializability, snapshot isolation [10] is a consistency condition weaker than serializability. Roughly stated, snapshot isolation ensures that all read operations in a transaction return the most recent value as of the time the transaction starts; the write sets of concurrent transactions must be disjoint. (Cf. [47, Definition 10.3].)

2.2 Progress: Termination Guarantees for TM

One of the innovations of TM is in allowing transactions not to commit when they are faced with conflicting transactions. This, however, admits trivial implementations in which no progress is ever made. Finding the right balance between nontriviality and efficiency has led to several progress properties. They are first and foremost distinguished by whether locking is accommodated or not.

When locks are not allowed, the strongest requirement (rarely provided) is wait-freedom, namely, that each transaction has to eventually commit. A weaker property ensures that some transaction eventually commits, or that a transaction commits when running by itself. The latter property is called obstruction-freedom [28] (see further discussion in [3]). A lock-based STM (e.g., TL2 [16]) is often required to be weakly progressive [24], namely, a transaction that does not encounter conflicts must commit.

Several lower bounds assume a minimal progress property, ensuring that a transaction terminates successfully if it runs alone, from a situation in which no other transaction is pending. This property encompasses both obstruction freedom and weak progressiveness.
Related definitions [34, 20, 24] further attempt to capture the distinction between aborts that are necessary in order to maintain the safety properties (e.g., opacity) and spurious aborts that are not mandated by the consistency property, and to measure their ratio.

Strong progressiveness [24] ensures that even when there are conflicts, some transaction commits. More specifically, an STM is strongly progressive if a transaction without nontrivial conflicts is not forcibly aborted, and if a set of transactions have nontrivial conflicts on a single item, then not all of them are forcibly aborted. (Recall that a transaction is forcibly aborted when the abort was not requested by a try-abort operation of the transaction, i.e., the abort is in response to a try-commit, read or write operation.) Another definition [40] says that an STM is multi-version (MV)-permissive if a transaction is forcibly aborted only if it is an update transaction that has a nontrivial conflict with another update transaction.

¹ Linearizability, like sequential consistency [36], concerns implementing abstract data structures, and hence involves a single abstraction: from the high-level operations of the data structure to the low-level primitives. It also provides the semantics of the operations, and their expected results at the high level, on the data structure.


Strong progressiveness and MV-permissiveness are incomparable: the former allows a read-only transaction to abort if it has a conflict with an update transaction, while the latter does not guarantee that at least one transaction is not forcibly aborted in case of a conflict.

Strictly speaking, these properties are not liveness properties in the classical sense [35], since they can be checked in finite executions.

2.3 Predicting Performance

There have been some theoretical attempts to predict how well TM implementations will scale, resulting in definitions that postulate behaviors that are expected to yield superior performance.

Disjoint-access parallelism. The most accepted such notion is disjoint-access parallelism, capturing the requirement that unrelated transactions progress independently, even if they occur at the same time. That is, an implementation should not cause two transactions that are unrelated at the high level to simultaneously access the same low-level shared memory.

We explain what it means for two transactions to be unrelated through a conflict graph that represents the relations between transactions. The conflict graph of an execution interval I is an undirected graph, where vertices represent transactions and edges connect transactions that share a data item. Two transactions T1 and T2 are disjoint access if there is no path between the vertices representing them in the conflict graph of their execution intervals; they are strictly disjoint access if there is no edge between these vertices.

Two events contend on a base object o if they both access o, and at least one of them applies a non-trivial primitive to o.² Transactions concurrently contend on a base object o if they have pending events at the same configuration that contend on o.

Property 1 (Disjoint-access parallelism). An STM implementation is disjoint-access parallel if two transactions concurrently contend on the same base object only if they are not disjoint access.
This definition captures the first condition of the disjoint-access parallelism property of Israeli and Rappoport [32], in accordance with most of the literature (cf. [30]). It is somewhat weaker, as it allows two processes to apply a trivial primitive, e.g., a read, to the same base object, even when executing disjoint-access transactions. Moreover, this definition only prohibits concurrent contending accesses, allowing transactions to contend on a base object o at different points of the execution. The original disjoint-access parallelism definition [32] also restricts the impact of concurrent transactions on the step complexity of a transaction. For more precise definitions and discussion, see [7].
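The conflict-graph definitions above can be made concrete with a small sketch (the function names and the example transactions are mine): two transactions are disjoint access iff no path connects them in the conflict graph, and strictly disjoint access iff no edge connects them directly.

```python
from collections import defaultdict

def conflict_graph(data_sets):
    """Vertices are transactions; edges connect transactions sharing an item.

    data_sets maps a transaction name to its set of data items."""
    graph = defaultdict(set)
    txns = list(data_sets)
    for i, t1 in enumerate(txns):
        for t2 in txns[i + 1:]:
            if data_sets[t1] & data_sets[t2]:   # shared data item
                graph[t1].add(t2)
                graph[t2].add(t1)
    return graph

def disjoint_access(graph, t1, t2):
    """True iff no path connects t1 and t2 in the conflict graph."""
    seen, stack = {t1}, [t1]
    while stack:
        t = stack.pop()
        if t == t2:
            return False
        for n in graph[t] - seen:
            seen.add(n)
            stack.append(n)
    return True

sets = {"T1": {"x"}, "T2": {"x", "y"}, "T3": {"y"}, "T4": {"z"}}
g = conflict_graph(sets)
# T1 and T3 share no item, so they are strictly disjoint access (no edge),
# yet they are connected through T2, so they are NOT disjoint access.
assert "T3" not in g["T1"]
assert not disjoint_access(g, "T1", "T3")
assert disjoint_access(g, "T1", "T4")
```

The example makes the gap between the two notions visible: a strict-disjoint-access-parallel STM may not let T1 and T3 contend on a base object, while a (plain) disjoint-access-parallel STM may.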

² A primitive is non-trivial if it may change the value of the object, e.g., a write or CAS; otherwise, it is trivial, e.g., a read.


Invisible reads. It is expected that many typical applications will generate workloads that include a significant portion of read-only transactions. This includes, for example, transactions that search a data structure to find whether it contains a particular data item. Many STMs attempt to optimize read-only transactions and, more generally, the implementation of read operations inside a transaction. By their very nature, read operations, and even more so read-only transactions, need not leave a mark on the shared memory; therefore, it is desirable to avoid writing in such transactions, i.e., to make sure that reads are invisible and, certainly, that read-only transactions do not write at all.

Remark 1. Some authors [15] refer to a transaction as having invisible reads even if it writes, as long as the information written is not sufficiently detailed to supply the exact details of the transaction's data set. (In their words, "the STM does not know which, or even how many, readers are accessing a given memory location.") This behavior is captured by the stronger notion of an oblivious STM [5].

3 Some Lower Bound Results

This section overviews some of the recent work showing the inherent complexity of TM. This includes a few impossibility results showing that certain properties simply cannot be achieved by a TM, and several worst-case lower bounds showing that other properties put a high price on the TM, often in terms of the number of steps that must be performed, or as bounds on the local computation involved. The rest of the section mentions some of these results.

3.1 The Cost of Validation

A very interesting result shows the additional cost of opacity over serializability, namely, of making sure that the values read by a transaction are consistent while it is in progress (and not just at commit time, as done in many database implementations). Guerraoui and Kapalka [23] showed that the number of steps in a read operation is linear in the size of the invoking transaction's read set, assuming that reads are invisible, the STM keeps only a single version of each data item, and the STM is progressive (i.e., it never aborts a transaction unless it conflicts with another pending transaction). In contrast, when only serializability is guaranteed, the values read need only be validated at commit time, leading to significant savings.

3.2 The Consensus Number of TM

It has been shown that lock-based and obstruction-free TMs can solve consensus for at most two processes [22]; that is, their consensus number [26] is 2. An intermediate step shows that such TMs are equivalent to shared objects that fail in a very clean manner [3]. Roughly speaking, this is a consensus object providing a familiar propose operation, allowing a thread to provide an input and wait for a unanimous decision value; however, the propose operation may return a definite fail indication, which ensures that the proposed value will not be decided upon. Intuitively, an aborted transaction corresponds to a propose operation returning false.
To get the full result, further mechanisms are needed to handle the long-lived nature of a transactional memory.
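The shape of such a fail-prone consensus object might be sketched as follows. This is a toy, lock-based illustration of the object's interface, not the construction of [3]; the class and method names are mine. A successful propose corresponds to a committed transaction; the definite fail indication corresponds to a forcible abort, after which the caller knows its value will not be decided and can read the decision.

```python
import threading

class FailProneConsensus:
    """Consensus object whose propose() may return a definite FAIL."""
    FAIL = object()   # sentinel: "your value will certainly not be decided"

    def __init__(self):
        self._lock = threading.Lock()
        self._decided = False
        self._decision = None

    def propose(self, value):
        with self._lock:              # stands in for one atomic transaction
            if self._decided:
                return self.FAIL      # definite fail, like a forcible abort
            self._decision, self._decided = value, True
            return value              # this value is the unanimous decision

    def decision(self):
        return self._decision

c = FailProneConsensus()
assert c.propose(7) == 7
assert c.propose(9) is FailProneConsensus.FAIL
assert c.decision() == 7
```

The sketch shows only the single-shot object; as noted above, the full equivalence needs further mechanisms for the long-lived nature of a TM.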


3.3 Achieving Disjoint-Access Parallelism

Guerraoui and Kapalka [22] prove that obstruction-free implementations of software transactional memory cannot ensure strict disjoint-access parallelism. This property requires transactions with disjoint data sets not to access a common base object. It is stronger than disjoint-access parallelism (Property 1), which allows two transactions with disjoint data sets to access the same base objects, provided they are connected via other transactions. Note that the lower bound does not hold under this more standard notion, as Herlihy et al. [28] present an obstruction-free and disjoint-access parallel STM. For the stronger case of wait-free read-only transactions, the assumption of strict disjoint-access parallelism can be replaced with the assumption that read-only transactions are invisible.

We have proved [7] that an STM cannot be disjoint-access parallel and have invisible read-only transactions that always terminate successfully. A read-only transaction not only has to write, but the number of writes is linear in the size of its read set. Both results hold for strict serializability, and hence also for opacity. With a slight modification of the notion of disjoint-access parallelism, they also hold for serializability and snapshot isolation.

3.4 Privatization

An important goal for STM is to allow certain items to be accessed by simple reads and writes, without paying the overhead of the transactional memory. It has been shown [21] that, in many cases, this cannot be achieved without prior privatization [45, 44], namely, invoking a privatization transaction or some other kind of privatizing barrier [15]. We have recently proved [5] that, unless parallelism (in terms of progressiveness) is greatly compromised or detailed information about non-conflicting transactions is tracked (i.e., the STM is not oblivious), the privatization cost must be linear in the number of items that are privatized.
3.5 Avoiding Aborts

It has been shown [34] that an opaque, strongly progressive STM requires solving an NP-complete problem in local computation, while a weaker, online notion requires visible reads.

4 Interlude: How Well Does TM Work in Practice?

Collectively, the results described above demonstrate that TM faces significant limitations: it cannot provide clean semantics without weakening the consistency semantics or compromising the progress guarantees. The implementations are also significantly limited in their scalability. Finally, it is not clear how expressive the programming idiom they provide is (since their consensus number is only two).

One might argue that these are just theoretical results, which, anyway, (mostly) describe only the worst case, so, in practice, we are just fine. However, while the results


are mostly stated for the worst case, these are often not corner cases, unlikely to happen in practice, but natural cases, representative of typical scenarios. Moreover, it is difficult to design an STM that behaves differently in different scenarios, or to expose these specific scenarios to the programmer through intricate guarantees.

There is evidence that implementation-focused research has also been hitting a similar wall [11]. Design choices made in existing TMs, whether in hardware or in software, compromise either the claimed simplicity of the model (e.g., elastic transactions [19]) or its transparency and generality (e.g., transactional boosting [27]). Alternatively, there are TMs with reduced scalability, weakened progress guarantees, or lower performance.

5 Concurrent Programming in a Post-TM Era

The TM approach "infantilizes" programmers, telling them that the TM will take care of making sure their programs run correctly and efficiently, even in a concurrent setting. Given that this approach may not be able to deliver the promised combination of efficiency and programming simplicity, and must expose many of the complications of consistency or progress guarantees, perhaps we should stop sheltering the programmer from the reality of concurrency? It might be possible to expose a cleaner model of a multi-core system to programmers, while providing them with better methodologies, tools and programming patterns that will simplify the design of concurrent code, without hiding its tradeoffs. It is my belief that a multitude of approaches should be proposed, besides TM, catering to different needs and setups. This section mentions two, somewhat complementary, approaches to alleviating the difficulty of designing concurrent applications.

5.1 Optimizing Coarse-Grain Programming

For small-scale applications, or with a moderate amount of contention for the data, the overhead of managing the memory might outweigh the cost of delays due to synchronization [17]. In such situations, it might be simpler to rely on coarse-grained synchronization, that is, to design applications in which shared data is mostly accessed "in exclusion". This does not mean a return to simplistic programming with critical sections and mutexes. Instead, it recommends the use of novel methods that have several processes compete for the lock and then, to avoid additional contention, have the lock holder carry out all (or many of) the pending operations on the data [25]. For non-locking algorithms, this can be seen as a return to Herlihy's universal construction [26], somewhat optimized to improve memory utilization [13].
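The lock-holder-does-the-work idea of [25] can be sketched as follows. This is a simplified toy of flat combining (the class name is mine): threads publish their operations in a shared list, and whichever thread happens to hold the lock applies all pending operations, so the lock is contended roughly once per batch rather than once per operation. The real algorithm uses a more careful publication-list structure and combining passes.

```python
import threading

class FlatCombiningCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []       # published, not-yet-applied operations
        self._value = 0

    def add(self, delta):
        record = [delta, None]           # [argument, result-when-done]
        self._pending.append(record)     # publish the request
        while record[1] is None:
            # Try to become the combiner; do not block waiting for the lock.
            if self._lock.acquire(blocking=False):
                try:
                    # Combiner: apply *all* pending requests, ours included.
                    while self._pending:
                        r = self._pending.pop()
                        self._value += r[0]
                        r[1] = self._value   # signal completion to the owner
                finally:
                    self._lock.release()
            # Otherwise spin: some current combiner will serve our record,
            # or the lock will become free on a later iteration.
        return record[1]

c = FlatCombiningCounter()
threads = [threading.Thread(target=c.add, args=(1,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert c._value == 8
```

The design point is exactly the one argued above: threads compete for the lock once, and the winner amortizes its cost over all pending operations instead of each thread synchronizing separately.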
5.2 Programming with Mini-transactions

A complementary approach is motivated by the observation that many of the lower bounds rely on the fact that TM must deal with large, unpredictable (dynamic) data sets,


accessed with an arbitrary set of operations, and interleaved with generic calculations (including I/O). What if the TM had to deal only with short transactions, with simple functionality and small, known-in-advance (static) data sets, to which only simple arithmetic, comparison, and memory access operations are applied? My claim is that such mini-transactions could greatly alleviate the burden of concurrent programming, while still allowing efficient implementation.

It is obvious that mini-transactions avoid many of the costs indicated by the lower bounds and complexity results for TM, because they are so restricted. From an implementation perspective, mini-transactions are a design choice that simplifies and improves the performance of TM. Indeed, they are already almost provided by recent hardware TM proposals from AMD [2] and Sun [12]. The support is best-effort in nature, since, in addition to data conflicts, transactions can be aborted for other reasons, for example, TLB misses, interrupts, certain function-call sequences, and instructions like division [38].

Mini-transactions. Mini-transactions are a simple extension of DCAS, or of its extension to kCAS with small values of k, e.g., 3CAS or 4CAS. In fact, mini-transactions are a natural, multi-location variant of the LL/SC pair supported in IBM's PowerPC [1] and DEC's Alpha [18]. A mini-transaction works on a small and, if possible, static data set, and applies simple functionality, without I/O, out-of-core memory accesses, etc. It is supposed to be short, in order to ensure success. Yet, even if all these conditions are satisfied, the application should be prepared to deal with spurious failures, and not violate the integrity of the data.

An important issue is to allow "native" (uninstrumented) access to the locations accessed by mini-transactions, through a clean, implicit mechanism. Thus, they are subject to concerns similar to those arising when privatizing transactional data.

Algorithmic challenges.
Mini-transactions can provide a significant handle on the difficult task of writing concurrent applications, based on our experience of leveraging even the fairly restricted DCAS [6, 4], and on others' experience in utilizing recent hardware TM support [14]. Nevertheless, using mini-transactions still leaves several algorithmic challenges. The first, already discussed above, is the design of algorithms accommodating the best-effort nature of mini-transactions. Another is to deal with their limited arity, i.e., the small data set, in a systematic manner. An interesting approach could be to understand how mini-transactions can support the needs of amorphous data parallelism [41].

Finally, even with convenient synchronization of accesses to several locations, it is still necessary to find ways to exploit the parallelism, by having threads make progress on their individual tasks, without interfering with each other, and while helping each other as necessary. Part of the challenge is to span the full spectrum: from virtually sequential situations in which threads operate almost in isolation from each other, all the way to highly parallel situations, where many concurrent threads should be harnessed to perform work efficiently, rather than slowing progress due to high contention.
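A best-effort mini-transaction in the kCAS style might look as follows. This is a toy simulation (the function name and `memory` dictionary are mine): the lock stands in for hardware atomicity, and injected spurious failures model the best-effort aborts (TLB misses, interrupts) described above, so callers must be written to retry without violating data integrity.

```python
import random
import threading

_mem_lock = threading.Lock()
memory = {}

def k_cas(updates, spurious_rate=0.0):
    """Atomically: if every location holds its expected value, install the
    new values and return True; otherwise return False.

    updates maps location -> (expected_value, new_value). A simulated
    spurious failure may also return False, mirroring best-effort hardware."""
    if random.random() < spurious_rate:
        return False                       # spurious, best-effort abort
    with _mem_lock:                        # stands in for hardware atomicity
        if any(memory.get(loc) != exp for loc, (exp, _) in updates.items()):
            return False                   # real conflict: a value changed
        for loc, (_, new) in updates.items():
            memory[loc] = new
        return True

# Callers are prepared to retry on (possibly spurious) failure.
memory.update({"a": 1, "b": 2, "c": 3})
while not k_cas({"a": (1, 10), "b": (2, 20), "c": (3, 30)}, spurious_rate=0.5):
    pass                                   # spurious failures just force retries
assert (memory["a"], memory["b"], memory["c"]) == (10, 20, 30)
```

Note that the retry loop is correct regardless of why `k_cas` failed: a failed attempt changes no memory, which is exactly the integrity discipline the text asks of applications built on best-effort mini-transactions.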


6 Summary

This paper describes recent research on formalizing transactional memory, and exploring its inherent limitations. It suggests ways to facilitate the design of efficient and correct concurrent applications, in the post-TM era, while still capitalizing on the lessons learned in designing TM, and the wide interest it generated. The least-explored of them is the design of algorithms and programming patterns that accommodate best-effort mini-transactions, in a way that does not compromise safety, and guarantees liveness in an eventual sense.

Acknowledgements. I have benefited from discussions about concurrent programming and transactional memory with many people, but would like to especially thank my Ph.D. student Eshcar Hillel for many illuminating discussions and comments on this paper. Part of this work was done while the author was on sabbatical at EPFL. The author is supported in part by the Israel Science Foundation (grants 953/06 and 1227/10).

References

1. PowerPC Microprocessor Family: The Programming Environment (1991)
2. Advanced Micro Devices, Inc.: Advanced Synchronization Facility - Proposed Architectural Specification, 2.1 edition (March 2009)
3. Attiya, H., Guerraoui, R., Hendler, D., Kuznetsov, P.: The complexity of obstruction-free implementations. J. ACM 56(4) (2009)
4. Attiya, H., Hillel, E.: Built-in coloring for highly-concurrent doubly-linked lists. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 31–45. Springer, Heidelberg (2006)
5. Attiya, H., Hillel, E.: The cost of privatization. In: Lynch, N.A., Shvartsman, A.A. (eds.) Distributed Computing. LNCS, vol. 6343, pp. 35–49. Springer, Heidelberg (2010)
6. Attiya, H., Hillel, E.: Highly-concurrent multi-word synchronization. In: Rao, S., Chatterjee, M., Jayanti, P., Murthy, C.S.R., Saha, S.K. (eds.) ICDCN 2008. LNCS, vol. 4904, pp. 112–123. Springer, Heidelberg (2008)
7. Attiya, H., Hillel, E., Milani, A.: Inherent limitations on disjoint-access parallel implementations of transactional memory. In: SPAA 2009 (2009)
8. Attiya, H., Welch, J.L.: Distributed Computing: Fundamentals, Simulations and Advanced Topics, 2nd edn. Wiley, Chichester (2004)
9. Barnes, G.: A method for implementing lock-free shared-data structures. In: SPAA 1993, pp. 261–270 (1993)
10. Berenson, H., Bernstein, P., Gray, J., Melton, J., O'Neil, E., O'Neil, P.: A critique of ANSI SQL isolation levels. In: SIGMOD 1995, pp. 1–10 (1995)
11. Cascaval, C., Blundell, C., Michael, M., Cain, H.W., Wu, P., Chiras, S., Chatterjee, S.: Software transactional memory: why is it only a research toy? Commun. ACM 51(11), 40–46 (2008)
12. Chaudhry, S., Cypher, R., Ekman, M., Karlsson, M., Landin, A., Yip, S., Zeffer, H., Tremblay, M.: Rock: A high-performance SPARC CMT processor. IEEE Micro 29(2), 6–16 (2009)
13. Chuong, P., Ellen, F., Ramachandran, V.: A universal construction for wait-free transaction friendly data structures. In: SPAA 2010, pp. 335–344 (2010)


14. Dice, D., Lev, Y., Marathe, V., Moir, M., Olszewski, M., Nussbaum, D.: Simplifying concurrent algorithms by exploiting hardware TM. In: SPAA 2010, pp. 325–334 (2010)
15. Dice, D., Matveev, A., Shavit, N.: Implicit privatization using private transactions. In: Transact 2010 (2010)
16. Dice, D., Shalev, O., Shavit, N.: Transactional locking II. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 194–208. Springer, Heidelberg (2006)
17. Dice, D., Shavit, N.: What really makes transactions fast? In: Transact 2006 (2006)
18. Digital Equipment Corporation: Alpha Architecture Handbook (1992)
19. Felber, P., Gramoli, V., Guerraoui, R.: Elastic transactions. In: Lynch, N.A., Shvartsman, A.A. (eds.) Distributed Computing. LNCS, vol. 6343, pp. 93–107. Springer, Heidelberg (2010)
20. Gramoli, V., Harmanci, D., Felber, P.: Towards a theory of input acceptance for transactional memories. In: Baker, T.P., Bui, A., Tixeuil, S. (eds.) OPODIS 2008. LNCS, vol. 5401, pp. 527–533. Springer, Heidelberg (2008)
21. Guerraoui, R., Henzinger, T., Kapalka, M., Singh, V.: Transactions in the jungle. In: SPAA 2010, pp. 263–272 (2010)
22. Guerraoui, R., Kapalka, M.: On obstruction-free transactions. In: SPAA 2008, pp. 304–313 (2008)
23. Guerraoui, R., Kapalka, M.: On the correctness of transactional memory. In: PPoPP 2008, pp. 175–184 (2008)
24. Guerraoui, R., Kapalka, M.: The semantics of progress in lock-based transactional memory. In: POPL 2009, pp. 404–415 (2009)
25. Hendler, D., Incze, I., Shavit, N., Tzafrir, M.: Flat combining and the synchronization-parallelism tradeoff. In: SPAA 2010, pp. 355–364 (2010)
26. Herlihy, M.: Wait-free synchronization. ACM Trans. Program. Lang. Syst. 13(1), 124–149 (1991)
27. Herlihy, M., Koskinen, E.: Transactional boosting: a methodology for highly-concurrent transactional objects. In: PPoPP 2008, pp. 207–216 (2008)
28. Herlihy, M., Luchangco, V., Moir, M., Scherer III, W.N.: Software transactional memory for dynamic-sized data structures.
In: PODC 2003, pp. 92–101 (2003)
29. Herlihy, M., Moss, J.E.B.: Transactional memory: Architectural support for lock-free data structures. In: ISCA 1993 (1993)
30. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann, San Francisco (2008)
31. Herlihy, M.P., Wing, J.M.: Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12(3), 463–492 (1990)
32. Israeli, A., Rappoport, L.: Disjoint-access-parallel implementations of strong shared memory primitives. In: PODC 1994, pp. 151–160 (1994)
33. Kapalka, M.: Theory of Transactional Memory. Ph.D. thesis Nr. 4664, EPFL (2010)
34. Keidar, I., Perelman, D.: On avoiding spare aborts in transactional memory. In: SPAA 2009, pp. 59–68 (2009)
35. Lamport, L.: Proving the correctness of multiprocess programs. IEEE Transactions on Software Engineering SE-3(2), 125–143 (1977)
36. Lamport, L.: How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers C-28(9), 690–691 (1979)
37. Larus, J.R., Rajwar, R.: Transactional Memory. Morgan and Claypool, San Francisco (2007)
38. Moir, M., Moore, K., Nussbaum, D.: The adaptive transactional memory test platform: A tool for experimenting with transactional code for Rock. In: Transact 2008 (2008)
39. Papadimitriou, C.H.: The serializability of concurrent database updates. J. ACM 26(4), 631–653 (1979)


40. Perelman, D., Fan, R., Keidar, I.: On maintaining multiple versions in STM. In: PODC 2010, pp. 16–25 (2010)
41. Pingali, K., Kulkarni, M., Nguyen, D., Burtscher, M., Mendez-Lojo, M., Prountzos, D., Sui, X., Zhong, Z.: Amorphous data-parallelism in irregular algorithms. Technical Report TR-0905, The University of Texas at Austin, Department of Computer Sciences (2009)
42. Rajwar, R., Goodman, J.R.: Transactional lock-free execution of lock-based programs. In: ASPLOS 2002, pp. 5–17 (2002)
43. Shavit, N., Touitou, D.: Software transactional memory. In: PODC 1995, pp. 204–213 (1995)
44. Shpeisman, T., Menon, V., Adl-Tabatabai, A.-R., Balensiefer, S., Grossman, D., Hudson, R.L., Moore, K.F., Saha, B.: Enforcing isolation and ordering in STM. SIGPLAN Not. 42(6), 78–88 (2007)
45. Spear, M.F., Marathe, V.J., Dalessandro, L., Scott, M.L.: Privatization techniques for software transactional memory. Technical Report TR 915, Dept. of Computer Science, Univ. of Rochester (2007)
46. Turek, J., Shasha, D., Prakash, S.: Locking without blocking: making lock based concurrent data structure algorithms nonblocking. In: PODS 1992, pp. 212–222 (1992)
47. Weikum, G., Vossen, G.: Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann, San Francisco (2001)

Sustainable Ecosystems: Enabled by Supply and Demand Management

Chandrakant D. Patel, Fellow, IEEE
Hewlett Packard Laboratories, Palo Alto, CA 94304, USA
[email protected]

Abstract. Continued population growth, coupled with increased per capita consumption of resources, poses a challenge to the quality of life of current and future generations. We cannot expect to meet the future needs of society simply by extending existing infrastructures. The necessary transformation can be enabled by a sustainable IT ecosystem made up of billions of service-oriented client devices and thousands of data centers. The IT ecosystem, with data centers at its core and pervasive measurement at the edges, will need to be seamlessly integrated into future communities to enable need-based provisioning of critical resources. Such a transformation requires a systemic approach based on supply and demand of resources. A supply-side perspective necessitates using local resources of available energy, alongside design and management that minimize the energy required to extract, manufacture, mitigate waste, transport, operate and reclaim components. The demand-side perspective requires provisioning resources based on the needs of the user by using flexible building blocks, pervasive sensing, communications, knowledge discovery and policy-based control. This paper presents a systemic framework for supply-demand management in IT, in particular for building sustainable data centers, and suggests how the approach can be extended to manage resources at the scale of urban infrastructures.

Keywords: available energy, exergy, energy, data center, IT, sustainable, ecosystems, sustainability.

1 Introduction

1.1 Motivation

Environmental sustainability has gained great mindshare. Actions and behaviors are often classified as either "green" or "not green" using a variety of metrics. Many of today's "green" actions are based on products that are already built but classified as "environmentally friendly" based on greenhouse gas emissions and energy consumption in the use phase. Such compliance-time thinking lacks a sustainability framework that could holistically address the global challenges associated with resource consumption.

These resource consumption challenges will stem from various drivers. The world population is expected to reach 9 billion by 2050 [1]. How do we deal with the increasing strain that economic growth is placing on our dwindling natural

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 12–28, 2011.
© Springer-Verlag Berlin Heidelberg 2011


resources? Can we expect to meet the needs of society by solely relying on replicating and extending the existing physical infrastructure to cope with economic and population growth? Indeed, anecdotal evidence of the strain that society is placing on the supply side, the resources used for goods and services, is apparent: rising prices for critical materials, such as copper and steel; the dramatic reduction in output of the Pemex Cantarell oil field in Mexico, one of the largest in the world; and limitations in city-scale waste disposal. Furthermore, a rise in the price of fuel has an inflationary impact that could threaten the quality of life of billions. Thus, the depletion of limited natural resources and increases in the cost of basic goods necessitate new business models and infrastructures that are designed, built and operated using the least possible amount of appropriate materials and energy. The supply side must be considered together with the societal demand for resources.

This paper presents a holistic framework for sustainable design and management. Unlike past work that has mostly focused on operational energy considerations of devices, this contribution weaves lifecycle implications into a broader supply-demand framework. The following are the salient contributions:

Use of available energy (also called exergy) from 2nd law of thermodynamics as a metric for quantifying sustainability. Formulation of a supply-demand framework based on available energy. Application of this framework to IT, in particular, to data centers. Extension of the framework to other ecosystems such as cities.

1.2 Role of the IT Ecosystem

Consider the information technology (IT) ecosystem made up of billions of service-oriented client devices, thousands of data centers and digital print factories. As shown in Figure 1, the IT ecosystem and other human-managed ecosystems—transportation, waste management, power delivery, industrial systems, etc.—draw from a pool of available energy. In this context, IT has the opportunity to change existing business models and deliver a net positive impact with respect to consumption of available energy. To do so, sustainability of the IT ecosystem itself must be addressed holistically.

Given a sustainable IT ecosystem, imagine the scale of impact when billions in growth economies like India utilize IT services to conduct transactions such as purchasing railway tickets, banking, availing healthcare, government services, etc. As the billions board the IT bus, and shun other business models, such as ones that require the use of physical transportation means like an auto-rickshaw to go to the train station to buy tickets, the net reduction in the consumption of available energy can be significant. Indeed, when one overlays a scenario where everything will be delivered as a service, a picture emerges of billions of end users utilizing trillions of applications through a cloud of networked data centers. However, to reach the desired price point where such services will be feasible—especially in emerging economies, where Internet access is desired at approximately US $1 per month—the total cost of ownership (TCO) of the physical infrastructure that supports the cloud will need to be revisited. There are about 81 million Internet connections in India [2]. There has been progress in reducing the cost of access devices [3], but the cost to avail services still needs to be addressed. In this regard, without addressing the cost of data centers—the foundation for services to the masses—scaling to billions of users is not possible.

C.D. Patel

Fig. 1. Consumption of Available Energy

With respect to data centers, prior work has shown that a significant fraction of the TCO comes from the recurring energy consumed in the operation of the data center, and from the burdened capital expenditures associated with the supporting physical infrastructure [4]. The burdened cost of power and cooling, inclusive of redundancy, is estimated to be 25% to 30% of the total cost of ownership in typical enterprise data centers [4]. These power and cooling infrastructure costs may match, or even exceed, the cost of the IT hardware within the data center. Thus, including the cost of IT hardware, over half of the TCO in a typical data center is associated with design and management of the physical infrastructure. For Internet service providers, with thinner layers of software and licensing costs, the physical infrastructure could be responsible for as much as 75% of the TCO.

Conventional approaches to building data centers with multiple levels of redundancy and excessive material—an “always-on” mantra with no regard to service level agreements, and a lack of dynamic provisioning of resources—lead to excessive overprovisioning and cost. Therefore, cost reduction requires an end-to-end approach that delivers least-materials, least-energy data centers. Indeed, contrary to the oft-held view of sustainability as “paying more to be green”, minimizing the overall lifecycle available energy consumption, and thereby building sustainable data centers, leads to the lowest cost data centers.
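The composition of the TCO described above can be sanity-checked with a rough, illustrative calculation; the shares below are mid-range values taken from the figures quoted in this section, not data from [4]:

```python
# Illustrative mid-range shares of data center TCO, from the ranges quoted above.
power_cooling_share = 0.28   # burdened power and cooling: 25% to 30% of TCO
it_hardware_share = 0.28     # "may match, or even exceed" power and cooling

# Physical infrastructure = power/cooling plus IT hardware.
physical_share = power_cooling_share + it_hardware_share
print(round(physical_share, 2))   # over half of the TCO
```

For an Internet service provider with thinner software and licensing layers, the same structure would put the physical-infrastructure share nearer the 75% figure quoted above.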

2 Available Energy or Exergy as a Metric

2.1 Exergy

IT and other ecosystems draw from a pool of available energy as shown in Figure 1. Available energy, also called exergy, refers to energy that is available for performing work [5]. While energy refers to the quantity of energy, exergy quantifies the useful portion (or “quality”) of energy. As an example, in a vehicle, the combustion of a


given mass of fuel such as diesel results in propulsion of the vehicle (useful work done), dissipation of heat energy and a waste stream of exhaust gases at a given temperature. From the first law of thermodynamics, the quantity of energy is conserved in the combustion process, as the sum of the energy in the products equals that in the fuel. However, from the 2nd law of thermodynamics, the usefulness of the energy is destroyed, since there is not much useful work that can be harnessed from the waste streams, e.g., exhaust gases. One can also state that the combustion of fuel resulted in an increase of entropy, or disorder, in the universe—going from a more ordered state in fuel to a less ordered state in waste streams. As all processes result in an increase in entropy, and consequent destruction of exergy due to entropy generation, minimizing the destruction of exergy is an important sustainability consideration. From a holistic supply-demand point of view, one can say that we are drawing from a finite pool of available energy, and minimizing destruction of available energy is key for future generations to enjoy the same quality of life as the current generation.

With respect to making the most of available energy, it is also important to understand and avail opportunities in extracting available energy from waste streams. Indeed, it is instructive to examine the combustion example further to understand the exergy content of waste streams. Classical thermodynamics dictates the upper limit of the work, A, that could be recovered from a heat source, Q (in Joules), at temperature Tj (in Kelvin) emitting to a reservoir at ground state temperature Ta as:

    A = (1 − Ta / Tj) Q                                        (1)

For example, with reference to Equation 1, relative to a temperature of 298 K (25 °C), 1 joule of heat energy at 773 K (500 °C)—such as exhaust gases from a gas turbine—can give 0.614 joules of available energy. Therefore, a waste stream at this high temperature has good availability (61%) that can be harvested. By the same token, the same joule at 323 K (50 °C)—such as exhaust air from a high power server—can only give 0.077 joules of work. While this determines the theoretical maximum Carnot work that can be availed with a perfectly reversible engine, the actual work is much less due to irreversible losses such as friction. Stated simply, the 2nd law of thermodynamics places a limit on the amount of energy that can be converted from one form to another. Similarly, the laws of thermodynamics can be applied to other conversion means, e.g., electrochemical reactions in fuel cells, to estimate the portion of the reaction enthalpy that can be converted to electricity [6].

Traditional methods of design involve the completion of an energy balance based on the conservation theory of the first law of thermodynamics. Such a balance can provide the information necessary to reduce thermal losses or enhance heat recovery, but an energy analysis fails to account for degradation in the quality of energy due to irreversibilities predicted by the second law of thermodynamics. Thus, an approach based on the second law of thermodynamics is necessary for analyzing available energy or exergy consumption across the lifecycle of a product—from “cradle to cradle”. Furthermore, it can also be used to create the analytics necessary to run operations that minimize destruction of exergy and create inference analytics that can enable need-based provisioning of resources. Lastly, exergy analysis is important to


determine the value of the waste stream and tie it to an appropriate process that can make the most of it. For example, the value of converting exhaust heat energy to electrical energy using a thermo-electric conversion process may apply in some cases, but not in others, when one takes into account the exergy required to build and operate the thermo-electric conversion means.

2.2 Exergy Chain in IT

Electrical energy is produced from conversion of energy from one form to another—a common chain starts with converting the chemical energy in the fuel to thermal energy from the combustion of fuel, to mechanical energy in a rotating physical device, to electrical energy from a magnetically based dynamo. Alternatively, available energy in water—such as potential energy at a given height in a dam—can be converted to mechanical energy and then to electrical energy. The electrical energy is 100% available. However, as electrical energy is transmitted and distributed from the source to the point of use, losses along the way in transmission and distribution lead to destruction of availability. The source of power for most data centers (i.e., a thermal power station) operates at an efficiency in the neighborhood of 35% to 60% [7]. Transmission and distribution losses can range from 5% to 12%. System-level efficiency in the data center power delivery infrastructure (i.e., from building to chip) can range from 60% to 85% depending on the component efficiency and load. Around 80% is typical for a fully loaded state-of-the-art data center. Overall, out of every watt generated at the source, only about 0.3 W to 0.4 W is used for computation. If the generation cycle itself, as well as the overhead of the data center infrastructure (i.e., cooling), is taken into account, the coal-to-chip power delivery efficiency will be around 5% to 12%. In addition to the consumption of exergy in operation, the material within the data center has exergy embedded in it.
The embedded exergy stems from the exergy required to extract, manufacture, mitigate waste, and reclaim the material. Exergy is also embedded in IT as a result of direct use of water (for cooling) and indirect use of water (for production of parts, electricity, etc.). Water, too, can be represented using exergy. As an example, assuming nature desalinates water and there is sufficient fresh water available from the natural cycle, one can represent the exergy embedded in water as a result of distribution (exergy required to pump) and treatment (exergy required to treat waste water). On average, treatment and distribution of a million gallons of surface water requires 1.5 MWh of electrical energy. Similarly, treatment of a million gallons of waste water consumes 2.5 MWh of electrical energy [8].
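The availability figures in this section follow directly from Eq. 1. The short script below reproduces them; the coal-to-chip chain uses mid-range values of the efficiency ranges quoted above, chosen for illustration:

```python
def carnot_availability(t_source_k, t_ambient_k=298.0):
    """Fraction of a joule of heat at t_source_k recoverable as work (Eq. 1)."""
    return 1.0 - t_ambient_k / t_source_k

# Gas turbine exhaust at 500 C (773 K): a high-grade waste stream.
print(round(carnot_availability(773.0), 3))   # 0.614
# Server exhaust air at 50 C (323 K): a low-grade waste stream.
print(round(carnot_availability(323.0), 3))   # 0.077

# Mid-range coal-to-chip delivered fraction from the chain quoted above:
# generation efficiency x (1 - T&D loss) x building-to-chip delivery efficiency.
delivered = 0.45 * (1.0 - 0.08) * 0.80
print(round(delivered, 2))   # about 0.33 W used per watt generated at the source
```

The last figure is consistent with the statement that only about 0.3 W to 0.4 W of every watt generated at the source is used for computation.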

3 Supply Side and Demand Side Management

3.1 Architectural Framework

In order to build sustainable ecosystems, the following systemic framework articulates the management of the supply and demand sides of available energy based on the needs of the users:

• On the supply side:
  o minimizing the exergy required to extract, manufacture, mitigate waste, transport, operate and reclaim components;
  o design and management using local sources of available energy to minimize the destruction of exergy in transmission and distribution, e.g., dissipation in transmission; and,
  o taking advantage of exergy in the waste streams, e.g., exhaust heat from a turbine.
• On the demand side:
  o minimizing the consumption of exergy by provisioning resources based on the needs of the user, by using flexible building blocks, pervasive sensing, communications, knowledge discovery and policy-based control.

Sustainable ecosystems, given the supply and demand side definitions above, are then built on delivering to the needs of the user. The needs of the user are derived from the service level agreement (SLA) and decomposed into lower-level metrics that can be applied in the field to enable integrated management of supply and demand. The balance of the paper steps through the framework by examining lifetime exergy consumption in IT, evolving a supply-demand framework for data centers, and closing by extending the framework to other ecosystems.

3.2 Quantifying Lifetime Exergy Consumption

As noted earlier, exergy or available energy, stemming from the second law of thermodynamics, fuses information about materials and energy use into a single meaningful measure. It estimates the maximum work in Joules that could theoretically have been extracted from a given amount of material or energy. By expressing a given system in terms of its lifetime exergy consumption, it becomes possible to remove dependencies on the type of material or the type of energy (heat, electricity, etc.) consumed. Therefore, given the lifecycle of a product, as shown in Figure 2, one can now create an abstract information plane that can be commonly applied across any arbitrary infrastructure. Lifecycle design then implies inputting the entire supply chain from “cradle to cradle” to account for exergy consumed in extraction, manufacturing, waste mitigation, transportation, operation and reclamation. From a supply side perspective, designers can focus on minimizing lifetime exergy consumption through de-materialization, material choices, transportation, process choices, etc. across the lifecycle. The design toolkit requires a repository of available energy consumption data for various materials and processes. With respect to IT, the following provides an overview of the salient “hotspots” discerned using an exergy based lifetime analysis [9]:

• For service-oriented access devices such as laptops, given a typical residential usage pattern, the lifetime operational exergy consumption is 20–30% of the total exergy consumed, while the rest is embedded (exergy consumed in extraction, manufacturing, transportation, reclamation).
  o Of the 70–80% embedded lifetime exergy consumption, the display is a big component.

Fig. 2. Lifecycle of a product

• For data centers, for a given server, the lifetime operational exergy consumption is about 60% to 80% of the total lifetime exergy consumption [10].
  o The large operational component stems from high electricity consumption in the IT equipment and the data center level cooling infrastructure [11][12].

From a strategic perspective, for handhelds, laptops and other forms of access devices, reducing the embedded exergy is critical. And, in order to minimize embedded exergy, least-exergy process and material innovations are important. As an example, innovations in display technologies can reduce the embedded footprint of laptops and handhelds. On the other hand, for data centers, it is important to devise an architecture that focuses on minimizing the electricity (100% available energy) consumed in operation. The next section presents a supply-demand based architecture for peak exergetic efficiency associated with the synthesis and operation of a data center. A sustainable data center—built on lifetime exergy considerations, flexible and configurable resource micro-grids, pervasive sensing, communications and aggregation of sensed data, knowledge discovery and policy-based autonomous control—is proposed.

3.3 Synthesis of a Sustainable Data Center Using Lifetime Exergy Analysis

Business goals drive the synthesis of a data center. For example, assuming a million users subscribing to a service at US $1/month, the expected revenue would be US $12 million per year. Correspondingly, it may be desirable to limit the infrastructure TCO (excluding software licenses and personnel) to 1/5th of that amount, or roughly US $2.4 million per year. A simple understanding of the impact of power can be had by estimating the cost implications in areas where low-cost utility power is not available and diesel generators are used as a primary source. For example, if the data center supporting the million users consumes 1 MW of power at 1 W per user, or 8.76 million kWh per year, the cost of just powering it with diesel at approximately $0.25 per kWh will be about $2.2 million per year. Thus, growth economies strained on the


resource side and reliant on local power generation with diesel will be at a great disadvantage. Having understood such constraints, the data center must meet the target total cost of ownership (TCO) and uptime, based on the service level agreements, for a variety of workloads. Data center synthesis can be enabled by using a library of IT and facility templates to create a variety of design options. Given the variety of supply-demand design options, the key areas of analysis for sustainability and cost become:

1. Lifecycle analysis to evaluate each IT-facility template and systematically dematerialize to drive towards a least lifetime embedded exergy design and lowest capital outlay, e.g., systematically reduce the number of IT, power and cooling units, remove excessive material in the physical design, etc.
   • Overall exergy analysis should also consider exergy in waste streams, and the locality of the data center to avail supply side resources (power and cooling).
2. Performance modeling toolkit to determine the performance of the ensemble and estimate the consumption of exergy during operation.
3. Reliability modeling toolkit to discover and design for various levels of uptime within the data center.
4. Performance modeling toolkit to determine the ability to meet the SLAs for a given IT-facility template.
5. TCO modeling to estimate the deviation from the target TCO.

Combining all the key elements noted above enables a structure for analysis of a set of applicable data center design templates for a given business need. The data center can be benchmarked in terms of performance per Joule of lifetime available energy (exergy) destroyed. The lifetime exergy can be incorporated in a total cost of ownership model that includes software, personnel and licenses to determine the total cost of ownership of a rack [4], and used to price a cloud business model such as “infrastructure as a service”.
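The business arithmetic motivating this section is easy to verify; the figures below are taken directly from the text:

```python
users = 1_000_000
revenue = users * 1 * 12              # US $1 per user per month
tco_target = revenue / 5              # infrastructure budget: 1/5th of revenue

power_kw = users * 1 / 1000           # 1 W per user -> 1 MW
energy_kwh = power_kw * 8760          # hours in a year
diesel_cost = energy_kwh * 0.25       # $0.25/kWh for primary diesel generation

print(tco_target)     # 2400000.0
print(energy_kwh)     # 8760000.0
print(diesel_cost)    # 2190000.0
```

Diesel fuel alone would consume roughly 90% of the US $2.4 million infrastructure budget, which is the point of the comparison: the physical infrastructure, not software, dominates the cost of serving users at this price point.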
3.4 Demand Side Management of Sustainable Data Centers

On the demand side, it is instructive to trace the energy flow in a data center. Electrical energy, all of which is available to do useful work, is transferred to the IT equipment. Most of the available electrical energy drawn by the IT hardware is dissipated as heat energy, while useful work is availed through information processing. The amount of useful work is not proportional to the power consumed by the IT hardware: even in idle mode, IT hardware typically consumes more than 50% of its maximum power consumption [32]. As noted in [32], it is important to devise energy-proportional machines. However, it is also important to increase the utilization of IT hardware and reduce the total amount of required hardware; [31] presents such a resource management architecture. A common approach to increasing utilization is executing applications in virtual machines and consolidating the virtual machines onto fewer, larger servers [33]. As shown in [15], workload consolidation has the potential to reduce the IT power demand significantly.
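A simple linear power model makes the idle-power and consolidation argument concrete. The 50% idle fraction follows the figure cited above from [32]; the 300 W peak power and the workload numbers are hypothetical, chosen only for illustration:

```python
def server_power_w(utilization, p_max=300.0, idle_fraction=0.5):
    """Linear server power model: fixed idle power plus a utilization-proportional term."""
    p_idle = idle_fraction * p_max
    return p_idle + (p_max - p_idle) * utilization

# Ten lightly loaded servers at 10% utilization each...
spread_w = 10 * server_power_w(0.10)
# ...versus the same aggregate work consolidated onto two servers at 50%
# utilization, with the remaining eight powered off.
consolidated_w = 2 * server_power_w(0.50)
print(round(spread_w), round(consolidated_w))   # 1650 450
```

Because the idle term dominates at low utilization, consolidation cuts the IT power demand by well over half in this sketch, consistent with the potential reported in [15].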


Next, additional exergy is used to actively transfer the heat energy from the chip to the external ambient. Not all of the exergy delivered to the cooling equipment to remove heat is used to effect the heat transfer. While a fraction of the electrical energy provided to a blower or pump is converted to flow work (the product of pressure in N/m² and volume flow in m³/s), and likewise a portion of the electrical energy applied to a compressor is converted to thermodynamic work (to reduce the temperature of the data center), the bulk of the exergy provided is destroyed due to irreversibility. Therefore, in order to build an exergy efficient system, the mantra for demand side management in the data center becomes one of allocation of IT (compute, networking and storage), power and cooling resources based on need, with the following salient considerations:

• Decompose SLAs into Service Level Objectives (SLOs):
  o based on the SLOs, allocate appropriate IT resources while meeting the performance and uptime requirements [28][29][30];
  o account for spatial and temporal efficiencies and redundancies associated with thermo-fluids behavior in a given data center based on heat loads and cooling boundary conditions [14].
• Consolidate workloads while taking into account the spatial and temporal efficiencies noted above, e.g., place critical workloads in “gold zones” of data centers, which have inherent redundancies due to the intersection of fluid flows from multiple air conditioning units, and turn off or scale back power to IT equipment not in use [13][14][31].
• Enable the data center cooling equipment to scale based on the heat load distribution in the data center [11].

Dynamic implementation of the key points described above can result in better utilization of resources, reduction of active redundant components and reduction of electrical power consumption by half [13][15]. As an example, a data center designed for 1 MW of power at maximum IT load can run at up to 80% capacity with workload consolidation and dynamic control. The balance of 200 kW can be used for cooling and other support equipment. Indeed, besides availing a failover margin by operating at 80%, the coefficient of performance of the power and cooling ensemble is often optimal at about 80% loading, given the efficiency curves of UPSs and mechanical equipment such as blowers, compressors, etc.

3.5 Coefficient of Performance of the Ensemble

Figure 3 shows a schematic of energy transfer in a typical air-cooled data center through flow and thermodynamic processes. Heat is transferred from the heat sinks on a variety of chips—microprocessors, memory, etc.—to the cooling fluid in the system; e.g., driven by fans, air as a coolant enters the system, undergoes a temperature rise based on the mass flow, and is exhausted out into the room. Fluid streams from different servers undergo mixing and other thermodynamic and flow processes in the exhaust area of the racks. As an example, for air-cooled servers and racks, the dominant irreversibilities that lead to destruction of exergy arise from cold and hot air streams mixing and the mechanical inefficiency of air moving devices. These streams (or some fraction of them) flow back to the modular computer room air conditioning units


(CRACs) and transfer heat to the chilled water (or refrigerant) in the cooling coils. Heat transferred to the chilled water at the cooling coils is transported to the chillers through a hydronics network. The coolant in the hydronics network, water in this case, undergoes a pressure drop and heat transfer until it loses heat to the expanding refrigerant in the evaporator coils of the chiller. The heat extracted by the chiller is dissipated through the cooling tower. Work is added at each stage to change the flow and thermodynamic state of the fluid. While this example shows a chilled water infrastructure, the problem definition and analysis can be extended to other forms of cooling infrastructure.

Fig. 3. Energy Flow in the IT stack – Supply and Demand Side

Development of a performance model at each stage in the heat flow path can enable efficient cooling equipment design, and provide a holistic view of operational exergy consumption from the chips to the cooling tower. The performance model should be agnostic and applicable to an ensemble of components for any environmental control infrastructure. [16] proposes such an operational metric to quantify the performance of the ensemble from chips to cooling tower. The metric, called the coefficient of performance of the ensemble, COPG, builds on the thermodynamic metric called coefficient of performance [16]. Maximizing the coefficient of performance of the ensemble leads to minimization of the exergy required to operate the cooling equipment.

In Figure 3, the systems—such as processor, networking and storage blades—are modeled as “exergo-thermo-volumes” (ETV), an abstraction to represent the lifetime exergy consumption of the IT building blocks and their cooling performance [17][18]. The thermo-volumes portion represents the coolant volume flow and the resistance to the flow, characterized by volume flow (V̇) and pressure drop (ΔP) respectively, to effect heat energy removal from the ETVs. The product of pressure drop (ΔP in N/m²) and volume flow (V̇ in m³/s) determines the flow work required to move a given coolant (air here) through the given IT building block represented as an ETV. The minimum coolant volume flow (V̇) required for a given temperature rise through the ETV, shown by a dashed line in Fig. 3, can be determined from the energy equation (Eq. 2):

    Q̇ = ṁ Cp (Tout − Tin),  where ṁ = ρ V̇                    (2)

where Q̇ is the heat dissipated in Watts, ṁ is the mass flow in kg/s, ρ is the density in kg/m³ of the coolant (air in this example), Cp is the specific heat capacity of air (J/kg-K), and Tin and Tout represent the inlet and outlet temperatures of the air. As shown in Equation 3, the electrical power (100% available energy) required by the blower, Wb, is the ratio of the calculated flow work to the blower wire-to-air efficiency, ζb. The blower characteristic curves show the efficiency (ζb) and are important to understand the optimal capacity at the ensemble level.

    Wb = (ΔPetv × V̇etv) / ζb                                  (3)
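Equations 2 and 3 can be applied back to back to size the airflow and blower power for a single ETV. The numbers below (heat load, temperature rise, air properties, pressure drop and wire-to-air efficiency) are illustrative assumptions, not values from the paper:

```python
def min_volume_flow(q_w, dt_k, rho=1.15, cp=1005.0):
    """Eq. 2: minimum volume flow (m^3/s) for heat load q_w (W) and air temperature rise dt_k (K)."""
    m_dot = q_w / (cp * dt_k)   # mass flow, kg/s
    return m_dot / rho

def blower_power(dp_pa, v_dot, zeta_b=0.30):
    """Eq. 3: blower electrical power (W) = flow work / wire-to-air efficiency."""
    return dp_pa * v_dot / zeta_b

v = min_volume_flow(q_w=10_000.0, dt_k=15.0)   # 10 kW of heat, 15 K allowed rise
w = blower_power(dp_pa=250.0, v_dot=v)         # 250 N/m^2 pressure drop
print(round(v, 3), round(w, 1))                # 0.577 480.7
```

Note how the modest wire-to-air efficiency (30% assumed here) more than triples the electrical power relative to the pure flow work, which is exactly the kind of irreversibility the ensemble metric is meant to expose.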

The total heat load of the data center is assumed to be a direct summation of the power delivered to the computational equipment via UPSs and PDUs. Extending the coefficient of performance (COP) to encompass the power required by cooling resources in the form of flow and thermodynamic work, the ratio of the total heat load to the power consumed by the cooling infrastructure is defined as:

    COP = Total Heat Dissipation / (Flow Work + Thermodynamic Work of Cooling System)
        = Heat Extracted by Air Conditioners / Net Work Input                        (4)

The ensemble COP is then represented as shown below; the reader is referred to [16] for details:

    COPG = Qdatacenter / (Σk Wsystem + Σl Wblower + Σm Wpump + Wcompressor + Wcoolingtower)    (5)
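A minimal implementation of Eq. 5 follows; the component work values are hypothetical, chosen only to be consistent with the 1 MW, 80%-IT-load scenario discussed in Section 3.4:

```python
def cop_ensemble(q_datacenter_w, w_system, w_blower, w_pump,
                 w_compressor, w_coolingtower):
    """Eq. 5: COP_G = data center heat load over total cooling-path work.
    w_system, w_blower and w_pump are lists of per-unit work terms (W)."""
    total_work = (sum(w_system) + sum(w_blower) + sum(w_pump)
                  + w_compressor + w_coolingtower)
    return q_datacenter_w / total_work

cop_g = cop_ensemble(
    q_datacenter_w=800_000,     # 800 kW of IT heat load
    w_system=[40_000],          # server fans
    w_blower=[60_000],          # CRAC blowers
    w_pump=[30_000],            # hydronics pumps
    w_compressor=60_000,        # chiller compressor
    w_coolingtower=10_000,
)
print(cop_g)   # 4.0
```

With 200 kW of total cooling-path work for 800 kW of heat removed, the ensemble COP is 4; dynamic control aims to hold the ensemble near the peak of its efficiency curve rather than letting individual components run far from their optimal load.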

3.6 Supply Side of a Sustainable Data Center Design

The supply side motivation to design and manage the data center using local sources of available energy is intended to minimize the destruction of exergy in distribution. Besides reducing exergy loss in distribution, a local micro-grid [19] can take advantage of resources that might otherwise remain unutilized, and also presents an opportunity to use the exergy in waste streams. Figure 4 shows a power grid with solar and methane based generation at a dairy farm. The methane is produced by anaerobic digestion of manure from dairy cows [20]. The use of biogas from manure is well known and has been practiced all over the world [7]. The advantage of co-location [20]


stems from the use of heat energy exhausted by the server racks—one Joule of which has a maximum theoretical available energy of 0.077 J at 323 K (50 °C)—to enhance methane production, as shown in Figure 4. The hot water from the data center is circulated through the “soup” in the digester. Furthermore, [20] also suggests the use of available energy in the exhaust stream of the electric generator to drive an adsorption refrigeration cycle to cool the data center.

Fig. 4. Biogas and solar electric

Thus, multiple locally sourced options for power—wind, sun, biogas and natural gas—can power a data center. And high and low grade exergy in waste streams, such as exhaust gases, can be utilized to drive other systems. Indeed, cooling for the data center ought to follow the same principles—a cooling grid made up of local sources. A cooling grid can be made up of ground coupled loops to dissipate heat into the ground, and can use outside air (Figure 3) when it is at an appropriate temperature and humidity to cool the data center.

3.7 Integrated Supply-Demand Management of Sustainable Data Centers

Figure 5 shows the architectural framework for integrated supply-demand management of a data center. The key components of the data center—IT (compute, networking and storage), power and cooling—have five key horizontal elements. The foundational design elements of the data center are lifecycle design using exergy as a measure, and flexible micro-grids of power, cooling and IT building blocks. The micro-grids give the integrated manager the ability to choose between multiple supply side sources of power, multiple supply side sources of cooling, and multiple types of IT hardware and software. The flexibility in power and cooling provides the ability to set power levels of IT systems and the ability to vary cooling (speed of the blowers, etc.). The design flexibility in IT comes from an intelligent scheduling framework, multiple power states and virtualization [15].
On this design foundation, the management layers are sensing and aggregation, knowledge discovery and policy based control.

Fig. 5. Architectural framework for a Sustainable Data Center (layers, bottom to top: Lifetime Based Design; Scalable, Configurable Resource Micro-grids; Pervasive Sensing; Knowledge Discovery & Visualization; Policy Based Control—spanning IT, Power and Cooling)

At runtime, the integrated IT-facility manager maintains the run-time status of the IT and facility elements of the data center. A lower-level facility manager collects physical, environmental and process data from the racks, room, chillers, power distribution components, etc. Based on the higher-level requirements passed down by the integrated manager, the facility management system creates low-level SLAs for the operation of power and cooling devices, e.g., translating high-level energy efficiency goals to lower-level temperature and utilization levels for facility elements to guarantee SLAs. The integrated manager has a knowledge discovery and visualization module with data analytics for monitoring lifetime reliability, availability and downtimes for preventive maintenance. It has modules that provide insights that are otherwise not apparent at runtime, e.g., temporal data mining of facility historical data. As an example, data mining techniques have been explored for more efficient operation of an ensemble of chillers [23][34]. In [23], operational patterns (or motifs) are mined in historical data pertaining to an ensemble of water- and air-cooled chillers. These patterns are characterized in terms of their COPG, thus allowing comparison in terms of operational energy efficiency.

At the control level in Figure 5, a variety of approaches can be taken. As an example, in one approach, the cooling controller maintains dynamic control of the facility infrastructure (including CRACs, UPSs, chillers, supply side power, etc.) at levels determined by the facility manager to optimize the COPG, e.g., by providing the requisite air flow to the racks to maintain the temperature at the inlet of the racks between 25 °C and 30 °C [11][12][14][21].
While exercising dynamic cooling control through the facility manager, the controller also provides information to the IT manager to consolidate workloads to optimize the performance of the data center; e.g., the racks are ranked based on thermo-fluids efficiency at a given time, and the ranking is used in workload placement [15]. The IT equipment not in use is scaled down by the integrated manager [15]. Furthermore, in order to reduce the redundancy in the data center, working in conjunction with the IT and facility managers, the integrated manager uses virtualization and power scaling as flexible means to mitigate failures, e.g., air conditioner failures [12][14][15]. Based on past work, the total energy consumed—in power and cooling—with these demand management techniques would be half that of state-of-the-art designs. Coupling the demand side management with the supply side options from the local

Sustainable Ecosystems: Enabled by Supply and Demand Management


power grid, and the local cooling grid, opens up a completely new approach to integrated supply-demand side management [24] and can lead to a "net zero" data center.
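The demand-side placement idea of [15], ranking racks by thermo-fluids efficiency and consolidating work onto the best of them, can be sketched as follows; the rack data, capacity units and greedy policy are illustrative assumptions, not the actual system.

```python
# Hedged sketch of the demand-side idea in [15]: rank racks by a hypothetical
# thermo-fluids efficiency score, consolidate workload onto the most efficient
# racks first, and report which racks can be scaled down.

def place_workload(racks, demand):
    """racks: list of (name, efficiency, capacity); demand: total units needed.

    Returns (placement dict, racks eligible for scale-down).
    """
    ranked = sorted(racks, key=lambda r: r[1], reverse=True)  # best efficiency first
    placement, remaining = {}, demand
    for name, _eff, cap in ranked:
        if remaining <= 0:
            break
        take = min(cap, remaining)
        placement[name] = take
        remaining -= take
    idle = [name for name, _e, _c in ranked if name not in placement]
    return placement, idle

racks = [("rack-A", 0.9, 40), ("rack-B", 0.6, 40), ("rack-C", 0.8, 40)]
placement, idle = place_workload(racks, 60)
# consolidates onto rack-A and rack-C; rack-B becomes a scale-down candidate
```

In the integrated architecture, the idle list is what the integrated manager would act on when scaling down unused IT equipment.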

4 Applying Supply-Demand Framework to Other Ecosystems

In previous sections, the supply-demand framework was applied to devising least-lifetime-exergy data centers. Sustainable IT can now become IT for sustainability, enabling need-based provisioning of resources—power, water, waste, transportation, etc.—at the scale of cities, and can thus deliver a net positive impact by reducing the consumption and depletion of precious Joules of available energy. Akin to the sustainable data center, the foundation of the "Sustainable City" or "City 2.0" is comprehensive lifecycle design [25][26]. Unlike previous generations, where cities were built predominantly for the cost and functionality desired by inhabitants, sustainable cities will require a comprehensive life-cycle view, where systems are designed not just for operation but for optimality across resource extraction, manufacturing and transport, operation, and end-of-life. The next distinction within sustainable cities will arise in the supply-side resource pool. The inhabitants of sustainable cities are expected to desire on-demand, just-in-time access to resources at affordable cost. Instead of following a centralized production model with large distribution and transmission networks, a more distributed model is proposed: augmentation of existing centralized infrastructure with local resource micro-grids. As shown for the sustainable data center, there is an opportunity to exploit locally available resources to create local supply-side grids made up of multiple local sources, e.g., power generation by photo-voltaic cells on roof tops, or utilization of the exergy available in city waste streams, such as municipal waste and sewage, together with natural gas fired turbines whose own waste streams are fully utilized. 
Similarly, for other key verticals such as water, there is an opportunity to leverage past experience to build water micro-grids using local sources, e.g., harvesting rain water to charge local man-made reservoirs and underground aquifers. Indeed, past examples such as the Amber Fort in Jaipur, Rajasthan, India show such considerations in arid regions [27].

[Figure 6 depicts the city verticals (Electricity, Water, Transport, Waste) spanned, from design through management, by five layers: Lifetime Based Design; Scalable, Configurable Resource Micro-grids; Pervasive Sensing; Knowledge Discovery & Visualization; and Policy Based Control.]

Fig. 6. Architectural Framework for a Sustainable City



C.D. Patel

As shown in Figure 6, having constructed lifecycle-based physical infrastructures consisting of configurable resource micro-grids, the next key element is a pervasive sensing layer. Such a sensing infrastructure can generate data streams pertaining to the current supply and demand of resources emanating from disparate geographical regions, their operational characteristics, performance and sustainability metrics, and the availability of transmission paths between the different micro-grids. The great strides made in building high-density, small-lifecycle-footprint IT storage can enable archival storage of the aggregated data about the state of each micro-grid. Sophisticated data analysis and knowledge discovery methods can be applied to both streaming and archival data to infer trends and patterns, with the goal of transforming the operational state of the systems towards least-exergy operations. The data analysis can also enable construction of models using advanced statistical and machine learning techniques for optimization, control and fault detection. Intelligent visualization techniques can provide high-level indicators of the 'health' of each system being monitored. The analysis can enable end-of-life replacement decisions, e.g., when to replace pumps in a water distribution system to make the most of the lifecycle. Lastly, while a challenging task, given the flexible and configurable resource pools, pervasive sensing, data aggregation and knowledge discovery mechanisms, an opportunity exists to devise a policy-based control system. As an example, given a sustainability policy, upstream and downstream pumps in a water micro-grid can operate to maintain a balance of demand and supply.
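As a sketch of that final example, a policy-based pump controller for a water micro-grid might balance supply against demand while enforcing a sustainability policy; all quantities, names and the simple policy below are hypothetical assumptions.

```python
# Illustrative sketch of policy-based control in a water micro-grid: the pump
# supply rate tracks demand while respecting a sustainability policy (a cap on
# pumping energy and conservation of a depleted reservoir). All quantities
# are hypothetical.

ENERGY_CAP = 100.0   # policy: max pumping power (kW, assumed)
KW_PER_LPS = 2.0     # assumed pump energy cost per litre/second delivered

def pump_setpoint(demand_lps: float, reservoir_frac: float) -> float:
    """Choose a supply rate (litres/second) for the current control step.

    Meets demand when the reservoir is healthy, throttles when it is low,
    and never exceeds the policy's energy cap.
    """
    supply = demand_lps
    if reservoir_frac < 0.2:            # conserve a depleted reservoir
        supply = 0.5 * demand_lps
    return min(supply, ENERGY_CAP / KW_PER_LPS)   # enforce the energy policy
```

Under these assumed numbers, a 30 L/s demand with a healthy reservoir is met in full, while the same demand on a nearly empty reservoir is halved by policy.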

5 Summary and Conclusions

This paper presented a supply-demand framework to enable sustainable ecosystems and suggested that a sustainable IT ecosystem, built using such a framework, can enable IT to drive sustainability in other human-managed ecosystems. With respect to the IT ecosystem, an architecture for a sustainable data center composed of three key components—IT, power and cooling—and five key design and management elements striped across the verticals was presented (Figure 5). The key elements—lifecycle design, scalable and configurable resource micro-grids, sensing, knowledge discovery and policy-based control—enable the supply-demand management of the key components. Next, as shown in Figure 6, an architecture for "sustainable cities" built on the same principles, integrating IT elements across key city verticals such as power, water, waste, transport, etc., was presented. One hopes that cities are able to incorporate the key elements of the proposed architecture along one vertical, and when a sufficient number of verticals have been addressed, a unified city-scale architecture can be achieved. The instantiation of the sustainability framework and the architecture for data centers and cities will require a multi-disciplinary workforce. Given the need to develop the human capital, the specific call to action at this venue is to:

• Leverage the past, and return to "old school" core engineering, to build the foundational elements of the supply-demand architecture using lifecycle design and supply-side design principles.

• Create a multi-disciplinary curriculum composed of various fields of engineering, e.g., a melding of computer science and mechanical engineering to scale the supply-demand side management of power, etc.
  o The curriculum also requires social and economic tracks, as sustainability in its broadest definition is defined by the economic, social and environmental spheres—the triple bottom line—and requires us to operate at the intersection of these spheres.

References

1. United Nations Population Division, http://www.un.org/esa/population
2. Aguiar, M., Boutenko, V., Michael, D., Rastogi, V., Subramanian, A., Zhou, Y.: The Internet's New Billion. Boston Consulting Group Report (2010)
3. Chopra, A.: $35 Computer Taps India's Huge Low-Income Market. Christian Science Monitor (2010)
4. Patel, C., Shah, A.: Cost Model for Planning, Development and Operation of a Data Center. HP Laboratories Technical Report HPL-2005-107R1, Palo Alto, CA (2005)
5. Moran, M.J.: Availability Analysis: A Guide to Efficient Energy Use. Prentice-Hall, Englewood Cliffs (1982)
6. Barbir, F.: PEM Fuel Cells, pp. 18–25. Elsevier Academic Press (2005)
7. Rao, S., Parulekar, B.B.: Energy Technology. Khanna Publishers (2005)
8. Sharma, R., Shah, A., Bash, C.E., Patel, C.D., Christian, T.: Water Efficiency Management in Datacenters. In: International Conference on Water Scarcity, Global Changes and Groundwater Management Responses, Irvine, CA (2008)
9. Shah, A., Patel, C., Carey, V.: Exergy-Based Metrics for Environmentally Sustainable Design. In: 4th International Exergy, Energy and Environment Symposium, Sharjah (2009)
10. Hannemann, C., Carey, V., Shah, A., Patel, C.: Life-cycle Exergy Consumption of an Enterprise Server. International Journal of Exergy 7(4), 439–453 (2010)
11. Patel, C.D., Bash, C.E., Sharma, R., Friedrich, R.: Smart Cooling of Data Centers. In: ASME IPACK, Maui, Hawaii (2003)
12. Patel, C., Sharma, R., Bash, C., Beitelmal, A.: Thermal Considerations in Data Center Design. In: IEEE ITherm, San Diego (2002)
13. Bash, C.E., Patel, C.D., Sharma, R.K.: Dynamic Thermal Management of Air Cooled Data Centers. In: IEEE ITherm (2006)
14. Beitelmal, A., Patel, C.D.: Thermo-fluids Provisioning of a High Density Data Center. HP Labs External Technical Report HPL-2004-146R1 (2004)
15. Chen, Y., Gmach, D., Hyser, C., Wang, Z., Bash, C., Hoover, C., Singhal, S.: Integrated Management of Application Performance, Power and Cooling in Data Centers. In: 12th IEEE/IFIP Network Operations and Management Symposium (NOMS), Osaka (2010)
16. Patel, C.D., Sharma, R.K., Bash, C.E., Beitelmal, M.: Energy Flow in the Information Technology Stack: Introducing the Coefficient of Performance of the Ensemble. In: ASME International Mechanical Engineering Congress & Exposition, Chicago, Illinois (2006)
17. Shah, A., Patel, C.D.: Exergo-Thermo-Volumes: An Approach for Environmentally Sustainable Thermal Management of Energy Conversion Devices. Journal of Energy Resources Technology, Special Issue on Energy Efficiency, Sources and Sustainability 2 (2010)
18. Shah, A., Patel, C.: Designing Environmentally Sustainable Systems using Exergo-Thermo-Volumes. International Journal of Energy Research 33 (2009)
19. Sharma, R., Bash, C.E., Marwah, M., Christian, T., Patel, C.D.: Microgrids: A New Approach to Supply-Side Design of Data Centers. In: ASME IMECE, Lake Buena Vista, FL (2009)
20. Sharma, R., Christian, T., Arlitt, M., Bash, C., Patel, C.: Design of Farm Waste-driven Supply Side Infrastructure for Data Centers. In: ASME Energy Sustainability (2010)
21. Sharma, R., Bash, C.E., Patel, C.D., Friedrich, R.S., Chase, J.: Balance of Power: Dynamic Thermal Management of Internet Data Centers. IEEE Computer (2003)
22. Marwah, M., Sharma, R.K., Patel, C.D., Shih, R., Bhatia, V., Rajkumar, V., Mekanapurath, M., Velayudhan, S.: Data Analysis, Visualization and Knowledge Discovery in Sustainable Data Centers. In: Compute 2009, Bangalore (2009)
23. Patnaik, D., Marwah, M., Sharma, R., Ramakrishnan, N.: Sustainable Operation and Management of Data Center Chillers using Temporal Data Mining. In: ACM KDD (2009)
24. Gmach, D., Rolia, J., Bash, C., Chen, Y., Christian, T., Shah, A., Sharma, R., Wang, Z.: Capacity Planning and Power Management to Exploit Sustainable Energy. In: 6th International Conference on Network and Service Management (CNSM), Niagara Falls, Canada (2010)
25. Bash, C., Christian, T., Marwah, M., Patel, C., Shah, A., Sharma, R.: City 2.0: Leveraging Information Technology to Build a New Generation of Cities. Silicon Valley Engineering Council (SVEC) Journal 1(1), 1–6 (2009)
26. Hoover, C.E., Sharma, R., Watson, B., Charles, S.K., Shah, A., Patel, C.D., Marwah, M., Christian, T., Bash, C.E.: Sustainable IT Ecosystems: Enabling Next-Generation Cities. HP Laboratories Technical Report HPL-2010-73 (2010)
27. Water harvesting, http://megphed.gov.in/knowledge/RainwaterHarvest/Chap2.pdf
28. Cunha, I., Almeida, J., Almeida, V., Santos, M.: Self-adaptive Capacity Management for Multi-tier Virtualized Environments. In: 10th IFIP/IEEE International Symposium on Integrated Network Management (IM) (2007)
29. Khanna, G., Beaty, K., Kar, G., Kochut, A.: Application Performance Management in Virtualized Server Environments. In: IEEE/IFIP NOMS (2006)
30. Gmach, D., Rolia, J., Cherkasova, L., Kemper, A.: Capacity Management and Demand Prediction for Next Generation Data Centers. In: IEEE ICWS, Salt Lake City (2007)
31. Kephart, J., Chan, H., Das, R., Levine, D., Tesauro, G., Rawson, F., Lefurgy, C.: Coordinating Multiple Autonomic Managers to Achieve Specified Power-Performance Tradeoffs. In: 4th IEEE International Conference on Autonomic Computing (ICAC) (2007)
32. Barroso, L.A., Hölzle, U.: The Case for Energy-Proportional Computing. IEEE Computer 40(12), 33–37 (2007)
33. Andrzejak, A., Arlitt, M., Rolia, J.: Bounding the Resource Savings of Utility Computing Models. HP Laboratories Technical Report HPL-2002-339 (2002)
34. Patnaik, D., Marwah, M., Sharma, R., Ramakrishnan, N.: Data Mining for Modeling Chiller Systems in Data Centers. In: IDA (2010)

Unclouded Vision

Jon Crowcroft1, Anil Madhavapeddy1, Malte Schwarzkopf1, Theodore Hong1, and Richard Mortier2

1 Cambridge University Computer Laboratory, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK
[email protected]
2 Horizon Digital Economy Research, University of Nottingham, Triumph Road, Nottingham NG7 2TU, UK
[email protected]

Abstract. Current opinion and debate surrounding the capabilities and use of the Cloud is particularly strident. By contrast, the academic community has long pursued completely decentralised approaches to service provision. In this paper we contrast these two extremes, and propose an architecture, Droplets, that enables a controlled trade-oﬀ between the costs and beneﬁts of each. We also provide indications of implementation technologies and three simple sample applications that substantially beneﬁt by exploiting these trade-oﬀs.

1 Introduction

The commercial reality of the Internet and mobile access to it is muddy. Generalising, we have a set of cloud service providers (e.g., Amazon, Facebook, Flickr, Google and Microsoft, to name a representative few), and a set of devices that many – and soon most – people use to access these resources (so-called smartphones, e.g., Blackberry, iPhone, Maemo, Android devices). This combination of hosted services and smart access devices is what many people refer to as "The Cloud" and is what makes it so pervasive. But this situation is not entirely new. Once upon a time, looking as far back as the 1970s, we had "thin clients" such as ultra-thin glass ttys accessing timesharing systems. Subsequently, the notion of thin client has periodically resurfaced in various guises such as the X-Terminal and Virtual Networked Computing (VNC) [14]. Although the world is not quite the same now as back in those thin client days, it does seem similar in economic terms. But why is it not the same? Why should it not be the same? The short answer is that the end user, whether in their home or on the top of the Clapham Omnibus,1 has in their pocket a device with vastly more resource than a mainframe of the 1970s by any measure, whether processing speed, storage capacity or network access rate. With this much power at our fingertips, we should be able to do something smarter than simply using our devices as vastly over-specified dumb terminals.

1 http://en.wikipedia.org/wiki/The_man_on_the_Clapham_omnibus

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 29–40, 2011. © Springer-Verlag Berlin Heidelberg 2011


Meanwhile, the academic reality is that many people have been working at the opposite extreme from this commercial reality, trying to build “ultra-distributed” systems, such as peer-to-peer ﬁle sharing, swarms,2 ad-hoc mesh networks, mobile decentralised social networks,3 in complete contrast to the centralisation trends of the commercial world. We choose to coin the name “The Mist” for these latter systems. The deﬁning characteristic of the Mist is that data is dispersed among a multitude of responsible entities (typically, though not exclusively, ordinary users), rather than being under the control of a single monolithic provider. Haggle [17], Mirage [11] and Nimbus [15] are examples of architectures for, respectively, the networking, operating system and storage components of the Mist. The Cloud and the Mist are extreme points in a spectrum, each with its upsides and downsides. Following a discussion of users’ incentives (§2), we will expand on the capabilities of two instances of these ends later (§3). We will then describe our proposed architecture (§4) and discuss its implications for three particular application domains (§5), before concluding (§6).

2 User Incentives

For the average user, accustomed to doing plain old storage and computation on their own personal computer or mobile (what we might term “The Puddle”), there are multiple competing incentives pushing in many directions: both towards and away from the Cloud, and towards and away from the Mist (see Figure 1).

[Figure 1 shows the Puddle (the user's own device) pulled between the Cloud and the Mist along the dimensions Sharing, Sync, Data Location, Speed and Security: the Cloud attracts with social sharing, ease of use, central management, scalability and virus protection; the Mist with default privacy, no lock-in, physical control, high bandwidth and hacking protection.]

Fig. 1. Incentives pushing users toward the centralised Cloud vs. the decentralised Mist

2 http://bittorrent.com/
3 http://joindiaspora.com, http://peerson.net/


Consider some of the forms of utility a user wants from their personal data:

– Sharing. There is a tension between the desire to share some personal data easily with selected peers (or even publicly), and the need for control over more sensitive information. The social Cloud tends to share data, whereas the decentralised Mist defaults to privacy at the cost of making social sharing more difficult.
– Synchronization. The Cloud provides a centralised naming and storage service to which all other devices can point. As a downside, this service typically incurs an ongoing subscription charge while remaining vulnerable to the provider stopping the service. Mist devices work in a peer-to-peer fashion which avoids provider lock-in, but have to deal with synchronisation complexity.
– Data Location. The Cloud provides a convenient, logically centralised data storage point, but the specific location of any component is hard for the data owner to control.4 In contrast, the decentralised Mist permits physical control over where the devices are but makes it hard to reliably ascertain how robustly stored and backed-up the data is.
– Speed. A user must access a centralised Cloud via the Internet, which limits access speeds and creates high costs for copying large amounts of data. In the Mist, devices are physically local and hence have higher bandwidth. However, Cloud providers can typically scale their service much better than individuals for those occasions when "flash traffic" drives a global audience to popular content.
– Security. A user of the Mist is responsible for keeping their devices updated and can be vulnerable to malware if they fall behind. However, the damage of intrusion is limited only to their devices. In contrast, a Cloud service is usually protected by dedicated staff and systems, but presents a valuable hacking target in which any failures can have widespread consequences, exposing the personal data of millions of users. 
These examples demonstrate the clear tension between what users want from services managing their personal data vs. how Cloud providers operate in order to keep the system economically viable. Ideally, the user would like to keep their personal data completely private while still hosting it on the Cloud. On the other hand, the cloud provider needs to recoup hosting costs by, e.g., selling advertising against users' personal data. Even nominally altruistic Mist networks need incentives to keep them going: e.g., in BitTorrent it was recently shown that a large fraction of the published content is driven by profit-making companies rather than altruistic amateur filesharers [2]. Rather than viewing this as a zero-sum conflict between users and providers, we seek to leverage the smart capabilities of our devices to provide happy compromises that can satisfy the needs of all parties. By looking more closely at the true underlying interests of the different sides, we can often discover solutions that achieve seemingly incompatible goals [6].

4 http://articles.latimes.com/2010/jul/24/business/la-fi-google-la-20100724

3 The Cloud vs. the Mist

To motivate the Droplets architecture, we first examine the pros and cons of the Cloud and the Mist in more detail.

The Cloud's Benefits: Centralising resources brings several significant benefits, specifically:

– economies of scale,
– reduction in operational complexity, and
– commercial gain.

Perhaps the most significant of these is the offloading of the configuration and management burden traditionally imposed by computer systems of all kinds. Additionally, cloud services are commonly implemented using virtualisation technology, which enables statistical multiplexing and greater efficiencies of scale while still retaining "Chinese walls" that protect users from one another. As cloud services have grown, they have constructed specialised technology dedicated to the task of large-scale data storage and retrieval, for example the new crop of "NoSQL" databases in recent years [10]. Most crucially, centralised cloud services have built up valuable databases of information that did not previously exist. Facebook's "social graph" contains detailed information on the interactions of hundreds of millions of individuals every day, including private messages and media. These databases are not only commercially valuable in themselves, they can also reinforce a monopoly position, as the network effect of having sole access to this data can prevent other entrants from constructing similar databases.

The Cloud's Costs: Why should we trust a cloud provider with our personal data? There are many ways in which they might abuse that trust, data protection legislation notwithstanding. The waters are further muddied by the various commercial terms and conditions to which users initially sign up, but which providers often evolve over time. When was the last time you checked the URL to which your providers post alterations to their terms and conditions, privacy policies, etc.? 
Even if you object to a change, can you get your data back and move it to another provider, and ensure that they have really deleted it?

The Mist's Benefits: Accessing the Cloud can be financially costly due to the need for constant high-bandwidth access. Using the Mist, we can reduce our access costs because data is stored (cached) locally and need only be uploaded to others selectively and intermittently. We keep control over privacy, choosing exactly what to share with whom and when. We also have better access to our data: we retain control over the interfaces used to access it; we are immune to service disruptions which might affect the network or cloud provider; and we cannot be locked out from our own data by a cloud provider.

The Mist's Costs: Ensuring reliability and availability in a distributed decentralised system is extremely complex. In particular, a new vector for breach of


personal data is introduced: we might leave our fancy device on top of the aforesaid Clapham Omnibus with our data in it! We have to manage the operation of the system ourselves, and need to be connected often enough for others to be able to contact us.

Droplets: A Happy Compromise? In between these two extremes should lie the makings of a design that has all the positives and none of the negatives. In fact, a hint of a way forward is contained in the comments above. If data is encrypted both on our personal computer/device and in the Cloud, then for privacy purposes it doesn't really matter where it is physically stored. However, for performance reasons, we do care. Hence we'd like to carry information of immediate value close to us. We would also like it replicated in multiple places for reliability reasons. We also observe that the vast majority of user-generated content is of interest only within the small social circle of the content's subject/creator/producer/owner, and thus note that interest in (and popularity of) objects tends to be Zipf-distributed.

In the last paragraph, it might be unclear who "we" are: "we" refers to Joe Public, whether sitting at home or on the top of that bus. However, there is another important set of stakeholders: those who provide The Cloud and The Net. These stakeholders need to make money lest all of this fail. The service provider needs revenue to cover operational expenses and to make a profit, but is loath to charge the user directly. Even in the case of the network, ISPs (and 3G providers) are mostly heading toward flat data rates. As well as targeted advertisements and associated "click-through" revenue, service providers also want to carry out data mining to do market research of a more general kind. Fortunately, recent advances in cryptography and security hint at ways to continue to support the two-sided business models that abound in today's Internet. 
In the case of advertising, the underlying interest of the Cloud provider is actually the ability to sell targeted ads, not to know everything about its users. Privacy-preserving query techniques can permit ads to be delivered to users matching certain criteria without the provider actually knowing which users they were [8,9,16]. In the case of data mining on the locations or transactions of users, techniques such as diﬀerential privacy [5] and k-anonymity [18] can allow providers to make queries on aggregate data without being able to determine information about speciﬁc users. So we propose Droplets, half way between the Cloud and the Mist. Droplets make use of the Mirage operating system [11], Nimbus storage [15] and Haggle networking [17]. They ﬂoat between the personal device and the cloud, using technologies such as social networks, virtualisation and migration [1,3], and they provide the basic components of a Personal Container [12]. They condense within social networks, where privacy is assured by society, but in the great unwashed Internet, they stay opaque. The techniques referred to above allow the service providers to continue to provide the storage, computation, indexing, search and transmission services that they do today, with the same wide range of business models.
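To make the aggregate-query idea concrete, the following sketch applies the standard Laplace mechanism of differential privacy [5] to a simple count query; the dataset, predicate and epsilon are illustrative assumptions, not a production mechanism.

```python
# Sketch of the kind of aggregate query differential privacy [5] enables: the
# provider learns a noisy count over users' data without learning any single
# individual's record. Standard Laplace mechanism; parameters illustrative.
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon=0.5):
    """Differentially private count: a count query has sensitivity 1, so
    adding Laplace(1/epsilon) noise gives epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
users = [{"age": a} for a in range(100)]
noisy = private_count(users, lambda u: u["age"] >= 50, epsilon=0.5)
# close to the true count of 50, without revealing any individual record
```

The provider can thus answer "how many users match this advertising segment?" usefully while each individual retains plausible deniability about their own record.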

4 Droplets

Droplets are units of network-connected computation and storage, designed to migrate around the Internet and personal devices. At a droplet's core is the Mirage operating system, which compiles high-level language code into specialised targets such as Xen micro-kernels, UNIX binaries, or even Javascript applications. The same Mirage source code can thus run on a cloud computing platform, within a user's web browser, on a smart-phone, or even as a plugin on a social network's own servers. As we note in Table 1, there is no single "perfect" location where a Droplet should run all the time, and so this agility of placement is crucial to maximising satisfaction of the users' needs while minimising their costs and risks.

Table 1. Comparison of different potential Droplets platforms

                Google AppEngine   VM (e.g., on EC2)    Home Computer      Mobile Phone
Storage         moderate           moderate             high               low
Bandwidth       high               high                 limited            low
Accessibility   always on          always on            variable           variable
Computation     limited            flexible, plentiful  flexible, limited  limited
Cost            free               expensive            cheap              cheap
Reliability     high               high                 medium (failure)   low (loss)
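The placement agility discussed around Table 1 can be sketched as a simple scoring scheme; the numeric trait scores below are a loose reading of the table, and the weights and names are hypothetical, not a real Droplet scheduler.

```python
# Hedged sketch of droplet placement: score each candidate platform against
# the droplet's current needs and pick the best. Trait scores loosely follow
# Table 1 (3 = good, 1 = poor); everything here is illustrative.

PLATFORMS = {
    "appengine": {"storage": 2, "bandwidth": 3, "compute": 1, "cost": 3, "reliability": 3},
    "ec2-vm":    {"storage": 2, "bandwidth": 3, "compute": 3, "cost": 1, "reliability": 3},
    "home":      {"storage": 3, "bandwidth": 1, "compute": 2, "cost": 3, "reliability": 2},
    "mobile":    {"storage": 1, "bandwidth": 1, "compute": 1, "cost": 3, "reliability": 1},
}

def best_platform(weights):
    """weights: how much the droplet currently values each trait (0-1)."""
    def score(traits):
        return sum(weights.get(k, 0) * v for k, v in traits.items())
    return max(PLATFORMS, key=lambda p: score(PLATFORMS[p]))

# a compute-hungry droplet tolerating cost picks the VM;
# a cost-sensitive archival droplet prefers the home machine
```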

Storage in such an environment presents a notable challenge, which we address via the Nimbus system, a distributed, encrypted and delay-tolerant personal data store. Working on the assumption that personal data access follows a Zipf power-law distribution, popular objects can be kept live on relatively expensive but low-latency platforms such as a Cloud virtual machine, while older objects can be archived inexpensively but safely on a storage device at home. Nimbus also provides local attestation in the form of "trust fountains," which let nodes provide a cryptographic attestation witnessing another node's presence or ownership of some data. Trust fountains are entirely peer-to-peer, and so proof is established socially (similarly to the use of lawyers or public notaries) rather than via a central authority.

Haggle provides a delay-tolerant networking platform, in which all nodes are mobile and can relay messages via various routes. Even with the use of central "stable" nodes such as the Cloud, outages will still occur due to the scale and dynamics of the Cloud and the Net, as has happened several times to such high-profile and normally robust services as Gmail. During such events, the user must not lose all access to their data, and so the Haggle delay-tolerant model is a good fit. It is also interesting to observe that many operations performed by users are quite latency-insensitive, e.g., backups can be performed incrementally, possibly overnight.
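Under the stated Zipf assumption, Nimbus-style tier assignment might look like the following sketch; the hot-fraction threshold and object names are illustrative assumptions, not the Nimbus design.

```python
# Sketch of tiering under the Zipf assumption above: a small fraction of
# frequently accessed objects stays on a low-latency (expensive) cloud tier,
# while the long tail is archived on a cheap home device.

def assign_tiers(access_counts, hot_fraction=0.1):
    """access_counts: {object_id: count}. Returns {object_id: 'cloud'|'home'}.

    With Zipf-distributed popularity, a small hot_fraction of objects
    captures most accesses, so the expensive cloud tier stays small.
    """
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    n_hot = max(1, int(len(ranked) * hot_fraction))
    return {obj: ("cloud" if i < n_hot else "home") for i, obj in enumerate(ranked)}

counts = {"photo%d" % i: 1000 // (i + 1) for i in range(20)}   # roughly Zipfian
tiers = assign_tiers(counts)
# only the two most popular photos occupy the expensive cloud tier
```

A real store would re-evaluate assignments periodically as access patterns drift, migrating objects between tiers delay-tolerantly.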

4.1 Deployment Model

Droplets are a compromise between the extremely-distributed Mist model and the more centralised Cloud. They store a user's data and provide a network interface to this data rather than exposing it directly. The nature of this access depends on where the Droplet has condensed.

– Internet droplet. If the Droplet is running exposed to the wild Internet, then the network interfaces are kept low-bandwidth and encrypted by default. To prevent large-scale data leaks, the Droplet rejects operations that would download or erase a large body of data.
– Social network droplet. For hosting data, a droplet can condense directly within a social network, where it provides access to its database to the network, e.g., for data mining, in return for "free" hosting. Rather than allowing raw access, it can be configured to permit only aggregate queries that help populate the provider's larger database, while still keeping track of its own data.
– Mobile droplet. The Droplet provides high-bandwidth, unfettered access to data. It also regularly checks with any known peers for a remote wipe instruction, which causes it to permanently stop serving data.
– Archiver droplet. Usually runs on a low-power device, e.g., an ARM-based BeagleBoard, accepting streams of data changes but not itself serving data. Its resources are used to securely replicate long-term data, ensuring it remains live, and to alert the user in case of significant degradation.
– Web droplet. A Droplet in a web browser executes as a local Javascript application, where it can provide web bookmarklet services, e.g., trusted password storage. It uses cross-domain AJAX to update a more reliable node with pertinent data changes.

Droplets can thus adapt their external interfaces depending on where they are deployed, allowing negotiation of an acceptable compromise between hosting costs and desire for privacy.
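These deployment-dependent interfaces can be sketched as policy objects fronting the same droplet data; the class and method names below are hypothetical illustrations of the behaviours described, not an actual Droplets API.

```python
# Hedged sketch: the same droplet data is fronted by a different access policy
# depending on where it has condensed. Names and thresholds are assumptions.

class DropletPolicy:
    def allow(self, op: str, size_mb: float) -> bool:
        raise NotImplementedError

class InternetDroplet(DropletPolicy):
    """Exposed to the wild Internet: no bulk export or erase."""
    BULK_LIMIT_MB = 10.0
    def allow(self, op, size_mb):
        if op in ("download", "erase") and size_mb > self.BULK_LIMIT_MB:
            return False          # reject large-scale data leaks
        return True

class SocialNetworkDroplet(DropletPolicy):
    """Hosted by a social network: aggregate queries only, no raw access."""
    def allow(self, op, size_mb):
        return op == "aggregate_query"

class MobileDroplet(DropletPolicy):
    """On the owner's device: unfettered access until remotely wiped."""
    def __init__(self):
        self.wiped = False
    def allow(self, op, size_mb):
        return not self.wiped
```

For example, a 500 MB bulk download is refused by an Internet droplet but permitted on the owner's own device, matching the compromise between leak prevention and local convenience described above.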

4.2 Trust Fountains

To explain trust fountains by way of example, consider the following. As part of the instantiation of their Personal Container, Joe Public runs an instance of a Nimbus trust fountain. When creating a droplet from some data stored in his Personal Container, this trust fountain creates a cryptographic attestation proving Joe's ownership of the data at that time, in the form of a time-dependent hash token. The droplet is then encrypted under this hash token using a fast, medium-strength cipher5 and pushed out to the cloud. By selectively publishing the token, Joe can grant access to the published droplet, e.g., allowing data mining access to a provider in exchange for free data storage and hosting.

5 Strong encryption is not required as the attestations are unique for each droplet publication and breaking one does not grant an attacker access to any other droplets.

Alternatively,


the token might only be shared with a few friends via an ad hoc wireless network in a coffee shop, granting them access only to that specific data at that particular time.
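A plausible construction for such a time-dependent hash token is an HMAC over a digest of the data and a timestamp, which only the holder of the fountain's key can regenerate. This is an illustrative assumption, not the actual Nimbus construction.

```python
# Illustrative sketch of a trust-fountain attestation: a time-dependent token
# witnessing that this fountain held this data at this time. Publishing the
# token grants access; the fountain's issue log can later regenerate it to
# prove origin. The HMAC construction is an assumption, not the Nimbus design.
import hashlib, hmac

def attestation_token(fountain_key: bytes, data: bytes, timestamp: int) -> bytes:
    """Only the holder of fountain_key can (re)generate this token."""
    digest = hashlib.sha256(data).digest()
    msg = digest + timestamp.to_bytes(8, "big")
    return hmac.new(fountain_key, msg, hashlib.sha256).digest()

key = b"joe-fountain-secret"
t1 = attestation_token(key, b"photo-bytes", 1293868800)
t2 = attestation_token(key, b"photo-bytes", 1293868800)
# deterministic for the fountain (t1 == t2), but changes with time or data,
# and cannot be forged without the fountain's key
```

This also matches the deniability property: without the fountain's key, nobody else can demonstrate that a given token was derived from Joe's data.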

4.3 Backwards Provenance

A secondary purpose of the attestation is to enable "backwards provenance", i.e., a way to prove ownership. Imagine that Joe publishes a picture of some event which he took using his smartphone while driving past it on that oft-considered bus. A large news agency picks up and uses that picture after Joe publishes it to his Twitter stream using a droplet. The attached attestations then enable the news agency to compensate both the owner and potentially the owner's access provider, who takes a share in all profits made from Joe's digital assets in exchange for serving them. Furthermore, Joe is given a tool to counter "hijacking" of his creation even if the access token becomes publicly known: using the cryptographic properties of the token, the issue log of his trust fountain, together with his provider's confirmation of receipt of the attested droplet, forms sufficient evidence to prove ownership and take appropriate legal action. Note that Joe Public can also deny ownership if he chooses, as only his trust fountain holds the crucial information necessary to regenerate the hash token and thus prove the attestation's origin.

Handling 15 Minutes of Fame

Of course, whenever a droplet becomes sufficiently popular to merit condensation into a cloud burst of marketing, we have the means to support this transition, and we have the motivation and incentives to make sure the right parties are rewarded. In this last paragraph, "we" refers to all stakeholders: users, government and business. It seems clear that the always-on, everywhere-logged, ubiquitously-connected vision will continue to be built, while real people become increasingly concerned about their privacy [4]. Without such privacy features, it is unclear for how much longer the commercial exploitation of personal data will continue to be acceptable to the public; but without such exploitation, it is unclear how service providers can continue to provide the many "free" Internet services on which we have come to rely.

5 Droplications

The Droplet model requires us to rethink how we construct applications – rather than building centralised services, they must now be built according to a distributed, delay-tolerant model. In this section, we discuss some of the early services we are building.

Unclouded Vision

5.1 Digital Yurts

In the youthful days of the Internet, there was a clear division between public data (web homepages, FTP sites, etc.) and private (e-mail, personal documents, etc.). It was common to archive personal e-mail, home directories and so on, and thus to keep a simple history of all our digital activities. The pace of change in recent years has been tremendous, not only in the variety of personal data, but in where that data is held. It has moved out of the confines of desktop computers to data-centres hosted by third parties such as Google, Yahoo and Facebook, who provide "free" hosting of data in return for mining information from millions of users to power advertising platforms. These sites are undeniably useful, and hundreds of millions of users voluntarily surrender private data in order to easily share information with their circle of friends. Hence, the variety of personal data available online is booming – from media (photographs, videos), to editorial (blogging, status updates), and streaming (location, activity).

However, privacy is rapidly rising up the agenda as companies such as Facebook and Google collect vast amounts of data from hundreds of millions of users. Unfortunately, the only alternative that privacy-sensitive users currently have is to delete their online accounts, losing both access to and what little control they have over their online social networks. Often, deletion does not even completely remove their online presence. We have become digital nomads: we have to fetch data from many third-party hosted sites to recover a complete view of our online presence. Why is it so difficult to go back to managing our own information, using our own resources? Can we do so while keeping the "good bits" of existing shared systems, such as ease-of-use, serendipity and aggregation? Although the immediate desire to regain control of our privacy is a key driver, there are several other longer-term concerns about third parties controlling our data.
The incentives of hosting providers are not aligned with the individual: we care about preserving our history over our lifetime, whereas the provider will choose to discard information when it ceases to be useful for advertising. This is where the Droplet model is useful – rather than dumbly storing data, we can also negotiate access to that data with hosting providers via an Internet droplet, and arrive at a compromise between letting them data-mine it and the costs of hosting it. When the hosting provider loses interest in the older, historical tail of data, the user can deploy an archival droplet to catch the data before it disappears, and archive it for later retrieval.

5.2 Dust Clouds

Dust Clouds [13] is a proposal for the provision of secure anonymous services using extremely lightweight virtual machines hosted in the cloud. As they are lightweight, they can be created and destroyed with very short lifetimes, yet still achieve useful work. However, several tensions exist between the requirements of users and cloud providers in such a system.

For example, cloud providers have a strong requirement for a variety of auditing functions. They need to know who consumed what resources in order to bill, to provision appropriately, to ensure that, e.g., upstream service level agreements with other providers are met, and so on. They would tend to prefer centralisation for reasons already mentioned (efficiency, economy of scale, etc.). By contrast, individual consumers use such a system precisely because it provides anonymity while they are doing things that they wish not to be attributed to them, e.g., to avoid arrest. Anonymity in a dust cloud is largely provided by having a rich mixnet of traffic and other resource consumption. Consumers would also prefer diversity, in both geography and provider, to ensure that they are not at the mercy of a single judicial/regulatory system.

Pure cloud approaches fail the users' requirements by putting too much control in the hands of one (or a very small number of) cloud providers. Pure mist approaches fail the user by being unable to provide the richness of mixing needed for sufficient anonymity: many of the devices in the mist are either insufficiently powerful or insufficiently well-connected to support a large enough number of users' processes. By taking a droplets approach we obviate both these issues: the lightweight nature of VM provisioning means that it becomes largely infeasible for the cloud provider to track in detail what users are doing, particularly when critical parts of the overall distributed process/communication are hosted on non-cloud infrastructure. Local auditing for payment recovery based on resources used is still possible, but the detailed correlation and reconstruction of an individual process's behaviour becomes effectively impossible. At the same time, the scalable and generally efficient nature of cloud-hosted resources can be leveraged to ensure that the end result is itself suitably scalable.

5.3 Evaporating Droplets

With Droplets, we also have a way of creating truly ephemeral data items in a partially trusted or untrusted environment, such as a social network, or the whole Internet. Since Droplets have the ability to do computation, they can refuse to serve data if access prerequisites are not met: for example, time-dependent hashes created from a key and a time stamp can be used to control access to data in a Droplet. Periodically, the user's "trust fountain" will issue new keys, notifying the Droplet that it should now accept the new key only. To "evaporate" data in a Droplet, the trust fountain simply ceases to provide keys for it, thus making users unable to access the Droplet, even if they still have the binary data or even the Droplet itself (assuming, of course, that brute-forcing the hash key is not a worthwhile option). Furthermore, their access is revoked even in a disconnected state, i.e. when the Droplet cannot be notified to accept only the new hash key: since it is necessary to provide the time stamp and key as authentication tokens in order for the Droplet to generate the correct hash, expired keys can no longer be used, as they have to be provided along with their genuine origin time stamp. Additionally, as a more secure approach, the Droplet could even periodically re-encrypt its contents in order to combat brute-forcing.

This subject has been of some research interest recently. Another approach [7] relies on statistical metrics that require increasingly large amounts of data from a DHT to be available to an attacker in order to reconstruct the data, but is vulnerable to certain Sybil attacks [19]. Droplets, however, have the power of being able to completely ensure that all access to data is revoked, even when facing a powerful adversary in a targeted attack. Furthermore, as a side effect of the hash-key based access control, the evaporating Droplet could serve different views, or stages of evaporation, to different requesters depending on the access key they use (or its age). Finally, the "evaporating Droplet" can be made highly accessible from a user perspective by utilizing a second Droplet: a Web Droplet (see §4.1) that integrates with a browser can automate the process of requesting access keys from trust fountains and unlocking the evaporating Droplet's contents.
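To make the key-rotation mechanics concrete, the following is a minimal Python sketch of the scheme described above. It is our own illustration, not the authors' Nimbus implementation: the class names, the rotation period, and the key-derivation choices (HMAC-SHA256 over the period's time stamp) are all invented for the example. The droplet recomputes the hash from the presented (key, time stamp) pair, so an expired key fails even when the droplet is disconnected, because its genuine origin time stamp no longer matches the current period.

```python
import hashlib
import hmac

PERIOD = 3600  # key rotation interval in seconds; an illustrative choice


class TrustFountain:
    """Issues time-dependent access keys; ceasing issuance 'evaporates' the data."""

    def __init__(self, master_secret: bytes):
        self._master = master_secret
        self.evaporated = False

    def issue_key(self, now: float):
        """Return (key, origin time stamp) for the current period."""
        if self.evaporated:
            raise PermissionError("data has evaporated: no further keys issued")
        ts = int(now) // PERIOD
        key = hmac.new(self._master, str(ts).encode(), hashlib.sha256).digest()
        return key, ts


class Droplet:
    """Serves its data only to requesters holding the currently accepted key."""

    def __init__(self, data: bytes, accepted_hash: bytes, period: int):
        self._data = data
        self._hash = accepted_hash  # hash of (time stamp, key) the droplet accepts
        self._period = period

    def rotate(self, new_hash: bytes, new_period: int):
        """Notification from the trust fountain: accept the new key only."""
        self._hash, self._period = new_hash, new_period

    def request(self, key: bytes, ts: int) -> bytes:
        # The requester must supply both the key and its genuine origin
        # time stamp; a stale key fails because its time stamp no longer
        # matches the droplet's current period.
        presented = hashlib.sha256(str(ts).encode() + key).digest()
        if ts == self._period and hmac.compare_digest(presented, self._hash):
            return self._data
        raise PermissionError("stale or invalid access key")
```

In this sketch, evaporation needs no final message to the droplet: once the fountain stops issuing keys, every copy of the droplet becomes inaccessible as soon as its current period expires.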

6 Conclusions and Future Work

In this paper, we have discussed the tension between the capabilities of and demands on the Cloud and the Mist. We concluded that both systems are at opposite ends of a spectrum of possibilities and that compromise between providers and users is essential. From this, we derived an architecture for an alternative system, Droplets, that enables control over the trade-offs involved, resulting in systems acceptable to both hosting providers and users. Having realised two of the main components involved in Droplets, Haggle networking and the Mirage operating system, we are now completing realisation of the third, Nimbus storage, as well as building some early "droplications".

References

1. Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I., Warfield, A.: Live migration of virtual machines. In: USENIX Symposium on Networked Systems Design & Implementation (NSDI), pp. 273–286. USENIX Association, Berkeley (2005)
2. Cuevas, R., Kryczka, M., Cuevas, A., Kaune, S., Guerrero, C., Rejaie, R.: Is content publishing in BitTorrent altruistic or profit-driven? (July 2010), http://arxiv.org/abs/1007.2327
3. Cully, B., Lefebvre, G., Meyer, D.T., Karollil, A., Feeley, M.J., Hutchinson, N.C., Warfield, A.: Remus: High availability via asynchronous virtual machine replication. In: USENIX Symposium on Networked Systems Design & Implementation (NSDI). USENIX Association, Berkeley (April 2008)
4. Doctorow, C.: The Things that Make Me Weak and Strange Get Engineered Away. Tor.com (August 2008), http://www.tor.com/stories/2008/08/weak-and-strange
5. Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006)
6. Fisher, R., Patton, B.M., Ury, W.L.: Getting to Yes: Negotiating Agreement Without Giving In. Houghton Mifflin (April 1992)
7. Geambasu, R., Kohno, T., Levy, A., Levy, H.M.: Vanish: Increasing data privacy with self-destructing data. In: Proceedings of the USENIX Security Symposium (August 2009)
8. Guha, S., Reznichenko, A., Tang, K., Haddadi, H., Francis, P.: Serving Ads from localhost for Performance, Privacy, and Profit. In: Proceedings of Hot Topics in Networking (HotNets), New York, NY (October 2009)
9. Haddadi, H., Hui, P., Brown, I.: MobiAd: Private and scalable mobile advertising. In: Proceedings of MobiArch (to appear, 2010)
10. Leavitt, N.: Will NoSQL databases live up to their promise? Computer 43(2), 12–14 (2010), http://dx.doi.org/10.1109/MC.2010.58
11. Madhavapeddy, A., Mortier, R., Sohan, R., Gazagnaire, T., Hand, S., Deegan, T., McAuley, D., Crowcroft, J.: Turning down the LAMP: software specialisation for the cloud. In: HotCloud 2010: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, p. 11. USENIX Association, Berkeley (2010)
12. Mortier, R., et al.: The Personal Container, or Your Life in Bits. In: Proceedings of Digital Futures (October 2010)
13. Mortier, R., Madhavapeddy, A., Hong, T., Murray, D., Schwarzkopf, M.: Using Dust Clouds to enhance anonymous communication. In: Proceedings of the Eighteenth International Workshop on Security Protocols, IWSP (April 2010)
14. Richardson, T., Stafford-Fraser, Q., Wood, K.R., Hopper, A.: Virtual network computing. IEEE Internet Computing 2(1), 33–38 (1998)
15. Schwarzkopf, M., Hand, S.: Nimbus: Intelligent Personal Storage. Poster at the Microsoft Research Summer School 2010, Cambridge, UK (2010)
16. Shikfa, A., Önen, M., Molva, R.: Privacy in content-based opportunistic networks. In: AINA Workshops, pp. 832–837 (2009)
17. Su, J., Scott, J., Hui, P., Crowcroft, J., De Lara, E., Diot, C., Goel, A., Lim, M.H., Upton, E.: Haggle: Seamless networking for mobile applications. In: Krumm, J., Abowd, G.D., Seneviratne, A., Strang, T. (eds.) UbiComp 2007. LNCS, vol. 4717, pp. 391–408. Springer, Heidelberg (2007)
18. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 557–570 (2002)
19. Wolchok, S., Hofmann, O., Heninger, N., Felten, E., Halderman, J., Rossbach, C., Waters, B., Witchel, E.: Defeating Vanish with low-cost Sybil attacks against large DHTs. In: Proceedings of the 17th Network and Distributed System Security Symposium (NDSS), pp. 37–51 (2010)

Generating Fast Indulgent Algorithms

Dan Alistarh¹, Seth Gilbert², Rachid Guerraoui¹, and Corentin Travers³

¹ EPFL, Switzerland
² National University of Singapore
³ Université de Bordeaux 1, France

Abstract. Synchronous distributed algorithms are easier to design and prove correct than algorithms that tolerate asynchrony. Yet, in the real world, networks experience asynchrony and other timing anomalies. In this paper, we address the question of how to efficiently transform an algorithm that relies on synchronization into an algorithm that tolerates asynchronous executions. We introduce a transformation technique from synchronous algorithms to indulgent algorithms [1], which induces only a constant overhead in terms of time complexity in well-behaved executions. Our technique is based on a new abstraction we call an asynchrony detector, which the participating processes implement collectively. The resulting transformation works for a large class of colorless tasks, including consensus and set agreement. Interestingly, we also show that our technique is relevant for colored tasks, by applying it to the renaming problem, to obtain the first indulgent renaming algorithm.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 41–52, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

The feasibility and complexity of distributed tasks have been thoroughly studied in both the synchronous and asynchronous models. To better capture the properties of real-world systems, Dwork, Lynch, and Stockmeyer [2] proposed the partially synchronous model, in which the distributed system may alternate between synchronous and asynchronous periods. This line of research inspired the introduction of indulgent algorithms [1], i.e. algorithms that guarantee correctness and efficiency when the system is synchronous, and maintain safety even when the system is asynchronous. Several indulgent algorithms have been designed for specific distributed problems, such as consensus (e.g., [3, 4]). However, designing and proving correctness of such algorithms is usually a difficult task, especially if the algorithm has to provide good performance guarantees.

Contribution. In this paper, we introduce a general transformation technique from synchronous algorithms to indulgent algorithms, which induces only a constant overhead in terms of time complexity. Our technique is based on a new primitive called an asynchrony detector, which identifies periods of asynchrony in a fault-prone asynchronous system. We showcase the resulting transformation to obtain indulgent algorithms for a large class of colorless agreement tasks, including consensus and set agreement. We also apply our transformation to the distinct class of colored tasks, to obtain the first indulgent renaming algorithm.

Detecting Asynchrony. Central to our technique is a new abstraction, called an asynchrony detector, which we design as a distributed service for detecting periods of asynchrony. The service detects asynchrony both at a local level, by determining whether the view of a process is consistent with a synchronous execution, and at a global level, by determining whether the collective view of a set of processes could have been observed in a synchronous execution. We present an implementation of an asynchrony detector, based on the idea that each process maintains a log of the messages sent and received, which it exchanges with other processes. This creates a view of the system for every process, which we use to detect asynchronous executions.

The Transformation Technique. Based on this abstraction, we introduce a general technique allowing synchronous algorithms to tolerate asynchrony, while maintaining time efficiency in well-behaved executions. The main idea behind the transformation is the following: as long as the asynchrony detector signals a synchronous execution, processes run the synchronous algorithm. If the system is well behaved, then the synchronous algorithm yields an output, on which the process decides. Otherwise, if the detector notices asynchrony, we revert to an existing asynchronous backup algorithm with weaker termination and performance guarantees.

Transforming Agreement Algorithms. We first showcase the technique by transforming algorithms for a large class of agreement tasks, called colorless tasks, which includes consensus and set agreement. Intuitively, a colorless task allows processes to adopt each other's output values without violating the task specification, while ensuring that every value returned has been proposed by a process. We show that any synchronous algorithm solving a colorless task can be made indulgent at the cost of two rounds of communication. For example, if a synchronous algorithm solves synchronous consensus in t + 1 rounds, where t is the maximum number of crash failures (i.e.
the algorithm is time-optimal), then the resulting indulgent algorithm will solve consensus in t + 3 rounds if the system is initially synchronous, or will revert to a safe backup, e.g. Paxos [4, 5] or ASAP [6], otherwise. The crux of the technique is the hand-off procedure: we ensure that, if a process decides using the synchronous algorithm, any other process either decides or adopts a state which is consistent with the decision. In this second case, we show that a process can recover a consistent state by examining the views of other processes. The validity property will ensure that the backup protocol generates a valid output configuration.

Transforming Renaming Algorithms. We also apply our technique to the renaming problem [7], and obtain the first indulgent renaming algorithm. Starting from the synchronous protocol of [8], our protocol renames in a tight namespace of N names and terminates in (log N + 3) rounds in synchronous executions. In asynchronous executions, the protocol renames in a namespace of size N + t.

Roadmap. In Section 2, we present the model, while Section 3 presents an overview of related work. We define asynchrony detectors in Section 4. Section 5 presents the transformation for colorless agreement tasks, while Section 6 applies it to the renaming problem. In Section 7 we discuss our results. Due to space limitations, the proofs of some basic results are omitted, and we present detailed sketches for some of the proofs.


2 Model

We consider an eventually synchronous system with N processes Π = {p1, p2, . . . , pN}, in which t < N/2 processes may fail by crashing. Processes communicate via message-passing in rounds, which we model much as in [3, 9, 10]. In particular, time is divided into rounds, which are synchronized. However, the system is asynchronous, i.e. there is no guarantee that a message sent in a round is also delivered in the same round. We do assume that processes receive at least N − t messages in every round, and that a process always receives its own message in every round. Also, we assume that there exists a global stabilization time GST ≥ 0 after which the system becomes synchronous, i.e. every message is delivered in the same round in which it was sent. We denote such a system by ES(N, t). Although indulgent algorithms are designed to work in this asynchronous setting, they are optimized for the case in which the system is initially synchronous, i.e. when GST = 0. We denote the synchronous message-passing model with t < N failures by S(N, t). In case the system stabilizes at a later point in the execution, i.e. 0 < GST < ∞, then the algorithms are still guaranteed to terminate, although they might be less efficient. If the system never stabilizes, i.e. GST = ∞, indulgent algorithms might not terminate, although they always maintain safety. In the following, we say that an execution is synchronous if every message sent by a correct process in the course of the execution is delivered in the same round in which it was sent. Equivalently, if process pi receives a message m from process pj in round r ≥ 2, then every process received all messages sent by process pj in all rounds r′ < r. The view of a process p at a round r is given by the messages that p received at round r and in all previous rounds. We say that the view of process p is synchronous at round r if there exists an r-round synchronous execution which is indistinguishable from p's view at round r.
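As a concrete reading of these definitions, the following Python sketch (our own illustration; the type and function names are invented, not from the paper) records message deliveries and checks both the per-round delivery guarantee of ES(N, t) and the execution-synchrony condition.

```python
from typing import NamedTuple


class Delivery(NamedTuple):
    sender: int
    receiver: int
    sent_round: int
    recv_round: int


def execution_is_synchronous(log):
    """An execution is synchronous iff every message is delivered
    in the same round in which it was sent."""
    return all(d.sent_round == d.recv_round for d in log)


def round_is_valid(log, n, t, r):
    """ES(N, t) guarantee for round r: every process receives at least
    N - t messages in the round, always including its own."""
    for p in range(n):
        senders = {d.sender for d in log
                   if d.receiver == p and d.recv_round == r}
        if p not in senders or len(senders) < n - t:
            return False
    return True
```

Under this encoding, an execution with GST = 0 is one whose whole delivery log satisfies `execution_is_synchronous`, while an eventually synchronous execution only satisfies it for messages sent after GST.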

3 Related Work

Starting with seminal work by Dwork, Lynch and Stockmeyer [2], a variety of different models have been introduced to express relaxations of the standard asynchronous model of computation. These include failure detectors [11], round-by-round fault detectors (RRFD) [12], and, more recently, indulgent algorithms [1]. In [3, 9], Guerraoui and Dutta address the complexity of indulgent consensus in the presence of an eventually perfect failure detector. They prove a tight lower bound of t + 2 rounds on the time complexity of the problem, even in synchronous runs, thus proving that there is an inherent price to tolerating asynchronous executions. Our approach is more general than that of this reference, since we transform a whole class of synchronous distributed algorithms, solving various tasks, into their indulgent counterparts. On the other hand, since our technique induces a delay of two rounds of communication over the synchronous algorithm, in the case of consensus, we miss the lower bound of t + 2 rounds by one round. Recent work studied the complexity of agreement problems, such as consensus [6] and k-set agreement [10], if the system becomes synchronous after an unknown stabilization time GST. In [6], the authors present a consensus algorithm that terminates in


f + 2 rounds after GST , where f is the number of failures in the system. In [10], the authors consider k-set agreement in the same setting, proving that t/k + 4 rounds after GST are enough for k-set agreement, and that at least t/k + 2 rounds are required. The algorithms from these references work with the same time complexity in the indulgent setting, where GST = 0. On the other hand, the transformation in the current paper does not immediately yield algorithms that would work in a window of synchrony. From the point of view of the technique, references [6, 10] also use the idea of “detecting asynchrony” as part of the algorithms, although this technique has been generalized in the current work to address a large family of distributed tasks. Reference [13] considered a setting in which failures stop after GST , in which case 3 rounds of communication are necessary and sufficient. Leader-based, Paxos-like algorithms, e.g. [4, 5], form another class of algorithms that tolerate asynchrony, and can also be seen as indulgent algorithms. A precise definition of colorless tasks is given in [14]. Note that, in this paper, we augment their definition to include the standard validity property (see Section 5).

4 Asynchrony Detectors

An asynchrony detector is a distributed service that detects periods of asynchrony in an asynchronous system that may be initially synchronous. The service returns a YES/NO indication at the end of every round, and has the property that processes which receive YES at some round share a synchronous execution prefix. Next, we make this definition precise.

Definition 1 (Asynchrony Detector). Let d be a positive integer. A d-delay asynchrony detector in ES(N, t) is a distributed service that, in every round r, returns either YES or NO, at each process. The detector ensures the following properties.
– (Local detection) If process p receives YES at round r, then there exists an r-round synchronous execution in which p has the same view as its current view at round r.
– (Global detection) For all processes that receive YES in round r, there exists an (r − d)-round synchronous execution prefix S[1, 2, . . . , r − d] that is indistinguishable from their views at the end of round r − d.
– (Non-triviality) The detector never returns NO during a synchronous execution.

The local detection property ensures that, if the detector returns YES, then there exists a synchronous execution consistent with the process's view. On the other hand, the global detection property ensures that, for processes that receive YES from the detector, the (r − d)-round execution prefix was "synchronous enough", i.e. there exists a synchronous execution consistent with what these processes perceived during the prefix. The non-triviality property ensures that there are no false positives.

4.1 Implementing an Asynchrony Detector

Next, we present an implementation of a 2-delay asynchrony detector in ES(N, t), which we call AD(2). The pseudocode is presented in Figure 1.


The main idea behind the detector, implemented in the process procedure, is that processes maintain a detailed view of the state of the system by aggregating all messages received in every round. For each round, each process maintains an Active set of processes, i.e. processes that sent at least one message in the round; all other processes are in the Failed set for that round (lines 2–4). Whenever a process receives a new message, it merges the contents of the Active and Failed sets of the sender with its own (lines 8–9). Asynchrony is detected by checking if there exists any process that is in the Active set in some round r, while being in the Failed set in some previous round r′ < r (lines 10–12). In the next round, each process sends its updated view of the system together with a synch flag, which is set to false if asynchrony was detected.

1   procedure detector()_i
2     msg_i ← ⊥; synch_i ← true
3     Active_i ← [ ]; Failed_i ← [ ]
4     for each round Rc do
5       send(msg_i)
6       msgSet_i ← receive()
7       (synch_i, msg_i) ← process(msgSet_i, Rc)
8       if synch_i = true then output YES else output NO

1   procedure process(msgSet_i, Rc)_i
2     if synch_i = true then
3       Active_i[Rc] ← processes from which p_i receives a message in round Rc
4       Failed_i[Rc] ← processes from which p_i did not receive a message in round Rc
5       if there exists p_j ∈ msgSet_i with synch_j = false then synch_i ← false
6       for every msg_j ∈ msgSet_i do
7         for round r from 1 to Rc do
8           Active_i[r] ← msg_j.Active_j[r] ∪ Active_i[r]
9           Failed_i[r] ← msg_j.Failed_j[r] ∪ Failed_i[r]
10      for round r from 1 to Rc − 1 do
11        for round k from r + 1 to Rc do
12          if (Active_i[k] ∩ Failed_i[r] ≠ ∅) then synch_i ← false
13    if synch_i = true then
14      msg_i ← (synch_i, (Active_i[r])_{r∈[1,Rc]}, (Failed_i[r])_{r∈[1,Rc]})
15    else msg_i ← (synch_i, ⊥, ⊥)
16    return (synch_i, msg_i)

Fig. 1. The AD(2) asynchrony detection protocol
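For intuition, the view-merging and round-inconsistency checks at the heart of Figure 1 can be rendered compactly in Python. This is our own condensed sketch, not the authors' code: it represents a view as dictionaries of round-indexed sets and omits the propagation of incoming synch flags (line 5 of Figure 1).

```python
def merge_views(mine, received):
    """Merge the Active/Failed views carried by received messages into
    our own view (the analogue of lines 6-9 in Figure 1)."""
    for view in received:
        for field in ("active", "failed"):
            for rnd, procs in view[field].items():
                mine[field].setdefault(rnd, set()).update(procs)
    return mine


def asynchrony_detected(view, current_round):
    """Lines 10-12 of Figure 1: the view is inconsistent with every
    synchronous execution if some process appears Active in a round k
    after appearing Failed in an earlier round r < k."""
    for r in range(1, current_round):
        for k in range(r + 1, current_round + 1):
            if view["active"].get(k, set()) & view["failed"].get(r, set()):
                return True
    return False
```

The check captures the key invariant of synchronous crash-failure executions: once a process is observed as failed in some round, no later round may show a message from it.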

4.2 Proof of Correctness

In this section, we prove that the protocol presented in Section 4.1 satisfies the definition of an asynchrony detector. First, to see that the local detection condition is satisfied, notice that the contents of the Active and Failed sets at each process p can be used to construct a synchronous execution which is coherent with process p's view. In the following, we focus on the global detection property. We show that, for a fixed round r > 0, given a set of processes P ⊆ Π that receive YES from AD(2) at the end of


round r + 2, there exists an r-round synchronous execution S[1, r] such that the views of processes in P at the end of round r are consistent with S[1, r]. We begin by proving that if two processes receive YES from the asynchrony detector in round r + 2, then they must have received each other's round r + 1 messages, either directly, or through a relay. Note that, because of the round structure, a process's round r + 1 message only contains information that it has acquired up to round r. In the following, we will use a superscript notation to denote the round at which the local variables are seen. For example, Active^{r+2}_q[r + 1] denotes the set Active[r + 1] at process q, as seen from the end of round r + 2.

Lemma 1. Let p and q be two processes that receive YES from AD(2) at the end of round r + 2. Then p ∈ Active^{r+2}_q[r + 1] and q ∈ Active^{r+2}_p[r + 1].

Proof. We prove that p ∈ Active^{r+2}_q[r + 1]; the proof of the second statement is symmetric. Assume, for the sake of contradiction, that p ∉ Active^{r+2}_q[r + 1]. Then, by lines 8–9 of the process() procedure, none of the processes that send a message to q in round r + 2 received a message from p in round r + 1. However, this set of processes contains at least N − t > t elements, and therefore, in round r + 2, process p receives a message from at least one process that did not receive a message from p in round r + 1. Therefore p ∈ Active^{r+2}_p[r + 2] ∩ Failed^{r+2}_p[r + 1] (recall that p receives its own message in every round). Following the process() procedure for p, we obtain that synch_p = false in round r + 2, which means that process p receives NO from AD(2) in round r + 2, a contradiction.

Lemma 2. Let p and q be two processes in P. Then, for all rounds k < l ≤ r, Active^r_p[l] ∩ Failed^r_q[k] = ∅ and Active^r_q[l] ∩ Failed^r_p[k] = ∅, where the Active and Failed sets are seen from the end of round r.

Proof. We prove that, given r ≥ l > k, Active^r_p[l] ∩ Failed^r_q[k] = ∅.
Assume, for the sake of contradiction, that there exist rounds k < l ≤ r and a process s such that s ∈ Active^r_p[l] ∩ Failed^r_q[k]. Lemma 1 ensures that p and q communicate in round r + 1, therefore it follows that s ∈ Failed^{r+2}_p[k]. This means that s ∈ Active^{r+2}_p[l] ∩ Failed^{r+2}_p[k], for k < l, therefore p cannot receive YES in round r + 2, a contradiction.

The next lemma provides a sufficient condition for a set of processes to share a synchronous execution up to the end of some round R. The proof follows from the observation that the required synchronous execution E′ can be constructed by exactly following the contents of the Active and Failed sets of the processes at every round in the execution.

Lemma 3. Let E be an R-round execution in ES(N, t), and P be a set of processes in Π such that, at the end of round R, the following two properties are satisfied:
1. For any p and q in P, and any round r ∈ {1, 2, . . . , R − 1}, Active^R_p[r + 1] ∩ Failed^R_q[r] = ∅.
2. |⋂_{p∈P} Active^R_p[R]| ≥ N − t.
Then there exists a synchronous execution E′ which is indistinguishable from the views of processes in P at the end of round R.


Finally, we prove that if a set of processes P receive YES from AD(2) at the end of some round R + 2, then there exists a synchronous execution consistent with their views at the end of round R, for any R > 0, i.e. that AD(2) is indeed a 2-delay asynchrony detector. The proof follows from the previous results.

Lemma 4. Let R > 0 be a round and P be a set of processes that receive YES from AD(2) at the end of round R + 2. Then there exists a synchronous execution consistent with their views at the end of round R.

5 Generating Indulgent Algorithms for Colorless Tasks

5.1 Task Definition

In the following, a task is a tuple (I, O, Δ), where I is the set of vectors of input values, O is a set of vectors of output values, and Δ is a total relation from I to O. A solution to a task, given an input vector I, yields an output vector O ∈ O such that O ∈ Δ(I). Intuitively, a colorless task is a terminating task in which any process can adopt any input or output value of any other process, without violating the task specification, and in which any (decided) output value is a (proposed) input value. We also assume that the output values have to verify a predicate P, such as agreement or k-agreement. For example, in the case of consensus, the predicate P states that all output values should be equal. Let val(V) be the set of values in a vector V. We precisely define this family of tasks as follows. A colorless task satisfies the following properties: (1) Termination: every correct process eventually outputs; (2) Validity: for every O ∈ Δ(I), val(O) ⊆ val(I); (3) The Colorless property: if O ∈ Δ(I), then for every I′ with val(I′) ⊆ val(I): I′ ∈ I and Δ(I′) ⊆ Δ(I). Also, for every O′ with val(O′) ⊆ val(O): O′ ∈ O and O′ ∈ Δ(I). Finally, we assume that the outputs satisfy a generic property (4) Output Predicate: every O ∈ O satisfies a given predicate P. Consensus and k-set agreement are canonical examples of colorless tasks.

5.2 Transformation Description

We present an emulation technique that generates an indulgent protocol in ES(N, t) out of any protocol in S(N, t) solving a given colorless task T, at the cost of two communication rounds. If the system is not synchronous, the generated protocol will run a given backup protocol Backup which ensures safety, even in asynchronous executions. For example, if a protocol solves synchronous consensus in t + 1 rounds (i.e.
it is optimal), then the resulting protocol will solve consensus in t + 3 rounds if the system is initially synchronous. Otherwise, the protocol reverts to a safe backup, e.g. Paxos [5], or ASAP [6]. We fix a protocol A solving a colorless task in the synchronous model S(N, t). The running time of the synchronous protocol is known to be of R rounds. In the first phase of the transformation, each process p runs the AD(2) asynchrony detector in parallel with the protocol A, as long as the detector returns a YES indication at every round. Note that the protocol’s messages are included in the detector’s messages (or viceversa), preventing the possibility that the protocol encounters asynchronous message

48

D. Alistarh et al.

deliveries without the detector noticing. If the detector returns NO during this phase, the process stops running the synchronous protocol, and continues running only AD(2). If the process receives YES at the end of round R + 2, then it returns the decision value that A produced at the end of round R¹. On the other hand, if the process receives NO from AD(2) in round R + 2, i.e. asynchrony was detected, then the process will run the second phase of the transformation. More precisely, in phase two, the process will run a backup agreement protocol that tolerates periods of asynchrony (for example, the K4 protocol [10], if the task is k-set agreement). The main question is how to initialize the backup protocol, given that some of the processes may have already decided in phase one, without breaking the properties of the task. We solve this problem as follows. Let Supp (the support set) be the set of processes from which p receives messages in round R + 2 and which received YES from AD(2) in round R + 1. There are two cases. (1) If the set Supp is empty, then the process starts running the backup protocol using its initial proposal value. (2) If the set Supp is non-empty, then the process obtains a new proposal value as follows. It picks one process from Supp and adopts its state at round R − 1. Then, in round R, it simulates receiving the messages ⋃_{j∈Supp} msgSet_j^{R+1}[R], where we maintain the notation used in Section 4. We will show that in this case, the simulated protocol A will necessarily return a decision value at the end of simulated round R. The process p then runs the backup protocol, using as initial value the decision value resulting from the simulation of the first R rounds.

5.3 Proof of Correctness

We now prove that the resulting protocol verifies the task specification. The proofs of termination, validity, and the colorless property follow from the properties of the A and Backup protocols; we therefore concentrate on proving that the resulting protocol also satisfies the output predicate P. Theorem 1 (Output Predicate). The indulgent transformation protocol satisfies the output predicate P associated with the task T. Assume for the sake of contradiction that there exists an execution in which the output of the transformation breaks the output predicate P. If all process decisions are made at the end of round R + 2, then, by the global detection property of AD(2), there exists a synchronous execution of A in which the same outputs are decided, which breaks the predicate P, a contradiction. If all decisions occur after round R + 2, first notice that, by the validity and colorless properties, the inputs processes propose to the Backup protocol are always valid inputs for the task. It follows that, since all decisions are output by Backup, there exists an execution of the Backup protocol in which the predicate P is broken, again a contradiction. Therefore, at least one process outputs at the end of round R + 2, and some processes decide at some later round. We prove the following claim.

¹ Since AD(2) returns YES at process p at the end of round R + 2, it follows that it must have returned YES at p at the end of round R as well. The local detection property of the asynchrony detector then implies that the protocol A has to return a decision value, since its execution is synchronous.

Generating Fast Indulgent Algorithms

49

Claim. If a process decides at the end of round R + 2, then (i) all correct processes will have a non-empty support set Supp and (ii) there exists an R-round synchronous execution consistent with the views that all correct processes adopt at the end of round R + 2. Proof (Sketch). First, let d be a process that decides at the end of round R + 2. Then, in round R + 2, process d received a message from at least N − t processes that got YES from AD(2) at the end of round R + 1. Since N ≥ 2t + 1, it follows that every process that has not crashed by the end of round R + 2 will have received at least one message from a process that has received YES from AD(2) in round R + 1; therefore, all non-crashed processes that get NO from AD(2) in round R + 2 will execute case 2, which ensures the first claim. Let Q = {q1, . . . , qℓ} be the non-crashed processes at the end of round R + 2. By the above claim, we know that these processes either decide or simulate an execution. We prove that all views simulated in this round are consistent with a synchronous execution up to the end of round R, in the sense of Lemma 3. To prove that the intersection of their simulated views in round R contains at least (N − t) messages, notice that the processes from which process d receives messages in round R + 2 are necessarily in this intersection, since otherwise process d would receive NO in round R + 2. To prove the first condition of Lemma 3, note that process d's view of round R, i.e. msgSet_d^{R+2}[R], contains all messages simulated as received in round R by the processes that receive NO in round R + 2. Since N − t > t, every process that receives NO in round R + 2 from the detector also receives a message supporting d's decision in round R + 2; process d receives the same message and does not notice any asynchrony.
Therefore, we can apply Lemma 3 to obtain that there exists a synchronous execution of the protocol A in which the processes in Q obtain the same decision values as the values obtained through the simulation or decision at the end of round R + 2. Returning to the proof of the output predicate, recall that there exists at least one process d which outputs at the end of round R + 2. From the above Claim, it follows that all non-crashed processes simulate synchronous views of the first R rounds. Therefore all non-crashed processes will receive an output from the synchronous protocol A. Moreover, these synchronous views of the processes are consistent with a synchronous execution, therefore the set of outputs received by non-crashed processes verifies the predicate P. Hence all the inputs that the processes propose to the Backup protocol verify the predicate P. Since Backup respects validity, it follows that the outputs of Backup will also verify the predicate P.
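To make the task conditions of Section 5.1 concrete, here is a minimal, runnable Python sketch (the function name and the list-based encoding of vectors are ours, not the paper's) checking Validity and the output predicate P for k-set agreement; consensus is the special case k = 1.

```python
# Hypothetical helper illustrating the colorless-task conditions of Section 5.1
# for k-set agreement: decided values must come from the proposed values
# (Validity), and the predicate P allows at most k distinct decisions.

def valid_k_set_output(inputs, outputs, k):
    """Return True iff val(O) ⊆ val(I) and |val(O)| ≤ k."""
    val_in, val_out = set(inputs), set(outputs)
    return val_out <= val_in and len(val_out) <= k

print(valid_k_set_output([1, 2, 3], [1, 2], k=2))     # True: both decisions were proposed
print(valid_k_set_output([1, 2, 3], [1, 2, 3], k=2))  # False: three distinct decisions
print(valid_k_set_output([1, 2], [3], k=1))           # False: 3 was never proposed
```

Note that the check respects the colorless closure property: any O′ with val(O′) ⊆ val(O) for an accepted O is also accepted.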

6 A Protocol for Strong Indulgent Renaming

6.1 Protocol Description

In this section, we present an emulation technique that transforms any synchronous renaming protocol into an indulgent renaming protocol. For simplicity, we will assume that the synchronous renaming protocol is the one by Herlihy et al. [8], which is time-optimal, terminating in log N + 1 synchronous rounds. The resulting indulgent protocol will rename into N names using log N + 3 rounds of communication if the system is initially synchronous, and will eventually rename into N + t names if the system is


asynchronous, by safely reverting to a backup, namely the asynchronous renaming algorithm of Attiya et al. [7]. Again, the protocol is structured into two phases. First Phase. During the first log N + 1 rounds, processes run the AD(2) asynchrony detector in parallel with the synchronous renaming algorithm. Note that the protocol's messages are included in the detector's messages. If the detector returns NO at one of these rounds, then the process stops running the synchronous algorithm, and continues only with the detector. If at the end of round ⌈log N⌉ + 1, the process receives YES from AD(2), then it also receives a name namei as the decision value of the synchronous protocol. Second Phase. At the end of round ⌈log N⌉ + 1, the processes start the asynchronous renaming algorithm of [7]. More precisely, each process builds a vector V with a single entry, which contains the tuple ⟨vi, namei, Ji, bi, ri⟩, where vi is the process's initial value. The entry namei is the proposed name, which is either the name returned by the synchronous renaming algorithm, if the process received YES from the detector, or ⊥, otherwise. The entry Ji counts the number of times the process proposed a name: it is 1 if the process has received YES from the detector, and 0 otherwise; bi is the decision bit, which is initially 0. Finally, ri is the round number² when the entry was last updated, which is in this case log N + 1. The processes broadcast their vectors V for the next two rounds, while continuing to run the asynchrony detector in parallel. The contents of the vector V are updated at every round, as follows: if a vector containing new entries is received, the process adds all the new entries to its vector; if there are conflicting entries corresponding to the same process, the tie is broken using the round number ri. If, at the end of round log N + 3, the process receives YES from the detector, then it decides on namei.
Otherwise, it continues running the AttiyaRenaming algorithm until decision is possible.

6.2 Proof of Correctness

The first step in the proof of correctness of the transformation provides some properties of the asynchronous renaming algorithm of [7]. More precisely, the first lemma states that the asynchronous renaming algorithm remains correct even though processes propose names initially, that is, at the beginning of round log N + 2. The proof follows from an examination of the protocol and proofs from [7]. Lemma 5. The asynchronous renaming protocol of [7] ensures termination, name uniqueness, and a name space bound of N + t, even if processes propose names at the beginning of the first round. The previous lemma ensures that the transformation guarantees termination, i.e. that every correct process eventually returns a name. The non-triviality property of the asynchrony detector ensures that the resulting algorithm will terminate in log N + 3 rounds in any synchronous run. In the following, we will concentrate on the uniqueness of the names and on the bounds on the resulting namespace. We start by proving that the protocol does not generate duplicate names.

² This entry in the vector is implicit in the original version of the algorithm [7].


Lemma 6 (Uniqueness). Given any two names ni, nj returned by distinct processes in an execution, we have that ni ≠ nj. Proof (Sketch). Assume for the sake of contradiction that there exists a run in which two processes pi, pj decide on the same name n0. First, we consider the case in which both decisions occurred at round log N + 3, the first round at which a process can decide using our emulation. Notice that, if a decision is made, the processes necessarily decide on the decision value of the simulated synchronous protocol³. By the global detection property of AD(2), it then follows that there exists a synchronous execution of the synchronous renaming protocol in which two distinct processes return the same value, contradicting the correctness of the protocol. Similarly, we can show that if both decisions occur after round log N + 3, we can reduce the correctness of the transformation to the correctness of the asynchronous protocol. Therefore, the remaining case is that in which one of the decisions occurs at round log N + 3, and the other decision occurs at a later round, i.e. it is a decision made by the asynchronous renaming protocol. In this case, let pi be the process that decides on n0 at the end of round log N + 3. This implies that process pi received YES at the end of round log N + 3 from AD(2). Therefore, since pi sees a synchronous view, there exists a set S of at least N − t processes that received pi's message reserving name n0 in round log N + 2. It then follows that each non-crashed process receives a message from a process in the set S in round log N + 3. By the structure of the protocol, we obtain that each process has the entry ⟨vi, n0, 1, 0, log N + 1⟩ in its V vector at the end of round log N + 3. It follows from the structure of the asynchronous protocol that no process other than pi will ever decide on the name n0 at any later round, which concludes the proof of the Lemma.
Finally, we prove that the transformation ensures the following guarantees on the size of the namespace. Lemma 7 (Namespace Size). The transformation ensures the following properties: (1) In synchronous executions, the resulting algorithm will rename in a namespace of at most N names. (2) In any execution, the resulting algorithm will rename in a namespace of at most N + t names. Proof (Sketch). For the proof of the first property, notice that, in a synchronous execution, any output combination for the transformation is an output combination for the synchronous renaming protocol. For the second property, let ℓ ≥ 0 be the number of names decided on at the end of round log N + 3 in a run of the protocol. These names are clearly between 1 and N. Lemma 6 guarantees that none of these names is decided on in the rest of the execution. On the other hand, Lemma 5 and the namespace bound of N + t for the asynchronous protocol ensure that the asynchronous protocol decides exclusively on names between 1 and N + t, which concludes the proof of the claim.
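The vector-update rule of the second phase (Section 6.1) can be sketched as follows; the dictionary keyed by process id and the function name are our own illustrative choices, but the tie-break on the round number ri follows the description above.

```python
# Sketch of the vector V merge: entries are tuples (v_i, name_i, J_i, b_i, r_i),
# and a conflict between two entries for the same process is resolved in favor
# of the entry with the higher round number r_i.

def merge_vectors(V, received):
    """Merge a received vector into V in place, breaking ties by r_i."""
    for proc, entry in received.items():
        if proc not in V or entry[4] > V[proc][4]:  # entry[4] is r_i
            V[proc] = entry
    return V

V = {"p1": (1, 5, 1, 0, 4)}               # p1 proposed name 5, entry from round 4
incoming = {"p1": (1, 7, 2, 0, 6),        # newer entry for p1 (round 6)
            "p2": (2, None, 0, 0, 4)}     # p2 has no proposal yet (⊥ as None)
merge_vectors(V, incoming)
print(V["p1"][1])  # 7: the round-6 entry wins the tie-break
```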

7 Conclusions and Future Work

In this paper, we have introduced a general transformation technique from synchronous algorithms to indulgent algorithms, and applied it to obtain indulgent solutions for a

³ A simple analysis of the asynchronous renaming protocol shows that a process cannot decide after two rounds of communication, unless it had already proposed a value at the beginning of the first round.


large class of distributed tasks, including consensus, set agreement and renaming. Our results suggest that, even though it is generally hard to design asynchronous algorithms in fault-prone systems, one can obtain efficient algorithms that tolerate asynchronous executions starting from synchronous algorithms. In terms of future work, we first envision generalizing our technique to generate algorithms that also work in a window of synchrony, and investigating its limitations in terms of time and communication complexity. Another interesting research direction would be to analyze whether similar techniques exist in the case of Byzantine failures: in particular, whether, starting from a synchronous fault-tolerant algorithm, one can obtain a Byzantine fault-tolerant algorithm that tolerates asynchronous executions. Acknowledgements. The authors would like to thank Prof. Hagit Attiya and Nikola Knežević for their help on previous drafts of this paper, and the anonymous reviewers for their useful feedback.

References
1. Guerraoui, R.: Indulgent algorithms. In: PODC 2000, pp. 289–297. ACM, New York (July 2000)
2. Dwork, C., Lynch, N.A., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. ACM 35(2), 288–323 (1988)
3. Dutta, P., Guerraoui, R.: The inherent price of indulgence. In: PODC 2002: Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, pp. 88–97 (2002)
4. Lamport, L.: Fast Paxos. Distributed Computing 19(2), 79–103 (2006)
5. Lamport, L.: Generalized consensus and Paxos. Microsoft Research Technical Report MSR-TR-2005-33 (March 2005)
6. Alistarh, D., Gilbert, S., Guerraoui, R., Travers, C.: How to solve consensus in the smallest window of synchrony. In: Taubenfeld, G. (ed.) DISC 2008. LNCS, vol. 5218, pp. 32–46. Springer, Heidelberg (2008)
7. Attiya, H., Bar-Noy, A., Dolev, D., Peleg, D., Reischuk, R.: Renaming in an asynchronous environment. Journal of the ACM 37(3), 524–548 (1990)
8. Chaudhuri, S., Herlihy, M., Tuttle, M.R.: Wait-free implementations in message-passing systems. Theor. Comput. Sci. 220(1), 211–245 (1999)
9. Dutta, P., Guerraoui, R.: The inherent price of indulgence. Distributed Computing 18(1), 85–98 (2005)
10. Alistarh, D., Gilbert, S., Guerraoui, R., Travers, C.: Of choices, failures and asynchrony: The many faces of set agreement. In: Dong, Y., Du, D.-Z., Ibarra, O. (eds.) ISAAC 2009. LNCS, vol. 5878. Springer, Heidelberg (2009)
11. Chandra, T.D., Toueg, S.: Unreliable failure detectors for asynchronous systems (preliminary version). In: ACM Symposium on Principles of Distributed Computing, pp. 325–340 (August 1991)
12. Gafni, E.: Round-by-round fault detectors (extended abstract): Unifying synchrony and asynchrony. In: Proceedings of the 17th Symposium on Principles of Distributed Computing (1998)
13. Dutta, P., Guerraoui, R., Keidar, I.: The overhead of consensus failure recovery. Distributed Computing 19(5-6), 373–386 (2007)
14. Delporte-Gallet, C., Fauconnier, H., Guerraoui, R., Tielmann, A.: The disagreement power of an adversary. In: Keidar, I. (ed.) DISC 2009. LNCS, vol. 5805, pp. 8–21. Springer, Heidelberg (2009)

An Efficient Decentralized Algorithm for the Distributed Trigger Counting Problem

Venkatesan T. Chakaravarthy¹, Anamitra R. Choudhury¹, Vijay K. Garg², and Yogish Sabharwal¹

¹ IBM Research - India, New Delhi
{vechakra,anamchou,ysabharwal}@in.ibm.com
² University of Texas at Austin
[email protected]

Abstract. Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-specified input. The problem has applications in monitoring, global snapshots, synchronizers and other distributed settings. The main result of the paper is a decentralized and randomized algorithm with expected message complexity O(n log n log w). Moreover, every processor in this algorithm receives no more than O(log n log w) messages with high probability. All the earlier algorithms for this problem have maximum message load of Ω(n log w).

1 Introduction

In this paper, we study the distributed trigger counting (DTC) problem. Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-specified input. We note that w may be much larger than n. The sequence of processors receiving the w triggers is not known a priori to the system. Moreover, the number of triggers received by each processor is also not known. We are interested in designing distributed algorithms for the DTC problem that are communication efficient and are also decentralized. The DTC problem arises in applications such as distributed monitoring and global snapshots. Monitoring is an important issue in networked systems such as sensor networks and data networks. Sensor networks are typically employed to monitor physical or environmental conditions such as traffic volume, wildlife behavior, troop movements and atmospheric conditions, among others. For example, in traffic management, one may be interested in raising an alarm when the number of vehicles on a highway exceeds a certain threshold. Similarly, one may wish to monitor a wildlife region for the sightings of a particular species, and raise an alert when the number crosses a threshold. In the case of data networks,

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 53–64, 2011.
© Springer-Verlag Berlin Heidelberg 2011

54

V.T. Chakaravarthy et al.

example applications are monitoring the volume of traffic or the number of remote logins. See, for example, [7] for a discussion of applications of distributed monitoring. In the context of global snapshots (for example, checkpointing), a distributed system must record all the in-transit messages in order to declare the snapshot to be valid. Garg et al. [4] showed that the problem of determining whether all the in-transit messages have been received can be reduced to the DTC problem (they call this the distributed message counting problem). In the context of synchronizers [1], a distributed system is required to generate the next pulse when all the messages generated in the current pulse have been delivered. Any message in the current pulse can be viewed as a trigger of the DTC problem. Our goal is to design a distributed algorithm for the DTC problem that is communication efficient and decentralized. We use the following two natural parameters that measure these two important aspects.
– The message complexity, i.e., the number of messages exchanged between the processors.
– The MaxRcvLoad, i.e., the maximum number of messages received by any processor in the system.
Garg et al. [4] studied the DTC problem for a general distributed system. They presented two algorithms: a centralized algorithm and a tree-based algorithm. The centralized algorithm has message complexity O(n log w). However, the MaxRcvLoad of this algorithm can be as high as Ω(n log w). The tree-based algorithm has message complexity O(n log n log w). This algorithm is more decentralized in a heuristic sense, but its MaxRcvLoad can be as high as O(n log n log w) in the worst case. They also proved a lower bound on the message complexity: any deterministic algorithm for the DTC problem must have message complexity Ω(n log(w/n)). So, the message complexity of the centralized algorithm is asymptotically optimal. However, this algorithm has MaxRcvLoad as high as the message complexity.
In this paper, we consider a general distributed system where any processor can communicate with any other processor and all the processors are capable of performing basic computations. We assume an asynchronous model of computation and messages. We assume that the messages are guaranteed to be delivered, but there is no fixed upper bound on the message arrival time. Also, messages are not corrupted or spuriously introduced. This setting is common in data networks. We also assume that there are no faults in the processors and that the processors do not fail. Our main result is a decentralized randomized algorithm called LayeredRand that is efficient in terms of both the message complexity and MaxRcvLoad. Its message complexity is O(n log n log w). Moreover, with high probability, its MaxRcvLoad is O(log n log w). The message complexity of our algorithm is the same as that of the tree-based algorithm of Garg et al. [4]. However, the MaxRcvLoad of our algorithm is significantly better than both their tree-based and centralized algorithms. It is important to minimize MaxRcvLoad for many applications. For example, in sensor networks where the message processing may

An Efficient Decentralized Algorithm for the DTC Problem

55

Algorithm        Message Complexity   MaxRcvLoad
Tree-based [4]   O(n log n log w)     O(n log n log w)
Centralized [4]  O(n log w)           O(n log w)
LayeredRand      O(n log n log w)     O(log n log w)

Fig. 1. Summary of DTC Algorithms

consume the limited power available at the node, a high MaxRcvLoad may reduce the lifetime of a node. Another important aspect of our algorithm is its simplicity. In particular, our algorithm is much simpler than both the algorithms of Garg et al. A comparison of our algorithm with the earlier results is summarized in Fig. 1. Designing an algorithm with message complexity O(n log w) and MaxRcvLoad O(log w) remains a challenging open problem. Our main result is formally stated next. For 1 ≤ i ≤ w, the external source delivers the ith trigger to some processor xi. We call the sequence x1, x2, . . . , xw a trigger pattern. Theorem 1. Fix any trigger pattern. The message complexity of the LayeredRand algorithm is O(n log n log w). Furthermore, there exist constants c and d ≥ 1 such that Pr[MaxRcvLoad ≥ c log n log w] ≤ 1/n^d. The above bounds hold for any trigger pattern, even if fixed by an adversary. Related work. Most prior work (e.g. [3,7,6]) primarily considers the DTC problem in a centralized setting where one of the processors acts as a master and coordinates the system, and the other processors act as slaves. The slaves can communicate only with the master (they cannot communicate among themselves). Such a scenario applies where a communication network linking the slaves does not exist or the slaves have only limited computational power. Prior work addresses various issues arising in such a setup, such as message complexity. These works also consider variations and generalizations of the DTC problem. One such variation is approximate threshold computation, where the system need not raise an alert on seeing exactly w triggers; it suffices if the alert is raised upon seeing at most (1 + ε)w triggers, where ε is some user-specified tolerance parameter. Prior work also considers aggregate functions more general than counting. Here, each input trigger i is associated with a value αi. The goal is to raise an alert when some aggregate of these values crosses the threshold (an example aggregate function is the sum). Note that the Echo or Wave algorithms [2,9,10] and the framework of repeated global computation [5] are not easily applicable to the DTC problem because the triggers arrive at processors asynchronously at unknown times. Computing the sum of all the trigger counts just once is not enough, and repeated computation results in an excessive number of messages.

2 A Deterministic Algorithm

For the DTC problem, Garg et al. [4] presented an algorithm with message complexity O(n log w). In this section, we describe a simple alternative deterministic algorithm having the same message complexity. The aim of presenting this algorithm is to highlight the difficulties in designing an algorithm that simultaneously achieves good message complexity and MaxRcvLoad bounds. A naive algorithm for the DTC problem works as follows. One of the processors acts as a master and every processor sends a message to the master upon receiving each trigger. The master keeps count of the total number of triggers received. When the count reaches w, the user is informed and the protocol ends. The disadvantage of this algorithm is that its message complexity is O(w). A natural idea is to avoid sending a message to the master for every trigger received. Instead, a processor will send one message for every B triggers received. Clearly, setting B to a high value will reduce the number of messages. However, care should be taken to ensure that the system does not enter a dead state. For instance, suppose we set B = w/2. Then, the adversary can send w/4 triggers to each of four selected processors. Notice that none of these processors would send a message to the master. Thus, even though all the w triggers have been delivered by the adversary, the system will not detect the termination. We say that the system is in a dead state. Our deterministic algorithm with message complexity O(n log w) is described next. A predetermined processor serves as the master. The algorithm works in multiple rounds. We start by setting two parameters: ŵ = w and B = ŵ/(2n). Each processor sends a message to the master for every B triggers received. The master keeps count of the triggers reported by other processors and the triggers received by itself. When the count reaches ŵ/2, it declares end-of-round and sends a message to all the processors to this effect.
In return, each processor sends the number of unreported triggers to the master (namely, the triggers not reported to the master). This way, the master can compute w′, the total number of triggers received so far in the system. It recomputes ŵ = ŵ − w′; the new ŵ is the number of triggers yet to be received. The master recomputes B = ŵ/(2n) and sends this number to every processor. The next round starts. When ŵ < 2n, we set B = 1. We now argue that the system never enters a dead state. Consider the state of the system in the middle of any round. Each processor has fewer than ŵ/(2n) unreported triggers. Thus, the total number of unreported triggers is less than ŵ/2. The master's count of reported triggers is less than ŵ/2. Thus, the total number of triggers delivered so far is less than ŵ. So, some more triggers are yet to be delivered. It follows that the system is never in a dead state and the system will correctly terminate upon receiving all the w triggers. Notice that in each round, ŵ decreases by at least a factor of 2. So, the algorithm terminates after log w rounds. Consider any single round. A message is sent to the master for every B triggers received, and the round completes when the master's count reaches ŵ/2. Thus, the number of messages sent to the master is ŵ/(2B) = n. At the end of each round, O(n) messages are exchanged between the master and the other processors. Thus, the number of

[Figure: processors arranged in layers 0 (the root), 1, 2 and 3 (the leaves), with 2^ℓ processors in layer ℓ; a directed edge from a processor u to a processor v indicates that u may send a coin to v, i.e., to any processor in the layer above.]

Fig. 2. Illustration for LayeredRand

messages per round is O(n). The total number of messages exchanged during all the rounds is O(n log w). The above algorithm is efficient in terms of message complexity. However, the master may receive up to O(n log w) messages and so, the MaxRcvLoad of the algorithm is O(n log w). In the next section, we present an efficient randomized algorithm which simultaneously achieves provably good message complexity and MaxRcvLoad bounds.
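As a sanity check on the analysis above, the following toy Python simulation (sequential trigger delivery, all names ours) implements the batched-reporting rounds and confirms that ŵ at least halves each round, so at most about log w rounds are executed; the message accounting is a rough over-estimate of the end-of-round exchanges, not an exact protocol trace.

```python
# Toy simulation of the deterministic algorithm of Section 2 (assumed details:
# a single master, and end-of-round fires when reported >= ceil(ŵ/2)).
import math
import random

def run_deterministic(n, w, pattern):
    """Deliver triggers one at a time; return (message count, round count)."""
    w_hat = w                          # ŵ: triggers still outstanding
    B = max(w_hat // (2 * n), 1)       # reporting batch size
    unreported = [0] * n
    reported = 0                       # master's count for the current round
    msgs = rounds = 0
    for x in pattern:
        unreported[x] += 1
        if unreported[x] >= B:         # report a full batch to the master
            reported += unreported[x]
            unreported[x] = 0
            msgs += 1
        if reported >= (w_hat + 1) // 2:      # master reached ŵ/2: end-of-round
            rounds += 1
            msgs += 2 * n              # poll unreported counts, broadcast new B
            w_round = reported + sum(unreported)  # triggers seen this round
            w_hat -= w_round
            reported, unreported = 0, [0] * n
            B = max(w_hat // (2 * n), 1)
            if w_hat == 0:
                return msgs, rounds    # all w triggers accounted for
    raise AssertionError("dead state: all triggers delivered, no alert")

random.seed(1)
n, w = 7, 1000
msgs, rounds = run_deterministic(n, w, [random.randrange(n) for _ in range(w)])
```

The never-dead argument above guarantees that the function returns inside the loop: an unfinished round accounts for strictly fewer than ŵ triggers.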

3 LayeredRand Algorithm

In this section, we present a randomized algorithm called LayeredRand. Its message complexity is O(n log n log w) and, with high probability, its MaxRcvLoad is O(log n log w). For ease of exposition, we first describe our algorithm under the assumption that the triggers are delivered one at a time; meaning, all the processing required for handling a trigger is completed before the next trigger arrives. This assumption allows us to better explain the core ideas of the algorithm. We will discuss how to handle the concurrency issues in Sect. 5. For the sake of simplicity, we assume that n = 2^L − 1, for some integer L. The n processors are arranged in L layers numbered 0 through L − 1. For 0 ≤ ℓ < L, layer ℓ consists of 2^ℓ processors. Layer 0 consists of a single processor, which we refer to as the root. Layer L − 1 is called the leaf layer. The layering is illustrated in Fig. 2, for n = 15. Only processors occupying adjacent layers will communicate with each other. The algorithm proceeds in multiple rounds. At the beginning of each round, the system needs to know how many triggers are yet to be received. This can be computed by keeping track of the total number of triggers received in all the previous rounds and subtracting this quantity from w. Let the term initial value of a round mean the number of triggers yet to be received at the beginning of the round. We use a variable ŵ to store the initial value of the current round. In the first round, we set ŵ = w, since all the w triggers are yet to be received.


We next describe the procedure followed in a single round. Let w ˆ denote the initial value of this round. For each 1 ≤ < L, we compute a threshold τ () for the layer : w ˆ τ () = . 4 · 2 · log(n + 1) Each processor x maintains a counter C(x), which is used to keep track of some of the triggers received by x and other processors occupying the layers below of that of x. The exact semantics C(x) will become clear shortly. The counter is reset to zero in the beginning of the round. Consider any non-root processor x occupying a level . Whenever x receives a trigger, it will increment C(x) by one. If C(x) reaches the threshold τ (), x chooses a processor y occupying level − 1 uniformly at random and sends a message to y. We refer to such a message as a coin. Upon receiving the coin, the processor y updates C(y) by adding τ () to C(y). Intuitively, receipt of a coin by y means that y has evidence that some processors below the layer − 1 have received τ ( − 1) triggers. After the update, if C(y) ≥ τ ( − 1), y will pick a processor z occupying level − 2 uniformly at random and send a coin to z. Then, processor y updates C(y) = C(y) − τ ( − 1). Processor z handles the coin similarly. See Fig. 2. A directed edge from a processor u to a processor v means that u may send a coin to v. Thus, a processor may send a coin to any processor in the layer above. This is illustrated for the top three layers in the ﬁgure. We now formally describe the behavior of a non-root processor x occupying a level . Whenever x receives a trigger from the external source or a coin from level + 1, it behaves as follows: – If a trigger is received, increment C(x) by one. – If a coin is received from level + 1, update C(x) = C(x) + τ ( + 1). – If C(x) ≥ τ (), • Among the 2−1 processors occupying level − 1, pick a processor y uniformly at random and send a coin to y. • Update C(x) = C(x) − τ (). The behavior of the root is similar to that of the other processors, except that it does not send coins. 
The root processor r also maintains a counter C(r). Whenever it receives a trigger from the external source, it increments C(r) by one. If it receives a coin from level 1, it updates C(r) = C(r) + τ(1). An important observation is that, at any point of time, any trigger received by the system in the current round is accounted for in the counter C(x) of exactly one processor x. This means that the sum of C(x) over all the processors gives us the exact count of the triggers received by the system so far in this round. This observation will be useful in proving the correctness of the algorithm. The crucial activity of the root is to initiate an end-of-round procedure. When C(r) reaches ŵ/2 (i.e., when C(r) ≥ ŵ/2), the root declares end-of-round. Now, the root needs to get a count of the total number of triggers received by all the processors in this round. Let this count be w′. The processors are arranged in a pre-determined binary tree formation such that each processor x

An Eﬃcient Decentralized Algorithm for the DTC Problem


has exactly one parent from the layer above and exactly two children from the layer below. The end-of-round notification can be broadcast to all the processors in a recursive top-down manner. Similarly, the sum of C(x) over all the processors can be reduced at the root in a recursive bottom-up manner. Thus, the root obtains the value w′, i.e., the total number of triggers received by the system in this round. The root then updates the initial value for the next round by computing ŵ = ŵ − w′, and broadcasts this to all the processors, again in a recursive fashion. All the processors then update their τ(ℓ) values for the new round. This marks the start of the next round. Notice that in the end-of-round process, each processor receives at most a constant number of messages. At the end of any round, if the newly computed ŵ is zero, we know that all the w triggers have been received. So, the root can raise an alert to the user and the algorithm terminates. It is easy to derive a bound on the number of rounds taken by the algorithm. Observe that in successive rounds the initial value drops by a factor of at least two (that is, the ŵ of round i + 1 is at most half the ŵ of round i). Thus, the algorithm takes at most log w rounds.
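The round bound can be illustrated with a short calculation: the root ends a round only once C(r) ≥ ŵ/2, so at least ⌈ŵ/2⌉ of the outstanding triggers are consumed per round. A minimal sketch (function name ours):

```python
import math

def max_rounds(w):
    """Worst-case number of rounds for an initial count w, assuming each
    round consumes at least ceil(w_hat / 2) triggers, as happens once the
    root declares end-of-round at C(r) >= w_hat / 2."""
    w_hat, rounds = w, 0
    while w_hat > 0:
        w_hat -= math.ceil(w_hat / 2)   # w_hat is at least halved
        rounds += 1
    return rounds
```

For example, `max_rounds(10**6)` returns 20, i.e., ⌊log₂ w⌋ + 1, matching the "at most log w rounds" bound up to rounding.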

4 Analysis of the LayeredRand Algorithm

Here, we prove the correctness of the algorithm and then prove message bounds.

4.1 Correctness of the Algorithm

We now show that the system will correctly raise an alert to the user when all the w triggers are received. The main part of the proof involves showing that, once all the triggers have been delivered, the root always enters the end-of-round procedure, i.e., the system does not get stalled in the middle of the round. We denote the set of all processors by P. Consider any round and let ŵ be the initial value of the round. Let x be any non-root processor and let ℓ be the layer in which x is found. Notice that at any point of time, we have C(x) ≤ τ(ℓ) − 1. Thus, we can derive a bound on the sum of C(x):

Σ_{x∈P−{r}} C(x) ≤ Σ_{ℓ=1}^{L−1} 2^ℓ · (τ(ℓ) − 1) ≤ (L − 1)ŵ / (4 · log(n + 1)) ≤ ŵ/4.

Now suppose that all the outstanding ŵ triggers have been delivered to the system in this round. We already saw that at any point of time, Σ_{x∈P} C(x) gives the number of triggers received by the system so far in the current round.¹ Thus, Σ_{x∈P} C(x) = ŵ. It follows that the counter at the root satisfies C(r) ≥ 3ŵ/4 ≥ ŵ/2. But this means that the root would initiate the end-of-round procedure. We conclude that the system will not enter a dead state.

¹ We note that C(r) is an integer, and hence this holds even when ŵ = 1.


The above argument shows that the system always makes progress by moving into the next round. As we observed earlier, the initial value ŵ drops by a factor of at least two in each round. So, eventually, ŵ must become zero and the system will raise an alert to the user.

4.2 Bound on the Message Complexity

Lemma 1. The message complexity of the algorithm is O(n log n log w).

Proof: As argued before, the algorithm takes only O(log w) rounds to terminate. Consider any round and let ŵ be the initial value of the round. Consider any layer 1 ≤ ℓ < L. Every coin sent from layer ℓ to layer ℓ − 1 means that at least τ(ℓ) triggers have been received by the system in this round. Thus, the number of coins sent from layer ℓ to layer ℓ − 1 can be at most ŵ/τ(ℓ). Summing up over all the layers, we get a bound on the total number of coins (messages) sent in this round:

Number of coins sent ≤ Σ_{ℓ=1}^{L−1} ŵ/τ(ℓ) = Σ_{ℓ=1}^{L−1} 4 · 2^ℓ · log(n + 1) ≤ 4 · (n − 1) · log(n + 1).

The end-of-round procedure involves only O(n) messages in any particular round. Summing up over all O(log w) rounds, we see that the message complexity of the algorithm is O(n log n log w).

4.3 Bound on the MaxRcvLoad

In this section, we show that with high probability, the MaxRcvLoad is bounded by O(log n log w). We use the following Chernoff bound (see [8]) for this purpose.

Theorem 2 (see [8], Theorem 4.4). Let X be the sum of a finite number of independent 0–1 random variables. Let the expectation of X be μ = E[X]. Then, for any r ≥ 6, Pr[X ≥ rμ] ≤ 2^{−rμ}. Moreover, for any μ′ ≥ μ, the inequality remains true if we replace μ by μ′ on both sides.

Lemma 2. Pr[MaxRcvLoad ≥ c log n log w] ≤ n^{−47}, for some constant c.

Proof: Let us first consider the number of coins received by any processor. Processors in the leaf layer do not receive any coins, so it suffices to consider the processors occupying the other layers. Consider any layer 0 ≤ ℓ ≤ L − 2 and let x be any processor found in layer ℓ. Let Mx be the random variable denoting the number of coins received by x. As discussed before, the algorithm takes at most log w rounds. In any given round, the number of coins received by layer ℓ is at most ŵ/τ(ℓ + 1) = 4 · 2^{ℓ+1} · log(n + 1). Thus, the total number of coins received by layer ℓ is at most 4 · 2^{ℓ+1} · log(n + 1) · log w. Each of these coins is sent uniformly and independently at random to one of the 2^ℓ processors occupying layer ℓ. Thus, the expected number of coins received by x is

E[Mx] ≤ 4 · 2^{ℓ+1} · log(n + 1) · log w / 2^ℓ = 8 log(n + 1) log w.


The random variable Mx is a sum of independent 0–1 random variables. Applying the Chernoff bound given by Theorem 2 (taking r = 6), we see that

Pr[Mx ≥ 48 log(n + 1) log w] ≤ 2^{−48 log(n+1) log w} < n^{−48}.

Applying the union bound, we see that

Pr[there exists a processor x having Mx ≥ 48 log(n + 1) log w] < n^{−47}.

During the end-of-round process, a processor receives at most a constant number of messages in any round. So, the total number of such messages received by any processor is O(log w).

5 Handling Concurrency

In this section, we discuss how to handle concurrency issues. All triggers and coin messages received by a processor can be placed into a queue and processed one at a time. Thus, there is no concurrency issue related to triggers and coins received within a round. However, concurrency issues need to be handled during an end-of-round. Towards this goal, we slightly modify the LayeredRand algorithm. The core functioning of the algorithm remains the same as before; we mainly modify the end-of-round procedure by adding some additional features (such as counters and queues). The rest of this section explains these features and the end-of-round procedure in detail. We also prove correctness of the algorithm in the presence of concurrency.

5.1 Processing Triggers and Coins

Each processor x maintains two FIFO queues: a default queue and a priority queue. All triggers and coin messages received by a processor are placed in the default queue. The priority queue contains only the messages related to the end-of-round procedure, which are handled on a priority basis. In the main event-handling loop, a processor repeatedly checks for messages in its queues. It first examines the priority queue and handles the first message in that queue, if any. If there is no message there, it examines the default queue and handles the first message in that queue (if any). Every processor also maintains a counter D(x) that keeps a count of the triggers directly received and processed by x since the beginning of the algorithm. Triggers received by x that are still in the default queue (not yet processed) are not accounted for in D(x). The counter D(x) is incremented every time the processor processes a trigger from the default queue. This counter is never reset. It is maintained in addition to the counter C(x) (which is reset at the beginning of each round). Every processor x maintains another variable, RoundNum, that indicates the current round number for this processor. Whenever x sends a coin to some other processor, it includes its RoundNum in the message. The processing of triggers and coins is done as before (as in Sect. 3).
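The two-queue event handling can be sketched as follows. The class and message names are ours, and the end-of-round protocol is collapsed into two control messages for illustration:

```python
from collections import deque

class ProcessorSketch:
    """Sketch of the two-queue message handling of Sect. 5 (names are ours).
    'RoundReset' suspends the default queue; 'Inform' starts the next round
    and resumes it."""
    def __init__(self):
        self.default_q = deque()     # triggers and coins
        self.priority_q = deque()    # end-of-round messages, served first
        self.D = 0                   # triggers processed since the start (never reset)
        self.C = 0.0                 # per-round counter C(x)
        self.suspended = False
        self.round_num = 0

    def step(self):
        """Handle one message; returns False if nothing could be processed."""
        if self.priority_q:
            msg = self.priority_q.popleft()
            if msg == "RoundReset":
                self.suspended = True        # D(x) is frozen during end-of-round
            elif msg == "Inform":
                self.round_num += 1          # enter the next round
                self.C = 0.0
                self.suspended = False       # resume the default queue
            return True
        if self.default_q and not self.suspended:
            kind, payload = self.default_q.popleft()
            if kind == "trigger":
                self.D += 1
                self.C += 1
            elif kind == "coin" and payload["round"] == self.round_num:
                self.C += payload["value"]   # coins from old rounds are discarded
            return True
        return False
```

The key design point mirrored here is that D(x) can only change while the default queue is active, so once RoundReset has propagated, the sum of D(x) read by the Reduce phase is stable.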

5.2 End-of-Round Procedure

Here, we describe the end-of-round procedure in detail, highlighting the modifications. The procedure consists of four phases. The processors are arranged in the form of a binary tree as before. In the first phase, the root processor broadcasts a RoundReset message down the tree to all nodes, requesting them to send their D(x) counts. In the second phase, these counts are reduced at the root using Reduce messages; the root computes the sum of D(x) over all the processors. Note that, unlike the algorithm described in Sect. 3, here the root computes the sum of the D(x) counters, rather than the sum of the C(x) counters. We shall see that this is useful in proving correctness. Using the sum of the D(x) counters, the root computes the initial value ŵ for the next round. In the third phase, the root broadcasts this value ŵ to all nodes using Inform messages. In the fourth phase, each processor sends an acknowledgement InformAck back to the root and enters the next round. We now describe the four phases in detail.

First Phase: In this phase, the root processor initiates the broadcast of a RoundReset message by sending it down to its children. A processor x, on receiving a RoundReset message, does the following:
– At this point, the processor suspends processing of the default queue until the end-of-round processing is completed. Thus, all new triggers are queued up without being processed. This ensures that the D(x) value is not modified while the end-of-round procedure is in progress.
– If x is not a leaf processor, it forwards the RoundReset message to its children; if it is a leaf processor, it initiates the second phase as described below.

Second Phase: In this phase, the D(x) values are sum-reduced at the root from all the processors. The second phase starts when a leaf processor receives a RoundReset message, in response to which it initiates a Reduce message containing its D(x) value and passes it to its parent.
When a non-leaf processor has received Reduce messages from all its children, it adds the values in these messages to its own D(x) and sends a Reduce message to its parent with this sum. Thus, the root collects the sum of D(x) over all the processors. This sum w′ is the total number of triggers received by the system so far. Subtracting w′ from w, the root computes the initial value ŵ for the next round. If ŵ = 0, the root raises an alert and terminates the algorithm. Otherwise, the root initiates the third phase.

Third Phase: In this phase, the root processor broadcasts the new ŵ value by sending an Inform message to its children. A processor x, on receiving the Inform message, performs the following:
– It computes the threshold τ(ℓ) value for the new round, where ℓ is the layer number of x.
– If x is a non-leaf processor, it forwards the Inform message to its children; if x is a leaf processor, it initiates the fourth phase as described below.
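The Reduce phase amounts to a bottom-up sum over the tree; a minimal centralized sketch (names ours):

```python
def reduce_sum(children, d, node=0):
    """Sketch of the Reduce phase: each processor adds the sums received from
    its children to its own D(x) and forwards the total to its parent.
    `children` maps a node id to its child ids; `d` maps a node id to D(x)."""
    return d[node] + sum(reduce_sum(children, d, c) for c in children.get(node, []))
```

The broadcast phases (RoundReset, Inform) are the same recursion in the opposite direction, so every processor handles only a constant number of end-of-round messages per round.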


Fourth Phase: In this phase, the processors send an acknowledgement up to the root and enter the new round. The fourth phase starts when a leaf processor x receives an Inform message. After performing the processing for the Inform message, it performs the following actions:
– It increments RoundNum. This signifies that the processor has entered the next round. After this point, the processor does not process any coins from the previous rounds. Whenever the processor receives a coin generated in a previous round, it simply discards the coin.
– C(x) is reset to zero.
– It sends an InformAck to its parent.
– The processor x resumes processing of the default queue. This way, x will start processing the outstanding triggers (if any).
When a non-leaf node receives InformAck messages from all its children, it performs the same processing as above. When the root processor has received InformAck messages from all its children, the system enters the new round. We note that it is possible to implement the end-of-round procedure using three phases. However, the fourth phase (of sending acknowledgements) ensures that at any point of time, the processors can only be in two different (consecutive) rounds. Moreover, when the root receives the InformAck messages from all its children, all the processors in the system are in the same round. Thus, end-of-round processing for different rounds cannot be in progress simultaneously.

5.3 Correctness of the Algorithm

We now show that the system correctly raises an alert to the user when all the w triggers are delivered. The main part of the proof involves showing that, after starting a new round, the root always enters the end-of-round procedure. Furthermore, we also show that the system does not incorrectly raise an alert to the user before w triggers are delivered. We say that a trigger is unprocessed if the trigger has been delivered to a processor and is waiting in its default queue. A processor is said to be in round k if its RoundNum equals k. A trigger is said to be processed in round k if the processor that received this trigger is in round k when it processes the trigger. Consider the point in time t when the system has entered a new round k. Let ŵ be the initial value of the round. Recall that in the second phase, the root computes w′ = Σ_{x∈P} D(x) and sets ŵ = w − w′, where P is the set of all processors. Notice that in the first phase, all processors suspend processing triggers from the default queue. Trigger processing is resumed only in the fourth phase, after RoundNum has been incremented. Therefore, no more triggers are processed in round k − 1. It follows that w′ is the total number of triggers that have been processed in the (previous) rounds k′ ≤ k − 1. Thus, any trigger processed in round k will be accounted for in the counter C(x) of some processor x. This observation leads to the following argument. We now show that the root initiates the end-of-round procedure upon receiving at most ŵ triggers. Suppose all the ŵ triggers have been delivered and


processed in this round. Furthermore, assume that all the coins generated and sent in the above process have also been received and processed. Clearly, such a state will be reached at some point in time, since we assume a reliable communication network. At this point of time, we have Σ_{x∈P} C(x) = ŵ. At any point of time after t, we have Σ_{x∈P−{r}} C(x) ≤ ŵ/4, where P is the set of all processors and r is the root processor. The claim is proved using the same arguments as in Sect. 4.1 and the fact that the processors discard the coins generated in previous rounds. From the above relations, we get that C(r) ≥ 3ŵ/4 ≥ ŵ/2. The root initiates the end-of-round procedure whenever C(r) crosses ŵ/2. Thus, the root will eventually start the end-of-round procedure; hence the system never gets stalled in the middle of the round. Clearly, the system raises an alert on receiving w triggers. We now argue that the system does not raise an alert before receiving w triggers. This follows from the fact that ŵ for a new round is calculated on the basis of the D(x) counters. The analysis of the message complexity and the MaxRcvLoad is unaffected.

6 Conclusions

We have presented a randomized algorithm for the DTC problem that reduces the MaxRcvLoad of any node from O(n log w) to O(log n log w) with high probability. The ultimate goal of this line of work would be to design a deterministic algorithm with MaxRcvLoad O(log w).

References

1. Awerbuch, B.: Complexity of network synchronization. J. ACM 32(4), 804–823 (1985)
2. Chang, E.: Echo algorithms: Depth parallel operations on general graphs. IEEE Trans. Software Eng. 8(4), 391–401 (1982)
3. Cormode, G., Muthukrishnan, S., Yi, K.: Algorithms for distributed functional monitoring. In: SODA (2008)
4. Garg, R., Garg, V.K., Sabharwal, Y.: Scalable algorithms for global snapshots in distributed systems. In: 20th Int. Conf. on Supercomputing, ICS (2006)
5. Garg, V., Ghosh, J.: Repeated computation of global functions in a distributed environment. IEEE Trans. Parallel Distrib. Syst. 5(8), 823–834 (1994)
6. Huang, L., Garofalakis, M., Joseph, A., Taft, N.: Communication-efficient tracking of distributed cumulative triggers. In: ICDCS (2007)
7. Keralapura, R., Cormode, G., Ramamirtham, J.: Communication-efficient distributed monitoring of thresholded counts. In: SIGMOD Conference (2006)
8. Mitzenmacher, M., Upfal, E.: Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge Univ. Press, Cambridge (2005)
9. Segall, A.: Distributed network protocols. IEEE Transactions on Information Theory 29(1), 23–34 (1983)
10. Tel, G.: Distributed infimum approximation. In: Lupanov, O.B., Bukharajev, R.G., Budach, L. (eds.) FCT 1987. LNCS, vol. 278, pp. 440–447. Springer, Heidelberg (1987)

Deterministic Dominating Set Construction in Networks with Bounded Degree

Roy Friedman and Alex Kogan

Department of Computer Science, Technion, Israel
{roy,sakogan}@cs.technion.ac.il

Abstract. This paper considers the problem of calculating dominating sets in networks with bounded degree. In these networks, the maximal degree of any node is bounded by Δ, which is usually significantly smaller than n, the total number of nodes in the system. Such networks arise in various settings of wireless and peer-to-peer communication. A trivial approach of choosing all nodes into the dominating set yields an algorithm with an approximation ratio of Δ + 1. We show that any deterministic algorithm with a non-trivial approximation ratio requires Ω(log∗ n) rounds, meaning effectively that no o(Δ)-approximation deterministic algorithm with a running time independent of the size of the system may ever exist. On the positive side, we show two deterministic algorithms that achieve log Δ- and 2 log Δ-approximation in O(Δ^3 + log∗ n) and O(Δ^2 log Δ + log∗ n) time, respectively. These algorithms rely on coloring rather than node IDs to break symmetry.

1 Introduction

The dominating set problem is a fundamental problem in graph theory. Given a graph G, a dominating set of the graph is a set of nodes such that every node in G is either in the set or has a direct neighbor in the set. This problem, along with its variations, such as the connected dominating set or the k-dominating set, plays a significant role in many distributed applications, especially in those running over networks that lack any predefined infrastructure. Examples include mobile ad-hoc networks (MANETs), wireless sensor networks (WSNs), peer-to-peer networks, etc. The main application of dominating sets in such networks is to provide a virtual infrastructure, or overlay, in order to achieve scalability and efficiency. Such overlays are mainly used to improve routing schemes, where only nodes in the set are responsible for routing messages in the network (e.g., [29, 30]). Other applications of dominating sets include efficient power management [11, 30] and clustering [3, 14]. In many cases, the network graph is such that each node has a limited number of direct neighbors. Such a limitation may result from several reasons. First,

This work is partially supported by the Israeli Science Foundation grant 1247/09 and by the Technion Hasso Plattner Center.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 65–76, 2011. © Springer-Verlag Berlin Heidelberg 2011


it can represent a hardware limitation, such as a bounded number of communication ports in a device [8]. Second, it can be an outcome of an inherent communication protocol limitation, as in the case of Bluetooth networks composed of units, called piconets, that include at most eight devices [10]. Finally, performance considerations, such as space complexity and network scalability, may limit the number of nodes with which each node may communicate directly. This is a common case for structured peer-to-peer networks, where each node selects a constant number of neighbors when it joins the network [17, 25]. The problem of finding a dominating set that has a minimal number of nodes is known to be NP-complete [12] and, in fact, it is also hard to approximate [9]. Although the approximation ratio of existing solutions for the dominating set problem, O(log Δ), was found to be the best possible (to within a lower-order additive factor, unless NP has an n^{O(log log n)}-time deterministic algorithm [9]), the gap between lower and upper bounds on the running time of distributed deterministic solutions remains wide. Kuhn et al. [19] showed that any distributed approximation algorithm for the dominating set problem with a polylogarithmic approximation ratio requires at least Ω(√(log n / log log n)) communication rounds. Along with that, the existing distributed deterministic algorithms incur a linear (in the number of nodes) running time [7, 23, 29]. This worst-case upper bound remains valid even when the graphs of interest are restricted to the bounded degree case, like the ones described above. The deterministic approximation algorithms [7, 23, 29] are based on the centralized algorithm of Guha and Khuller [13], which in turn is based on a greedy heuristic for the related set-cover problem [5]. Following the heuristic, these algorithms start with an empty dominating set and proceed as follows.
Each node calculates its span, the number of uncovered neighbors, including the node itself. (A node is uncovered if it is not in the dominating set and does not have any neighbor in the set.) It then exchanges its span with all nodes within a distance of 2 hops and decides whether to select itself into the dominating set based on its span and the spans of the nodes within distance 2. These iterations are repeated by a node v as long as v or at least one of its neighbors is uncovered.

distance 2. Then, we run the same iterative process described above, while using colors instead of IDs to break ties between nodes with equal span, shortening the length of the maximal chain. This approach results in a distributed deterministic algorithm with an approximation ratio of log Δ (or, more precisely, log Δ + O(1)) and a running time of O(Δ^3 + log∗ n). Notice, though, that the coloring required by our algorithm can be precomputed for other purposes, e.g., time-slot scheduling for wireless channel access [15, 28]. When the coloring is given, the running time of the algorithm becomes O(Δ^3), independent of the size of the system. We also describe a modification to our algorithm that reduces its running time to O(Δ^2 log Δ + log∗ n) (O(Δ^2 log Δ) in case the coloring is already given), while the approximation ratio increases by a constant factor. An essential question that arises in the context of bounded degree networks is whether it is possible to construct a local approximation algorithm, i.e., an algorithm with a running time that depends solely on the degree bound. As already stated above, in the general case, Kuhn et al. [19] provide a negative answer and state that at least Ω(√(log n / log log n)) communication rounds are needed. Along with that, in several other related communication models, such as the unit disk graph, local approximation algorithms are known to exist [6]. In this paper, we show that any deterministic algorithm with a nontrivial approximation ratio requires at least Ω(log∗ n) rounds, thus answering the question stated above in the negative. In light of this lower bound, our modified algorithm leaves an additive gap of O(Δ^2 log Δ).
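The dependency-chain effect of ID-based tie-breaking can be reproduced with a small centralized simulation of the span-and-ID greedy sketched above (names are ours; IDs run 0..n−1 here rather than 1..n):

```python
def greedy_dominating_set(adj):
    """Centralized simulation of the distributed greedy of [7, 23, 29]:
    in each phase, a node joins if its (span, ID) pair is lexicographically
    maximal within its 2-hop neighborhood. `adj` maps a node to its neighbors."""
    dom, covered = set(), set()

    def span(v):  # uncovered nodes in v's closed neighborhood
        return sum(1 for u in {v, *adj[v]} if u not in covered)

    phases = 0
    while covered != set(adj):
        joined = []
        for v in adj:
            if span(v) == 0:
                continue
            two_hop = {v} | {w for u in adj[v] for w in {u, *adj[u]}}
            if all((span(v), v) >= (span(u), u) for u in two_hop):
                joined.append(v)
        for v in joined:       # winners of this phase join the set
            dom.add(v)
            covered |= {v, *adj[v]}
        phases += 1
    return dom, phases
```

On a 6-node ring this selects node 5 in the first phase and node 2 in the second, i.e., roughly n/3 phases, matching the argument above.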

2 Related Work

Due to its importance, the dominating set problem has been considered in various networking models. For general graphs, the best distributed deterministic O(log Δ)-approximation algorithms have linear running time [7, 23, 29]. In fact, these algorithms perform no better than a trivial approach where each node collects a global view of the network by exchanging messages with its neighbors and then locally calculates a dominating set approximation by running, e.g., the centralized algorithm of Guha and Khuller [13]. The only lower bound known for general graphs is due to Kuhn et al. [19], which states that at least Ω(√(log n / log log n)) communication rounds are needed to find a constant or polylogarithmic approximation.¹ Their proof relies on a construction of a special family of graphs in which the maximal node degree depends on the size of the graph. Thus, this construction cannot be realized in the bounded degree model. Another body of work considers unit-disk graphs (UDGs), which are claimed to model the communication in wireless ad-hoc networks. Although the dominating set problem remains NP-hard in this model, approximation algorithms with a constant ratio are known (e.g., [6, 20]). Recently, Lenzen and Wattenhofer [22] showed that any f-approximation algorithm for the dominating set problem in the UDG model runs in g(n) time, where f(n)g(n) ∈ Ω(log∗ n). In

¹ This work assumes unbounded local computations.


Table 1. Comparison of results on distributed deterministic O(log Δ)-approximation of optimal dominating sets

Model          | Running time                            | Algorithm/Lower bound
General        | O(n)                                    | [7, 23, 29]
General        | Ω(√(log n / log log n))                 | [19]
Bounded degree | O(log∗ n + Δ^3), O(log∗ n + Δ^2 log Δ)  | this paper
Bounded degree | Ω(log∗ n)                               | this paper

contrast, we consider a different model of graphs with bounded degree nodes, in which Δ is not a constant number but rather an independent parameter of the problem. This enables us to obtain a more refined lower bound. Specifically, we show that while obtaining an O(Δ)-approximation of the optimal dominating set in our model is possible even without any communication, any o(Δ)-approximation algorithm requires Ω(log∗ n) time. Although our proof employs a similar (ring) graph, which can also be realized in the UDG model, the formalism we use allows us to obtain our lower bound in a shorter and more straightforward way. The dominating set problem in bounded degree networks was considered by Chlebik and Chlebikova [4], who derive explicit lower bounds on the approximation ratios of centralized solutions. While we are not aware of any previous work on distributed approximation of dominating sets in bounded degree networks, several related problems have been considered in this setting. Very recently, Astrand, Suomela et al. provided distributed deterministic approximation algorithms for a series of such problems, e.g., vertex cover [1, 2] and set cover [2]. Panconesi and Rizzi considered maximal matchings and various colorings [26]. It is worth mentioning several randomized approaches that have been proposed for the general graph model and which can also be applied in the setting of networks with bounded degree. For instance, Jia et al. [16] propose an algorithm with O(log n log Δ) running time, while Kuhn et al. [21] achieve an even better O(log^2 Δ) running time. These solutions, however, provide only probabilistic guarantees on the running time and/or approximation ratio (for example, the former achieves an approximation ratio of O(log Δ) in expectation and O(log n) with high probability), while our approach deterministically achieves an approximation ratio of log Δ. The results of previous work, along with the contributions of this paper, are summarized in Table 1.

3 Model and Preliminaries

We model the network as a graph G = (V, E). The number of nodes is n and the degree of any node in the graph is limited by a global parameter Δ. We assume that both n and Δ are known to any node in the system. Also, we assume that each node has a unique identiﬁer of size O(log n). In fact, both assumptions are required only by the coloring procedure we use as a subroutine [18]. Our lower bound does not require the latter assumption and, in particular, holds for anonymous networks as well.


Fig. 1. A (partial) 2-ring graph R(n, 2)

Fig. 2. A subgraph G of R(n, 2) (figure labels: nodes vi and vj; a segment of f(n) nodes between two segments of k · o(log∗ n) nodes each)

Our model of computation is a synchronous, message-passing system (denoted LOCAL in [27]) with reliable processes and reliable links. In particular, time is divided into rounds and, in every round, a node may send one message of arbitrary size to each of its direct neighbors in G, receive all messages sent to it by its direct neighbors in the same round, and perform some local computation. Consequently, for any given pair of nodes v and u at a distance of k edges in G, a message sent by v in round i may reach u not before round i + k − 1. All nodes start the computation at the same round. The time complexity of the algorithms presented below is the number of rounds from the start until the last node ceases to send messages.

Let Nk(v, G) denote the k-neighborhood of a node v in a graph G, that is, Nk(v, G) is the set of all nodes (not including v itself) that are at most k hops from v in G. In the following definitions, all node indices are taken modulo n.

Definition 1. A ring graph R(n) = (Vn, En) is a circle graph consisting of n nodes, where Vn = {v1, v2, ..., vn} and En = {(vi, vi+1) | 1 ≤ i ≤ n}. A k-ring graph R(n, k) = (Vn, En^k) is an extension of the ring graph, where Vn = {v1, v2, ..., vn} and En^k = {(vi, u) | u ∈ Nk(vi, R(n)) ∧ 1 ≤ i ≤ n}.

Notice that in R(n, k) each node v has exactly 2k edges, one to each of its neighbors in Nk(v, R(n)) (see Fig. 1). Given R(n, k) and two nodes vi, vj ∈ Vn, i ≤ j, let Sub(R(n, k), vi, vj) be the subgraph (V, E) where V = {vk ∈ Vn | i ≤ k ≤ j}. Thus, assuming a clockwise ordering of nodes on the ring, Sub(R(n, k), vi, vj) contains the sequence of nodes between vi and vj in the clockwise direction. The nodes vi and vj are referred to as boundary nodes in the sequence.

Definition 2. Suppose A is an algorithm operating on R(n, k) and assigning each node vi ∈ Vn a value c(vi) ∈ {0, 1}. Let l(vi) = min_j {j ≤ i | ∀k, j ≤ k ≤ i : c(vk) = c(vi)}. Similarly, let r(vi) = max_j {j ≥ i | ∀k, i ≤ k ≤ j : c(vk) = c(vi)}. Then Seq(vi) = Sub(R(n, k), vl(vi), vr(vi)) is the longest sequence of nodes


R. Friedman and A. Kogan

containing vi and in which all nodes have the value c(vi). We call v_l(vi) the leftmost node in Seq(vi), the node preceding it in the clockwise order the second leftmost node, and so on.
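As a quick illustration of Definition 1 (a sketch in Python with our own helper names, not code from the paper), one can build R(n, k) as an adjacency structure and check the degree claim:

```python
def k_ring(n, k):
    """Return adjacency sets of R(n, k): v_i is linked to the k nodes on
    either side of it along the ring (indices taken modulo n).
    Assumes n > 2k, so the 2k neighbors are distinct."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            adj[i].add((i + d) % n)  # d hops clockwise
            adj[i].add((i - d) % n)  # d hops counter-clockwise
    return adj

# every node of R(12, 3) has exactly 2k = 6 neighbors
adj = k_ring(12, 3)
assert all(len(neigh) == 6 for neigh in adj.values())
```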

4 Proof of Bounds

4.1 Lower Bound

The minimum dominating set of any bounded degree graph has a size of at least n/(Δ+1). Thus, the simple approach of choosing all nodes of the graph into the dominating set gives a trivial (Δ+1)-approximation of the optimal set. An essential question is whether a non-trivial approximation can be calculated deterministically in bounded degree graphs in an effective way, i.e., independent of the system size. The following theorem gives a negative answer to this question.

Theorem 1. Any distributed deterministic o(Δ)-approximation algorithm for the dominating set problem in a bounded degree graph requires Ω(log* n) time.

Proof. Assume, by way of contradiction, that there exists a deterministic algorithm A that finds an o(Δ)-approximation in o(log* n) time. Given a ring of n nodes, R(n), the following algorithm colors it with 3 colors, for any given k.

– Construct the k-ring graph R(n, k) and run A on it. For each node vi ∈ Vn, set the value c(vi) to 1 if A selects vi into the dominating set, and to 0 otherwise.
– Every node vi ∈ Vn chooses its color according to whether or not vi and some of its neighbors are chosen into the dominating set by A. Specifically, consider the sequence Seq(vi) as defined in Def. 2.
  • If vi is not in the set, the nodes in the sequence are colored with colors 2 and 1 alternately. That is, the leftmost node in the sequence chooses color 2, the second leftmost node chooses color 1, the third leftmost node chooses color 2, and so on.
  • If vi is in the set, the nodes in the sequence are colored with colors 0 and 1 alternately. That is, the leftmost node in the sequence chooses color 0, the second leftmost node chooses color 1, the third leftmost node chooses color 0, and so on.

The produced coloring uses 3 colors and admits a straightforward distributed implementation.
Notice that the coloring is legal (i.e., no two adjacent nodes share the same color) inside sequences of nodes chosen, and not chosen, to the dominating set by A. Thus, the legality of the produced coloring needs to be verified only at the boundaries where sequences end. Consider two neighboring nodes (in R(n)) v and u, where v is a left neighbor of u (i.e., v appears immediately after u in the ring when considering nodes in the clockwise direction). If v is in the set and u is not, then the color of u, being the leftmost in the sequence of nodes not in the set, is 2, while the color of v is 0 or 1. Similarly, if u is in the set and v is not, then the color of u, being the leftmost in the sequence of nodes in the set, is 0, while the color of v is 2 or 1. Thus, the produced coloring is legal.
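The run-by-run coloring rule and its legality check can be exercised centrally with a short sketch (our own Python, for illustration only; it assumes the 0/1 assignment contains both values, as is the case for the runs discussed in the proof; we alternate from the first node of each run in increasing index order, and which endpoint is called "leftmost" does not affect legality):

```python
def color_ring(c):
    """c: list of 0/1 values on ring positions (1 = chosen into the
    dominating set). Color each maximal run of equal values alternately:
    runs of 0-nodes with colors 2,1,2,... and runs of 1-nodes with 0,1,0,...
    Returns one color per position. Assumes both 0 and 1 occur in c."""
    n = len(c)
    # a position whose (circular) predecessor differs starts some run
    start = next(i for i in range(n) if c[i] != c[i - 1])
    colors = [None] * n
    i = start
    while None in colors:
        run = [i]                       # collect the maximal run starting at i
        while c[(run[-1] + 1) % n] == c[i] and len(run) < n:
            run.append((run[-1] + 1) % n)
        first = 2 if c[i] == 0 else 0   # runs of 0s: 2,1,...; runs of 1s: 0,1,...
        for pos, node in enumerate(run):
            colors[node] = first if pos % 2 == 0 else 1
        i = (run[-1] + 1) % n           # continue with the next run
    return colors

cols = color_ring([1, 0, 0, 1, 1, 0, 0, 0])
n = len(cols)
assert all(cols[i] != cols[(i + 1) % n] for i in range(n))  # legal 3-coloring
```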

Deterministic Dominating Set Construction


The running time of the algorithm is g(n) ∈ o(log* n) rounds spent for running A plus an additional number of rounds to decide on colors. The length of the longest sequence of nodes not in the dominating set cannot exceed 2k, since otherwise there would be a node not covered by any node in the selected dominating set. Thus, the implementation of the first coloring rule requires a constant number of rounds. In the following, we show that there exists k such that the length of the longest sequence chosen to the dominating set by A is o(log* n). Thus, for this k, nodes decide on their colors in o(log* n) time. Consequently, the total running time of the algorithm to color a ring with 3 colors sums up to o(log* n), contradicting the famous lower bound of Linial [24].

We are left with the claim that for some k, the length of the longest sequence of nodes chosen to the dominating set by A is o(log* n). Suppose, by way of contradiction, that for any k there exists a function f(n) ∈ Ω(log* n) such that A produces a sequence of length f(n). Let vi and vj be the boundary nodes of such a sequence, i ≤ j, and construct a subgraph G′ = Sub(R(n, k), v_{i − k·g(n)}, v_{j + k·g(n)}). Notice that this subgraph contains the same f(n) nodes chosen by A into the dominating set plus 2k·g(n) additional nodes (see Fig. 2). Also note that a minimum dominating set in G′, Opt(G′), contains (1/(2k+1))·(f(n) + 2k·g(n)) nodes. When A is run on G′, nodes in the original sequence of length f(n) cannot distinguish between the two graphs, i.e., R(n, k) and G′. This is because in our model, a node can collect information in o(log* n) rounds only from nodes at distance of at most o(log* n) edges from it. Thus, being completely deterministic, A must select the same f(n) nodes (plus some additional nodes to ensure that all nodes in G′ are covered). Consequently, |A(G′)| ≥ f(n), where |A(G′)| denotes the size of the dominating set calculated by A for the graph G′.
On the other hand, A has an o(Δ)-approximation ratio; thus, for any graph G, |A(G)| ≤ o(Δ)·|OPT(G)| + c, where c is some non-negative constant. For simplicity, we assume c = 0; the proof does not change much for c > 0. In the graph R(n, k) (and G′), Δ = 2k; thus there exist Δ′ and k s.t. 2·o(Δ′) = 2·o(2k) < 2k + 1. In addition, since f(n) ∈ Ω(log* n) and g(n) ∈ o(log* n), there exists n′ > k s.t. 2k·g(n′) < f(n′). Thus, for Δ′, k and n′, we get:

o(Δ′) · |OPT(G′)| = o(Δ′) · (1/(2k+1)) · (f(n′) + 2k·g(n′)) < o(Δ′) · (2/(2k+1)) · f(n′) < f(n′) ≤ |A(G′)|,

contradicting the fact that A has an o(Δ)-approximation ratio.

It follows immediately from the previous theorem that no local deterministic algorithm achieving an optimal O(log Δ)-approximation can exist.

Corollary 1. Any distributed deterministic O(log Δ)-approximation algorithm for the dominating set problem in a bounded degree graph requires Ω(log* n) time.


4.2 Upper Bound

First, we describe an algorithm that achieves a log Δ-approximation in O(Δ³ + log* n) time. Next, we show a modified version that runs in O(Δ² log Δ + log* n) time and achieves a 2 log Δ-approximation. We will use the following notion:

Definition 3. A k-distance coloring is an assignment of colors to nodes such that any two nodes within k hops of each other have distinct colors.

Our algorithm consists of two parts. The first part is a 2-distance coloring routine, implemented by means of a coloring algorithm provided by Kuhn [18]. Kuhn's distributed deterministic algorithm produces a 1-distance coloring for any input graph G using Δ + 1 colors in O(Δ + log* n) time. For our purpose, we run this algorithm on the graph G², created from G by (virtually) connecting each node with each of its neighbors at distance 2. Any message sent on such a virtual link is routed by an intermediate node to its target, increasing the running time of the algorithm by only a constant factor.

The second part of the algorithm is the approximation routine, which is a simple application of the greedy heuristic described in Sect. 1, where colors obtained in the first phase are used to break ties instead of IDs. That is, nodes exchange their span and color with all neighbors at distance 2 and decide to join the set if their ⟨span, color⟩ pair is lexicographically higher than any of the received pairs.

The pseudo-code for the algorithm is given in Algorithm 1. It denotes the set of immediate neighbors of a node i by N1(i) and the set of neighbors of i at distance 2 by N2(i). Additionally, each node i uses the following local variables:

– color: array with the colors assigned to each node j ∈ N2(i) by the 2-distance coloring routine. Initially, all values are set to ⊥.
– state: array that holds the state of each node j ∈ N1(i). The state can be uncovered, covered or marked. Initially, all values are set to uncovered. The nodes chosen to the dominating set are those that finish the algorithm with their state set to marked.
– span: array with values for each node j ∈ N2(i); span[j] holds the number of nodes in N1(j) ∪ {j} that are not yet covered by any node already selected into the dominating set, as reported by j. Initially, all values are set to ⊥.
– done: boolean array that specifies for each node j ∈ N1(i) whether j has finished the algorithm. Initially, all values are set to false.

Theorem 2. The algorithm in Algorithm 1 computes a dominating set with an approximation ratio of log Δ in O(Δ³ + log* n) time.

Proof. We start by proving the bound on the running time of the algorithm. The 2-distance coloring routine requires O(Δ² + log* n) time. This is because the maximal degree of nodes in the graph G² is bounded by Δ(Δ − 1) and each round of the coloring algorithm of Kuhn [18] in G² can be simulated by at most 2 rounds in the given graph G.


Algorithm 1. code for node i

 1: color[i] = calc-2-dist-coloring()          ▹ use the coloring algorithm of [18]
 2: distribute-and-collect(color, 2)
 3: while state[j] = uncovered for any j ∈ N1(i) ∪ {i} do
 4:     span[i] := |{state[j] = uncovered | j ∈ N1(i) ∪ {i}}|
 5:     distribute-and-collect(span, 2)
 6:     if ⟨span[i], color[i]⟩ > max{⟨span[j], color[j]⟩ | j ∈ N2(i) ∧ span[j] ≠ ⊥} then
 7:         state[i] := marked
 8:     distribute-and-collect(state, 1)
 9:     if state[j] = marked for any j ∈ N1(i) then
10:         state[i] := covered
11:     distribute-and-collect(state, 1)
12: done
13: broadcast done to all neighbors

distribute-and-collect(array_i, radius):
14: foreach q in [1, 2, ..., radius] do
15:     broadcast array_i to all neighbors
16:     receive array_j from all j ∈ N1(i) s.t. done[j] = false
17:     foreach node l at distance q from i do
18:         if ∃j ∈ N1(i) s.t. done[j] = false ∧ node l is at distance q − 1 from j then
19:             array_i[l] = array_j[l]
20:     done
21: done

when done is received from j:
22: done[j] = true
23: span[j] = ⊥
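To make the tie-breaking rule concrete, here is a centralized, round-by-round simulation of the greedy phase (our own sketch, not the distributed pseudo-code; it assumes every node has at least one other node within distance 2, and that `color` is a valid 2-distance coloring, e.g. unique node IDs):

```python
def greedy_dominating_set(adj, color):
    """Simulate Algorithm 1's greedy phase centrally.
    adj: dict node -> set of neighbors; color: a 2-distance coloring,
    i.e. distinct colors for any two nodes within 2 hops."""
    nodes = list(adj)
    state = {v: "uncovered" for v in nodes}
    chosen = set()

    def n2(v):  # all nodes within distance 2 of v, excluding v itself
        out = set(adj[v])
        for u in adj[v]:
            out |= adj[u]
        out.discard(v)
        return out

    while any(state[v] == "uncovered" for v in nodes):
        # span = number of uncovered nodes in the closed neighborhood
        span = {v: sum(1 for u in adj[v] | {v} if state[u] == "uncovered")
                for v in nodes}
        # a node joins iff its (span, color) pair is the strict lexicographic
        # maximum within distance 2 (ties are impossible there: colors differ)
        joining = [v for v in nodes
                   if (span[v], color[v]) > max((span[u], color[u])
                                                for u in n2(v))]
        for v in joining:
            chosen.add(v)
            for u in adj[v] | {v}:
                state[u] = "covered"
    return chosen

# a 6-node path; unique IDs serve as a (trivial) 2-distance coloring
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
ds = greedy_dominating_set(path, color={v: v for v in path})
assert all(v in ds or ds & path[v] for v in path)  # ds dominates the path
```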

The maximal value that the span can take is Δ + 1, while the number of colors produced by the coloring procedure is O(Δ²). Thus, the maximal number of distinct values of all ⟨span, color⟩ pairs is O(Δ³). In every iteration of the greedy heuristic (the while-do loop in Lines 3–12 of Algorithm 1), all nodes whose ⟨span, color⟩ pair is maximal within distance 2 join the set. Thus, after at most O(Δ³) iterations, all nodes are covered, while each iteration can be implemented in O(1) synchronous communication rounds. Summing over both phases produces the required bound on the running time. Note that if coloring is not required, the running time is independent of n.

For the approximation ratio, observe that the span of a node is influenced only by its neighbors at distance of at most 2 hops. Also, notice that the dominating set problem is easily reduced to the set-cover problem (by creating a set for each node along with all its neighbors [13]). Thus, the algorithm chooses essentially the same nodes as the well-known centralized greedy heuristic for the set-cover problem [5], which picks sets based on the number of uncovered elements they contain. Thus, the approximation ratio of the algorithm follows directly from the analysis of that heuristic (for details, see [5]).


To reduce the running time of the algorithm (at the price of increasing the approximation ratio by a factor of 2), we modify the algorithm to work with an adjusted span for each node u. The adjusted span is the smallest power of 2 that is at least as large as the number of u's uncovered neighbors (including u itself). Thus, during the second phase of the algorithm, u exchanges its adjusted span and color with all nodes at distance 2 and decides to join the dominating set if its ⟨adjusted span, color⟩ pair is lexicographically higher than that of any node at distance 2. Note that one might round the span to powers of any other constant c > 1, slightly improving the approximation ratio, but not the asymptotic running time.

Theorem 3. The modified algorithm computes a dominating set with an approximation ratio of 2 log Δ in O(Δ² log Δ + log* n) time.

Proof. The adjusted span can take at most log Δ distinct values, while the number of colors produced by the coloring procedure is O(Δ²). Thus, similarly to the proof of Theorem 2, we can infer that the running time is O(Δ² log Δ + log* n). The factor 2 in the approximation ratio is due to the span adjustment. To prove this claim, consider the centralized greedy heuristic for the set-cover problem [5] with the adjusted span modification. That is, the number of uncovered elements in a set S is replaced (adjusted) by the smallest power of 2 which is at least as large as this number, and at each step the heuristic chooses a set that covers the largest adjusted number of uncovered elements. Following the observation in the proof of Theorem 2, establishing the approximation ratio for the centralized set-cover heuristic with the adjusted span modification establishes the approximation ratio of the modified dominating set algorithm.
When the (modified or unmodified) greedy heuristic chooses a set S, suppose that it charges each newly covered element of S the price 1/i, where i is the number of elements of S uncovered at that step. As a result, the total price paid by the heuristic is exactly the number of sets it chooses, while each element is charged only once. Consider a set S* = {ek, ek−1, ..., e1} in the optimal set-cover solution Sopt, and assume without loss of generality that the greedy heuristic covers the elements of S* in the given order: ek, ek−1, ..., e1. Consider the step at which the heuristic chooses a set that covers ei. At the beginning of that step, at least i elements of S* are uncovered. Thus, if the heuristic were to choose the set S* at that step, it would pay a price of at most 1/i per element. Using the adjusted span modification, the heuristic might pay at that step at most twice that price per element covered, i.e., it pays for ei at most 2/i. Consequently, the total price paid by the heuristic to cover all elements of S* is at most Σ_{1≤i≤k} 2/i = 2Hk, where Hk = Σ_{1≤i≤k} 1/i = log k + O(1) is the k-th harmonic number. Thus, since every element is in some set of Sopt, we get that in order to cover all elements, the modified greedy heuristic pays at most Σ_{S∈Sopt} 2Hm = 2Hm · Σ_{S∈Sopt} 1 = 2Hm·|Sopt|, where m is the size of the biggest set in Sopt. In an instance of the set-cover problem produced from a graph with bounded degree Δ, m = Δ + 1, which establishes the required approximation ratio.
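The centralized set-cover greedy with the power-of-2 rounding discussed above can be sketched compactly (our own code and instance, not from the paper; it assumes the sets jointly cover the universe):

```python
def adjusted_greedy_set_cover(universe, sets):
    """Greedy set cover where a set's gain (its number of still-uncovered
    elements) is rounded up to the nearest power of 2 before comparison,
    mirroring the 'adjusted span' modification."""
    uncovered, chosen = set(universe), []
    while uncovered:
        def adjusted_gain(s):
            g = len(s & uncovered)
            return 0 if g == 0 else 1 << (g - 1).bit_length()  # smallest 2^j >= g
        best = max(sets, key=adjusted_gain)
        chosen.append(best)
        uncovered -= best
    return chosen

universe = set(range(10))
sets = [set(range(5)), set(range(5, 10)), set(range(3, 8)), {0}, {9}]
cover = adjusted_greedy_set_cover(universe, sets)
assert set().union(*cover) == universe  # the chosen sets cover the universe
```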

5 Conclusions

In this paper, we examined distributed deterministic solutions for the dominating set problem, one of the most important problems in graph theory, in the scope of graphs with bounded node degree. Such graphs are useful for modeling networks in many realistic settings, such as various types of wireless and peer-to-peer networks. For these graphs, we showed that no purely local, i.e., independent of the number of nodes, deterministic algorithm that calculates a non-trivial approximation can exist. This lower bound is complemented by two approximation algorithms. The first algorithm finds a log Δ-approximation in O(Δ³ + log* n) time, while the second one achieves a 2 log Δ-approximation in O(Δ² log Δ + log* n) time. These results compare favorably to previous deterministic algorithms with running time of O(n). With regard to the lower bound, they leave an additive gap of O(Δ² log Δ) for further improvement. In the full version of this paper, we show a simple extension of our bounds to weighted bounded degree graphs.

Acknowledgments. We would like to thank Fabian Kuhn and Jukka Suomela for fruitful discussions on the subject, and the anonymous reviewers, whose valuable comments helped to improve the presentation of this paper.

References

1. Åstrand, M., Floréen, P., Polishchuk, V., Rybicki, J., Suomela, J., Uitto, J.: A local 2-approximation algorithm for the vertex cover problem. In: Keidar, I. (ed.) DISC 2009. LNCS, vol. 5805, pp. 191–205. Springer, Heidelberg (2009)
2. Åstrand, M., Suomela, J.: Fast distributed approximation algorithms for vertex cover and set cover in anonymous networks. In: Proc. 22nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 294–302 (2010)
3. Chen, Y.P., Liestman, A.L.: Approximating minimum size weakly-connected dominating sets for clustering mobile ad hoc networks. In: Proc. ACM Int. Symp. on Mobile Ad Hoc Networking and Computing (MobiHoc), pp. 165–172 (2002)
4. Chlebik, M., Chlebikova, J.: Approximation hardness of dominating set problems in bounded degree graphs. Inf. Comput. 206(11) (2008)
5. Chvatal, V.: A greedy heuristic for the set-covering problem. Mathematics of Operations Research 4(3), 233–235 (1979)
6. Czyzowicz, J., Dobrev, S., Fevens, T., Gonzalez-Aguilar, H., Kranakis, E., Opatrny, J., Urrutia, J.: Local algorithms for dominating and connected dominating sets of unit disk graphs with location aware nodes. In: Laber, E.S., Bornstein, C., Nogueira, L.T., Faria, L. (eds.) LATIN 2008. LNCS, vol. 4957, pp. 158–169. Springer, Heidelberg (2008)
7. Das, B., Bharghavan, V.: Routing in ad-hoc networks using minimum connected dominating sets. In: Proc. IEEE Int. Conf. on Communications (ICC), pp. 376–380 (1997)
8. Dong, Q., Bejerano, Y.: Building robust nomadic wireless mesh networks using directional antennas. In: Proc. IEEE INFOCOM, pp. 1624–1632 (2008)


9. Feige, U.: A threshold of ln n for approximating set cover. Journal of the ACM 45, 314–318 (1998)
10. Ferro, E., Potorti, F.: Bluetooth and Wi-Fi wireless protocols: a survey and a comparison. IEEE Wireless Communications 12(1), 12–26 (2005)
11. Friedman, R., Kogan, A.: Efficient power utilization in multi-radio wireless ad hoc networks. In: Abdelzaher, T., Raynal, M., Santoro, N. (eds.) OPODIS 2009. LNCS, vol. 5923, pp. 159–173. Springer, Heidelberg (2009)
12. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co. Ltd., New York (1979)
13. Guha, S., Khuller, S.: Approximation algorithms for connected dominating sets. Algorithmica 20, 374–387 (1998)
14. Han, B., Jia, W.: Clustering wireless ad hoc networks with weakly connected dominating set. Journal of Parallel and Distributed Computing 67(6), 727–737 (2007)
15. Herman, T., Tixeuil, S.: A distributed TDMA slot assignment algorithm for wireless sensor networks. In: Nikoletseas, S.E., Rolim, J.D.P. (eds.) ALGOSENSORS 2004. LNCS, vol. 3121, pp. 45–58. Springer, Heidelberg (2004)
16. Jia, L., Rajaraman, R., Suel, T.: An efficient distributed algorithm for constructing small dominating sets. In: Proc. ACM Symp. on Principles of Distributed Computing (PODC), pp. 33–42 (2001)
17. Kaashoek, M.F., Karger, D.R.: Koorde: A simple degree-optimal distributed hash table. In: Kaashoek, M.F., Stoica, I. (eds.) IPTPS 2003. LNCS, vol. 2735, pp. 98–107. Springer, Heidelberg (2003)
18. Kuhn, F.: Weak graph colorings: distributed algorithms and applications. In: Proc. Symp. on Parallelism in Algorithms and Architectures (SPAA), pp. 138–144 (2009)
19. Kuhn, F., Moscibroda, T., Wattenhofer, R.: What cannot be computed locally! In: Proc. ACM Symp. on Principles of Distributed Computing (PODC), pp. 300–309 (2004)
20. Kuhn, F., Moscibroda, T., Wattenhofer, R.: On the locality of bounded growth. In: Proc. ACM Symp. on Principles of Distributed Computing (PODC), pp. 60–68 (2005)
21. Kuhn, F., Moscibroda, T., Wattenhofer, R.: The price of being near-sighted. In: Proc. ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 980–989 (2006)
22. Lenzen, C., Wattenhofer, R.: Leveraging Linial's locality limit. In: Taubenfeld, G. (ed.) DISC 2008. LNCS, vol. 5218, pp. 394–407. Springer, Heidelberg (2008)
23. Liang, B., Haas, Z.J.: Virtual backbone generation and maintenance in ad hoc network mobility management. In: Proc. IEEE INFOCOM, pp. 1293–1302 (2000)
24. Linial, N.: Locality in distributed graph algorithms. SIAM Journal on Computing 21(1), 193–201 (1992)
25. Malkhi, D., Naor, M., Ratajczak, D.: Viceroy: a scalable and dynamic emulation of the butterfly. In: Proc. ACM Symp. on Principles of Distributed Computing (PODC), pp. 183–192 (2002)
26. Panconesi, A., Rizzi, R.: Some simple distributed algorithms for sparse networks. Distributed Computing 14(2), 97–100 (2001)
27. Peleg, D.: Distributed Computing: A Locality-Sensitive Approach. SIAM, Philadelphia (2000)
28. Rhee, I., Warrier, A., Min, J., Xu, L.: DRAND: distributed randomized TDMA scheduling for wireless ad-hoc networks. In: Proc. 7th ACM Int. Symp. on Mobile Ad Hoc Networking and Computing (MobiHoc), pp. 190–201 (2006)
29. Sivakumar, R., Das, B., Bharghavan, V.: Spine routing in ad hoc networks. Cluster Computing 1(2), 237–248 (1998)
30. Wu, J., Dai, F., Gao, M., Stojmenovic, I.: On calculating power-aware connected dominating sets for efficient routing in ad hoc wireless networks. Journal of Communications and Networks, 59–70 (2002)

PathFinder: Efficient Lookups and Efficient Search in Peer-to-Peer Networks

Dirk Bradler¹, Lachezar Krumov¹, Max Mühlhäuser¹, and Jussi Kangasharju²

¹ TU Darmstadt, Germany
{bradler,krumov,max}@cs.tu-darmstadt.de
² University of Helsinki, Finland
[email protected]

Abstract. Peer-to-peer networks are divided into two main classes: unstructured and structured. Overlays from the first class are better suited for exhaustive search, whereas those from the second class offer very efficient key-value lookups. In this paper we present a novel overlay, PathFinder, which combines the advantages of both classes within one single overlay for the first time. Our evaluation shows that PathFinder is comparable to or even better than existing peer-to-peer overlays in terms of lookup and complex-query performance, and scales to millions of nodes.

1 Introduction

Peer-to-peer overlay networks can be classified into unstructured and structured networks, depending on how they construct the overlay. In an unstructured network the peers are free to choose their overlay neighbors and what they offer to the network.¹ In order to discover whether a certain piece of information is available, a peer must somehow search through the overlay. There are several implementations of such search algorithms. The original Napster used a central index server, Kazaa relied on a hybrid network with supernodes, and the original Gnutella used decentralized flooding of queries [4]. The BubbleStorm network [5] is a fully decentralized network based on random graphs and is able to provide efficient exhaustive search.

Structured networks, on the other hand, have strict rules about how the overlay is formed and where content should be placed within the network. Structured networks are also often called distributed hash tables (DHTs), and the research world has seen several examples of DHTs [3,7]. DHTs are very efficient for simple key-value lookups, because objects are addressed by their unique names; searching in a DHT is hard to make more efficient [6]. However, wildcard searches and complex queries either impose extensive complexity and costs in terms of additional messages or are not supported at all.

Given the attractive properties of these two different network structures, it is natural to ask: Is it possible to combine these two properties in

¹ In this paper we focus on networks where peers store and share content, e.g., files, database items, etc.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 77–82, 2011. c Springer-Verlag Berlin Heidelberg 2011


D. Bradler et al.

one single network? Our answer to this question is PathFinder, a peer-to-peer overlay which combines an unstructured and a structured network in a single overlay. PathFinder is based on a random graph, which gives it a short average path length and a large number of alternative paths, yielding a fault-tolerant, highly robust and reliable overlay topology. Our main contribution is the efficient combination of exhaustive searching and key-value lookups in a single overlay.

The rest of this paper is organized as follows. In Section 2 we present an overview of PathFinder. Section 3 compares it to existing P2P overlays and we conclude in Section 4. Due to space limitations, the reader is referred to [1] for technical aspects such as node join and leave, handling crashed nodes, and network size adaptation. An extensive evaluation of PathFinder under churn and attacks is also presented in [1].

2 PathFinder Design

In this section we present the system model and preliminaries of PathFinder. We also describe how the basic key-value lookup and exhaustive search work. For further basic operations, like node join/leave and handling crashed nodes, see [1].

2.1 Challenges

We designed PathFinder to be fully compliant with the concept of BubbleStorm [5], namely an overlay structure based on random graphs. We augment the basic random graph with a deterministic lookup mechanism (see Section 2.4) to add efficient lookups to the exhaustive search provided by BubbleStorm. The challenge, and one of the key contributions of this paper, is developing a deterministic mechanism for exploiting these short paths to implement DHT-like lookups.

2.2 System Model and Preliminaries

All processes in PathFinder benefit from the properties of its underlying random graph and the routing scheme built on top of it.

PathFinder construction principle. The basic idea of PathFinder is to build a robust network of virtual nodes on top of the physical peers (i.e., the actual physical nodes). Routing among peers is carried out in the virtual network; the actual data transfer still takes place directly between the physical peers. PathFinder builds a random graph of virtual nodes and then distributes them among the actual peers. At least one virtual node is assigned to each peer. From the routing point of view, the data in the network is stored on the virtual nodes. When a peer B is looking for a particular piece of information, it has to find a path from one of its virtual nodes to the virtual node containing the requested data. Then B directly contacts the underlying peer A which is responsible for the targeted virtual node and retrieves the requested data directly from A. This process is described in detail in Section 2.4.


It is known that the degree distribution of a random graph is Poisson. We need two pseudorandom number generators (PRNGs) which, initialized with the same seed, always produce the same deterministic sequence of numbers. Given a number c, the first generator returns Poisson distributed numbers with mean value c. Given a node ID, the second PRNG produces a deterministic sequence of numbers which we use as the IDs of the neighbors of the given node.

The construction principle of PathFinder is as follows. First we fix a number c (see [1] on how to choose c according to the number of peers and how to adapt it once the network becomes too small or too large). Then, for each virtual node, we determine its number of neighbors with the first generator, and the IDs of the nodes to which it should be connected with the second generator, seeded with the ID of the virtual node. The process can be summarized in the following steps:

1. The underlying peer determines how many virtual nodes it should handle (see [1] for details).
2. For every virtual node handled by the peer:
   (a) The peer uses the Poisson number generator to determine the number of neighbors of the current virtual node.
   (b) The peer then draws as many pseudorandom numbers as the count drawn in the previous step.
   (c) The peer selects the virtual nodes with IDs matching those numbers as neighbors of its current virtual node.

The construction mechanism of PathFinder thus allows the peers to build a random graph out of their virtual nodes. It is of crucial importance that a peer only needs a PRNG to perform this operation; there is no need for network communication. Similarly, any peer can determine the neighbors of any virtual node simply by seeding the pseudorandom generator with the corresponding ID. Now we have both a random graph topology suited for exhaustive search and a mechanism for each node to compute the neighbor list of any other node, i.e.,
DHT-like behavior within PathFinder.

Routing table example of PathFinder. Figure 1 shows a small sample of PathFinder with the routing table of the peer with ID 11. The random graph has 5 virtual nodes (1 through 5) and there are 4 peers (with IDs 11 through 14). Peer 11 handles two virtual nodes (4 and 5) and each of the remaining peers handles one virtual node. The arrows between the virtual nodes show the directed neighbor links. Each peer keeps track of its own outgoing links as well as incoming links from other virtual nodes. A peer learns its incoming links when other peers attempt to connect to it. Keeping track of the incoming links is, strictly speaking, not necessary, but it makes key lookups much more efficient (see Section 2.4). The routing table of the peer with ID 11 therefore consists of all outgoing links from its virtual nodes 4 and 5 and the incoming link from virtual node 3.
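The deterministic neighbor computation described in the construction principle can be sketched as follows (our own Python; `random.Random` stands in for the paper's unspecified PRNGs, and the ID-space size and the derivation of the second seed are our illustrative assumptions):

```python
import math
import random

def virtual_neighbors(node_id, c, n_virtual=2**20):
    """Deterministically compute the out-neighbors of a virtual node.
    Any peer evaluating this for the same node_id obtains the same list,
    so no network communication is needed to learn the topology."""
    # PRNG 1: a Poisson(c) out-degree, sampled with Knuth's method
    # (which needs only uniform draws), seeded by the node ID.
    deg_rng = random.Random(node_id)
    threshold, k, p = math.exp(-c), 0, 1.0
    while True:
        p *= deg_rng.random()
        if p <= threshold:
            break
        k += 1
    # PRNG 2: the neighbor IDs; a distinct seed derived from the node ID
    # keeps the two streams separate (our choice, not the paper's).
    id_rng = random.Random((node_id << 1) ^ 1)
    return [id_rng.randrange(n_virtual) for _ in range(k)]

# the same node always yields the same neighbor list
assert virtual_neighbors(42, 20) == virtual_neighbors(42, 20)
```

Because the list depends only on the node ID and c, any peer can reconstruct any node's routing entries offline, which is exactly what the expanding-ring lookup of Section 2.4 relies on.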

Fig. 1. A small example of PathFinder

Fig. 2. Key lookup with local expanding ring search from source and target

Fig. 3. Distribution of complete path length, 5000 key lookups with c = 20 (cumulative lookups vs. path length, for networks of 10k, 1M, and 100M nodes)

2.3 Storing Objects

An object is stored on the virtual node (i.e., on the peer responsible for that virtual node) whose identifier matches the object's identifier. If the hash space is larger than the number of virtual nodes, then we map the object to the virtual node whose identifier matches the prefix of the object hash.

2.4 Key Lookup

Key lookup is the process by which a peer locates and contacts the peer holding a given piece of data. Using the structure of the network, the requesting peer traverses only a single, usually short, path from itself to the target peer. Key lookup is the main function of a DHT. In order to perform quick lookups, the average number of hops between peers, as well as its variance, needs to be kept small. We now show how PathFinder achieves efficient lookups and thus behaves like any other DHT.

Suppose that peer A wants to retrieve an object O. Peer A determines that the virtual node w is responsible for object O by using the hash function described above. Now A has to route in the virtual network from one of its virtual nodes to w, and then retrieve O directly from the peer responsible for w. Denote by V the set of virtual nodes managed by peer A. For each virtual node in V, A calculates the neighbors of those nodes. (Note that this calculation is already done, since these neighbors are the entries in peer A's routing table.) A checks whether any of those neighbors is the virtual node w. If yes, A contacts the underlying peer to retrieve O. If none of peer A's virtual node neighbors is responsible for O, A calculates the neighbors of all of its neighbors, i.e., its second neighbors. Because the neighbors of each virtual node are pre-known (see Section 2.2), this is a simple local computation. Again, peer A checks whether any of the newly calculated neighbors is responsible for O. If yes, peer A sends its request to the virtual node whose neighbor is responsible for O. If still no match is found, peer A expands its search by calculating the neighbors of the nodes from the previous step and checks again. The process continues until a match is found. A may have to calculate many neighbors, but a match is guaranteed.
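The expanding computation of neighbor rings can be illustrated with a small local BFS (a sketch under our own naming; `neighbors(v)` stands for the deterministic neighbor function of Section 2.2, and here we expand from the source side only, without the bidirectional optimization):

```python
from collections import deque

def find_path(source_nodes, target, neighbors, max_depth=10):
    """BFS over *locally computed* neighbor lists: growing the search ring
    is pure computation; only the final hop to the responsible peer needs
    a network message. Returns a virtual-node path or None."""
    parent = {v: None for v in source_nodes}
    frontier = deque(source_nodes)
    for _ in range(max_depth):
        next_frontier = deque()
        while frontier:
            v = frontier.popleft()
            for u in neighbors(v):
                if u in parent:
                    continue
                parent[u] = v
                if u == target:
                    # reconstruct the virtual-node path back to the source
                    path = [u]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                next_frontier.append(u)
        frontier = next_frontier
    return None

# tiny hand-made topology standing in for the computed random graph
graph = {1: [2, 3], 2: [4], 3: [], 4: [5], 5: []}
assert find_path([1], 5, lambda v: graph.get(v, [])) == [1, 2, 4, 5]
```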


Because peer A is able to compute w’s neighboring virtual nodes, A can expand the search rings locally from both the source and target sides, which is called forward and backward chaining. In every step the search depth of the source and target search ring is increased by one. In that way the number of rings around the source are divided between the source itself and the target. This leads to exponential decrease in the number of IDs that have to be computed. We generated various PathFinder networks from 103 up to 108 nodes with average degree 20. In all of them we performed 5000 arbitrary key lookups. It turned out that, expanding rings of depth 3 or 4 (i.e., path length between 6 and 8) is suﬃcient for a successful key lookup, as shown in Figure 3. 2.5

2.5 Searching with Complex Queries

PathFinder supports searching with complex queries with a tunable success rate almost identical to BubbleStorm [5]. In fact, since both PathFinder and BubbleStorm are based on random graphs, we implemented the search mechanism of BubbleStorm directly in PathFinder. In BubbleStorm both data and queries are sent to some number of nodes, where the exact number of messages depends on the desired probability of finding a match. We use exactly the same algorithm in PathFinder for searching and refer the reader to [5] for details.
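The number of nodes needed can be estimated with a birthday-paradox argument; this is an illustrative approximation (two uniform random node sets of sizes d and q miss each other with probability ≈ exp(−dq/n)), not BubbleStorm's exact formula:

```python
import math

def bubble_size(n, p_success):
    # Equal-sized data and query "bubbles": P(intersect) ~ 1 - exp(-s*s/n),
    # so s = sqrt(-n * ln(1 - p_success)) nodes per bubble suffice.
    return math.ceil(math.sqrt(-n * math.log(1.0 - p_success)))

# In a network of 10^6 nodes, a 99% match probability needs bubbles of
# about 2146 nodes each, i.e. O(sqrt(n)) messages per query or data item.
```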

3 Comparison and Analysis

Most DHT overlays provide the same functionality, since they all support the common interface for key-based routing. The main differences between various DHT implementations are average lookup path length, resilience to failures, and load balancing. In this section we compare PathFinder to established DHTs. The lookup path length of Chord is well studied: L_avg = log(N)/2. The maximum path length of Chord is log(N)/log(1+d). The average path length of PathFinder is log(N)/log(c), where c is the average number of neighbors. The path length of the Pastry model can be estimated by log_{2^b}(N) [3], where b is a tunable parameter. The Symphony overlay is based on a small-world graph. This leads to key lookups in O(log^2(N)/k) hops [2]. The variable k refers only to long-distance links; the actual number of neighbors is much higher [2]. The diameter of CAN is (1/2) d N^{1/d}, with a degree of 2d for each node and a fixed d. With large d the distribution of path lengths becomes Gaussian, as in Chord. We use simulations to evaluate the practical effects of the individual factors. We perform 5,000 lookups among random pairs of nodes in a 20,000-node network and measure the number of hops each DHT takes to find the object; Figure 4 shows the results. Figure 5 displays the analytically derived path lengths for growing network sizes. Note that the PathFinder results come from actual simulation, not analytical calculations. PathFinder also inherits the exhaustive search mechanism of BubbleStorm. Hence, as an unstructured overlay it performs identically to BubbleStorm, and we refer the reader to [5] for a thorough comparison with other unstructured systems.
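For concreteness, the path-length formulas above can be evaluated numerically; a quick sketch, where the parameters c, b, and d take illustrative values:

```python
import math

def chord_avg(n):
    # Chord: L_avg = log2(n) / 2
    return math.log2(n) / 2

def pathfinder_avg(n, c=20):
    # PathFinder: log(n) / log(c), c = average number of neighbors
    return math.log(n) / math.log(c)

def pastry_avg(n, b=4):
    # Pastry: log_{2^b}(n)
    return math.log(n, 2 ** b)

def can_diameter(n, d=8):
    # CAN: (1/2) * d * n^(1/d)
    return d * n ** (1.0 / d) / 2

# For n = 10^6: Chord ~ 9.97 hops, PathFinder (c=20) ~ 4.61,
# Pastry (b=4) ~ 4.98, CAN (d=8) ~ 22.5.
```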

82

D. Bradler et al.

[Figure 4 plots cumulative lookups (%) against path length for Pastry, PathFinder (c=20), SkipNet, Chord, and Symphony.]

Fig. 4. Average number of hops for 5,000 key lookups in different DHTs

[Figure 5 plots the number of hops against the number of nodes (10^3 to 10^8) for Chord, Pastry, PathFinder (c=20), PathFinder (c=50), and a de Bruijn graph.]

Fig. 5. Average number of hops for different DHTs measured analytically. Numbers for PathFinder are simulated.

4 Conclusions

In this paper we have presented PathFinder, an overlay which combines efficient exhaustive search and efficient key-value lookups in the same overlay. Combining these two mechanisms in one overlay is very desirable, since it allows an efficient and overhead-free implementation of natural usage patterns. PathFinder is the first overlay to combine exhaustive search and key-value lookups in an efficient manner. Our results show that PathFinder has performance comparable to or better than existing overlays. It scales easily to millions of nodes, and in large networks its key lookup performance is better than that of existing DHTs. Because PathFinder is based on a random graph, we are able to directly benefit from existing search mechanisms (BubbleStorm) for enabling efficient exhaustive search.

References

1. Bradler, D., Krumov, L., Kangasharju, J., Weihe, K., Mühlhäuser, M.: PathFinder: Efficient lookups and efficient search in peer-to-peer networks. Tech. Rep. TUD-CS-2010872, TU Darmstadt (October 2010)
2. Manku, G., Bawa, M., Raghavan, P.: Symphony: Distributed hashing in a small world. In: Proc. 4th USENIX Symposium on Internet Technologies and Systems (2003)
3. Rowstron, A., Druschel, P.: Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In: Liu, H. (ed.) Middleware 2001. LNCS, vol. 2218, p. 329. Springer, Heidelberg (2001)
4. Steinmetz, R., Wehrle, K. (eds.): Peer-to-Peer Systems and Applications. LNCS, vol. 3485. Springer, Heidelberg (2005)
5. Terpstra, W., Kangasharju, J., Leng, C., Buchmann, A.: BubbleStorm: Resilient, probabilistic, and exhaustive peer-to-peer search. In: Proc. SIGCOMM, pp. 49-60 (2007)
6. Yang, Y., Dunlap, R., Rexroad, M., Cooper, B.: Performance of full text search in structured and unstructured peer-to-peer systems. In: Proc. IEEE INFOCOM (2006)
7. Zhao, B., Kubiatowicz, J., Joseph, A.: Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Tech. Rep. UCB/CSD-01-1141, UC Berkeley (2001)

Single-Version STMs Can Be Multi-version Permissive (Extended Abstract)

Hagit Attiya¹,² and Eshcar Hillel¹

¹ Department of Computer Science, Technion
² École Polytechnique Fédérale de Lausanne (EPFL)

Abstract. We present PermiSTM, a single-version STM that satisfies a practical notion of permissiveness, usually associated with keeping many versions: it never aborts read-only transactions, and it aborts other transactions only due to a conflicting transaction (which writes to a common item), thereby avoiding spurious aborts. It avoids unnecessary contention on the memory, being strictly disjoint-access parallel.

1 Introduction

Transactional memory is a leading paradigm for programming concurrent applications for multicores. It is seriously considered as part of software solutions (abbreviated STMs) and as a basis for novel hardware designs, which exploit the parallelism offered by contemporary multicores and multiprocessors. A transaction encapsulates a sequence of operations on a set of data items: it is guaranteed that if a transaction commits, then all its operations appear to be executed atomically. A transaction may abort, in which case none of its operations take effect. The data items written by the transaction are its write set, the data items read by the transaction are its read set; together they form the transaction's data set.

When an executing transaction may violate consistency, the STM can forcibly abort it. Many existing STMs, however, sometimes spuriously abort a transaction, even when the transaction could in fact commit without compromising data consistency [9]. Frequent spurious aborts can waste system resources and significantly impair performance; in particular, they reduce the chances that long transactions, which often only read the data, will complete. Avoiding spurious aborts has been an important goal for STM design, and several conditions have been proposed to evaluate how well it is achieved [8, 9, 12, 16, 20]. A permissive STM [9] never aborts a transaction unless necessary to ensure consistency. A stronger condition, called strong progressiveness [12], further ensures that even when there are conflicts, at least one of the transactions involved in the conflict is not aborted. Alternatively, multi-version (MV-)permissiveness [20] focuses on read-only transactions (whose write set is empty), and ensures that they never abort; update transactions, with non-empty write sets, may abort when in conflict with other transactions writing to the same items.
As its name suggests, MV-permissiveness was meant to be provided by a multi-version STM, maintaining multiple versions of each data item.

This research is supported in part by the Israel Science Foundation (grant number 953/06).

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 83-94, 2011. © Springer-Verlag Berlin Heidelberg 2011


It has been suggested [20] that refraining from aborting read-only transactions mandates the overhead associated with maintaining multiple versions: additional storage, a complex implementation of a precedence graph (to track versions), as well as an intricate garbage collection mechanism to remove old versions. Indeed, MV-permissiveness is satisfied by current multi-version STMs, both practical [20, 21] and more theoretical [16, 19], keeping many versions per item. It can be achieved by other multi-version STMs [3, 22], if enough versions of the items are maintained.

This paper shows it is possible to achieve MV-permissiveness while keeping only a single version of each data item. We present PermiSTM, a single-version STM that is both MV-permissive and strongly progressive, indicating that multiple versions are not the only design choice when seeking to reduce spurious aborts. By maintaining a single version, PermiSTM avoids the high space complexity associated with multi-version STMs, which is often unacceptable in practice. This also eliminates the need for intricate mechanisms for maintaining and garbage collecting old versions.

PermiSTM is lock-based, like many contemporary STMs, e.g., [5, 6, 7, 23]. For each data item, it maintains a single version, as well as a lock and a read counter, counting the number of pending transactions that have read the item. Read-only transactions never abort (without having to declare them as such in advance); update transactions abort only if some data item in their read set is written by another transaction, i.e., at least one of the conflicting transactions commits. Although it is blocking, PermiSTM is deadlock-free, i.e., some transaction can always make progress.

The design choices of PermiSTM offer several benefits, most notably:

– A simple lock-based design makes it easier to argue about correctness.
– Read counters avoid the overhead of incremental validation, thereby improving performance, as demonstrated in [6, 17], especially in read-dominated workloads. Read-only transactions do not require validation at all, while update transactions validate their read sets only once.

– Read counters circumvent the need for a central mechanism, like a global version clock. Thus, PermiSTM is strictly disjoint-access parallel [10], namely, processes executing transactions with disjoint data sets do not access the same base objects.

It has been proved [20, Theorem 2] that a weakly disjoint-access parallel STM [2, 14] cannot be MV-permissive. PermiSTM, satisfying the even stronger property of strict disjoint-access parallelism, shows that this impossibility result depends on a strong progress condition: a transaction delays only due to a pending operation (by another transaction). In PermiSTM, a transaction may delay due to another transaction reading from its write set, even if no operation of the reading transaction is pending.

2 Preliminaries

We briefly describe the transactional memory model [15]. A transaction is a sequence of operations executed by a single process. Each operation either accesses a data item or tries to commit or abort the transaction. Specifically, a read operation specifies the item to read, and returns the value read by the operation; a write operation specifies the item and value to be written; a try-commit operation returns an indication whether

boolean CAS(obj, exp, new) {
  // Atomically
  if obj = exp then
    obj ← new
    return TRUE
  return FALSE
}

boolean kCSS(o[1..k], e[1..k], new) {
  // Atomically
  if o[1] = e[1] and ... and o[k] = e[k] then
    o[1] ← new
    return TRUE
  return FALSE
}

Fig. 1. The CAS and k-compare-single-swap primitives
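The primitives of Figure 1 can be mimicked in software, e.g. in Python, with a single lock standing in for hardware atomicity; a sketch for experimentation, not a real lock-free implementation:

```python
import threading

_atomic = threading.Lock()  # stands in for hardware atomicity

class Cell:
    """A base object holding a single value."""
    def __init__(self, value):
        self.value = value

def cas(obj, exp, new):
    # CAS(obj, exp, new): write new iff obj currently holds exp.
    with _atomic:
        if obj.value == exp:
            obj.value = new
            return True
        return False

def kcss(objs, exps, new):
    # kCSS: compare k base objects, but swap only the first one.
    with _atomic:
        if all(o.value == e for o, e in zip(objs, exps)):
            objs[0].value = new
            return True
        return False
```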

the transaction committed or aborted; an abort operation returns an indication that the transaction is aborted. While trying to commit, a transaction might be aborted, e.g., due to a conflict with another transaction.¹ A transaction is forcibly aborted if an invocation of a try-commit returns an indication that the transaction is aborted. Every transaction begins with a sequence of read and write operations. The last operation of a transaction is either an access operation, in which case the transaction is pending, or a try-commit or an abort operation, in which case the transaction is committed or aborted.

A software implementation of transactional memory (STM) provides data representation for transactions and data items using base objects, and algorithms, specified as primitives on the base objects, which asynchronous processes follow in order to execute the operations of the transactions. An event is a computation step by a process consisting of local computation and the application of a primitive to base objects, followed by a change to the process's state, according to the results of the primitive. We employ the following primitives: READ(o) returns the value in base object o; WRITE(o, v) sets the value of base object o to v; CAS(o, exp, new) writes the value new to base object o if its value is equal to exp, and returns a success or failure indication; kCSS is similar to CAS, but compares the values of k independent base objects (see Figure 1).

2.1 STM Properties

We require the STM to be opaque [11]. Very roughly stated, opacity is similar to requiring strict view serializability applied to all transactions (including aborted ones). Restrictions on spurious aborts are stated by the following two conditions.

Definition 1. A multi-version (MV-)permissive STM [20] forcibly aborts a transaction only if it is an update transaction that has a conflict with another update transaction.

Definition 2.
An STM is strongly progressive [12] if a transaction that has no conflicts cannot be forcibly aborted, and if a set of transactions have conflicts on a single item then not all of them are forcibly aborted.

These two properties are incomparable: strong progressiveness allows a read-only transaction to abort if it has a conflict with another update transaction; on the other hand, MV-permissiveness does not guarantee that at least one transaction is not forcibly aborted in case of a conflict.

¹ Two transactions conflict if they access the same data item; the conflict is nontrivial if at least one of the operations is a write. In the rest of the paper all conflicts are nontrivial conflicts.


Finally, an STM is strictly disjoint-access parallel [10] if two processes executing transactions T1 and T2 access the same base object, at least one of them with a non-trivial primitive, only if the data sets of T1 and T2 intersect.

3 The Design of PermiSTM

The design of PermiSTM is very simple. The first and foremost goal is to ensure that a read-only transaction never aborts, while maintaining only a single version. This suggests that the data returned by a read operation issued by a read-only transaction T should not be overwritten until T completes. A natural way to achieve this goal is to associate a read counter with each item, tracking the number of pending transactions reading from the item. Transactions that write to the data items respect the read counters; an update transaction commits and updates the items in its write set only in a "quiescent" configuration, where no (other) pending transaction is reading an item in its write set. This yields read-only transactions that guarantee consistency without requiring validation and without specifying them as such in advance.

The second goal is to guarantee consistent updates of data items, by using ordinary locks to ensure that only one transaction is modifying a data item at each point. Thus, before writing its changes, at commit time, an update transaction acquires locks.

Having two different mechanisms (locks and counters) in our design requires care in combining them. One question is when, during its execution, a transaction decrements the read counters of the items in its read set. The following simple example demonstrates how a deadlock may happen if an update transaction does not decrement its counters before acquiring locks:

T1: read(a) write(b) try-commit
T2: read(b) write(a) try-commit

T1 and T2 incremented the read counters of a and b, respectively; later, at commit time, T1 acquires a lock on b, while T2 acquires a lock on a. To commit, T1 has to wait for T2 to complete and decrement the read counter of b, while T2 has to wait for the same to happen with T1 and item a. Since an update transaction first decrements read counters, it must ensure consistency by acquiring locks also for items in its read set.
Therefore, an update transaction acquires locks for all items in its data set. Finally, read counters are incremented as the items are encountered during the execution of the transaction.

What happens if read-only transactions wait for locks to be released? The next example demonstrates how this can create a deadlock:

T1: read(a) read(b)
T2: write(b) write(a) try-commit

If T2 acquires a lock on b, then T1 cannot read b until T2 completes; T2 cannot commit as it has to wait for T1 to complete and decrease the read counter of a; MV-permissiveness does not allow both transactions to be forcibly aborted. Thus, read counters get preference over locks, and they can always be incremented. Prior to committing, an update transaction first decrements its read counters, and then acquires locks on all items in its

[Figure 2 shows the two data structures: an item, consisting of a lock ⟨owner, seq⟩, a read counter (rcounter), and a data field; and a transaction descriptor, consisting of a write set (ws) and a read set (rs), whose entries are tuples ⟨item, seq, data⟩, and a status field.]

Fig. 2. Data structures used in the algorithm: an item (left) and a transaction descriptor (right)

data set, in a fixed order (while validating the consistency of its read set); this avoids deadlocks due to blocking cycles, and livelocks due to repeated aborts.

Since committing a transaction and committing its updates are not done atomically, a committed transaction that has not yet completed updating all the items in its write set can yield an inconsistent view for a transaction reading one of these items. If a read operation simply reads the value in the item, it might miss the up-to-date value of the item. Therefore, a read operation is required to read the current value of the item, which can be found either in the item or in the data of the transaction.²

To simplify the exposition of PermiSTM, k-compare-single-swap (kCSS) [18] is applied to commit an update transaction while ensuring that the read counters of the items in its write set are all zero. Section 4 describes how the implementation can be modified to use only CAS; the resulting implementation is (strongly) disjoint-access parallel but not strictly disjoint-access parallel.

Data Structures. Figure 2 presents the data structures of items and transactions' descriptors used in our algorithm. We associate a lock and a read counter with each item, as follows:

– A lock includes an owner field and an unbounded sequence number, seq, that are accessed atomically. The owner field is set to the id of the update transaction owning the lock, and is 0 if no transaction holds the lock. The seq field holds the sequence number of the data; it is incremented whenever new data is committed to the item, and it is used to assert the consistency of reads.

– A simple read counter, rcounter, tracks how many transactions are reading the item.

– The data field holds the value that was last written to the item, or its initial value if no transaction has yet written to the item.

The descriptor of a transaction consists of the read set, rs, the write set, ws, and the status of the transaction.
The read and write sets are collections of data items.

– A data item in the read set includes a reference to an item, the data read from the item, and the sequence number of this data, seq.

² This is analogous to the notion of the current version of a transactional object in DSTM [13].


– A data item in the write set includes a reference to an item, the data to be written to the item, and the sequence number of the new data, seq, i.e., the sequence number of the current data plus 1.

– A status indicates whether the transaction is COMMITTED or ABORTED; initially it is NULL.

The current data and sequence number of an item are defined as follows: If the lock of the item is owned by a committed transaction that writes to this item, then the current data and sequence number of the item appear in the write set of the owner transaction. Otherwise (the owner is 0, or the owner is not committed, or the item is not in the owner's write set), the current data and sequence number appear in the item.

The Algorithm. Next we give a detailed description of the main methods for handling the operations; the code appears in Pseudocodes 1 and 2. The reserved word self in the pseudocode is a self-reference to the descriptor of the transaction whose code is being executed.

read method: If the item is already in the transaction's read set (line 2), return the value from the read set (line 3). Otherwise, increment the read counter of the item (line 5). Then the reading transaction adds the item to its read set (line 7) with the current data and sequence number of the item (line 6).

write method: If the item is not already in the transaction's write set (line 11), add the item to the write set (line 12). Set the data of the item in the transaction's write set to the new data to be written (line 13). No lock is acquired at this stage.

tryCommit method: Decrement all the read counters of the items in the transaction's read set (line 16). If the transaction is read-only, i.e., the write set of the transaction is empty (line 17), then commit (line 18); the transaction completes and returns (line 19).
Otherwise, this is an update transaction and it continues: acquire locks on all items in the data set (line 20); commit the transaction (line 22) and the changes to the items (lines 23-25); release the locks on all items in the data set (line 26). The transaction may abort while acquiring locks due to a conflict with another update transaction (line 21).

acquireLocks method: Acquire locks on all items in the data set of the transaction in their order (line 30). If the item is in the read set (line 33), check that the sequence number in the read set (line 34) is the same as the current sequence number of the item (line 32). If the sequence number has changed (line 35), then the data read has been overwritten by another committed transaction and the transaction aborts (line 36). Use CAS to acquire the lock: set owner from 0 to the descriptor of the transaction; if the item is in the read set, this is done while asserting that seq is unchanged (line 38). If the CAS failed, then owner is non-zero since there is another owner (or seq has changed), so spin, re-reading the lock (line 38) until owner is 0. If the item is in the write set (line 39), set the sequence number of the item in the transaction's write set, seq, to the sequence number of the current data plus 1 (line 41).

commitTx method: Use kCSS to set status to COMMITTED, while ensuring that the read counters of all items in the transaction's write set are 0 (line 47). If the read counter of one of these items is not 0, then a pending transaction is reading from this item; spin until all rcounters are 0.

Pseudocode 1. Methods for read, write and try-commit operations

 1: Data read(Item item) {
 2:   if item in rs then
 3:     di ← rs.get(item)
 4:   else
 5:     incrementReadCounter(item)
 6:     di ← getAbsVal(item)
 7:     rs.add(item, di)
 8:   return di.data
 9: }

10: write(Item item, Data data) {
11:   if item not in ws then
12:     ws.add(item, ⟨item, 0, 0⟩)
13:   ws.set(item, ⟨item, 0, data⟩)
14: }

15: tryCommit() {
16:   decrementReadCounters()               // decrement read counters
17:   if ws is empty then                   // read-only transaction
18:     WRITE(status, COMMITTED)
19:     return
      // update transaction
20:   acquireLocks()                        // lock all the data set
21:   if ABORTED = READ(status) then return
22:   commitTx()                            // commit update transaction
23:   for each item in ws do                // commit the changes to the items
24:     di ← ws.get(item)
25:     WRITE(item.data, di.data)
26:   releaseLocks()                        // release locks on all the data set
27: }

28: acquireLocks() {
29:   ds ← ws.add(rs)                       // items in the data set (read and write sets)
30:   for each item in ds by their order do
31:     do
32:       cur ← getAbsVal(item)             // current value
33:       if item in rs then                // check validity of read set
34:         di ← rs.get(item)               // value read by the transaction
35:         if di.seq != cur.seq then       // the data is overwritten
36:           abort()
37:           return
38:     while !CAS(item.lock, ⟨0, cur.seq⟩, ⟨self, cur.seq⟩)
39:     if item in ws then
40:       di ← ws.get(item)
41:       ws.set(item, ⟨item, cur.seq+1, di.data⟩)
42: }

43: commitTx() {
44:   kCompare[0] ← status                  // the location to be compared and swapped
45:   for i = 1 to k − 1 do                 // k − 1 locations to be compared
46:     kCompare[i] ← ws.get(i).item.rcounter
47:   while !kCSS(kCompare, ⟨NULL, 0 . . . 0⟩, COMMITTED) do
48:     no-op                               // until no reading transaction is pending
49: }


Pseudocode 2. Additional methods for PermiSTM

50: incrementReadCounter(Item item) {
51:   do m ← READ(item.rcounter)
52:   while !CAS(item.rcounter, m, m + 1)
53: }

54: decrementReadCounters() {
55:   for each item in rs do
56:     do m ← READ(item.rcounter)
57:     while !CAS(item.rcounter, m, m − 1)
58: }

59: releaseLocks() {
60:   ds ← ws.add(rs)
61:   for each item in ds do
62:     di ← ds.get(item)
63:     WRITE(item.lock, ⟨0, di.seq⟩)
64: }

65: DataItem getAbsVal(Item item) {
66:   lck ← READ(item.lock)
67:   dt ← READ(item.data)
68:   di ← ⟨item, lck.seq, dt⟩              // values from the item
69:   if lck.owner != 0 then
70:     sts ← READ(lck.owner.status)
71:     if sts = COMMITTED then
72:       if item in lck.owner.ws then
73:         di ← lck.owner.ws.get(item)     // values from the write set of the owner
74:   return di
75: }

76: abort() {
77:   ds ← ws.add(rs)
78:   for each item in ds do
79:     lck ← READ(item.lock)
80:     if lck.owner = self then            // the transaction owns the item
81:       WRITE(item.lock, ⟨0, lck.seq⟩)    // release lock
82:   WRITE(status, ABORTED)
83: }

Properties of PermiSTM. Since PermiSTM is lock-based, it is easier to argue that it preserves opacity than in implementations that do not use locks. Specifically, an update transaction holds locks on all items in its data set before committing, which allows update transactions to be serialized at their commit points. The algorithm ensures that an update transaction does not commit, leaving the value of the items in its write set unchanged, as long as there is a pending transaction reading one of the items to be written. A read operation reads the current value of the item, after incrementing its read counter. So, if an update transaction commits before the read counter is incremented, but the changes are not yet committed to the items, the reading transaction still observes a consistent state, as it reads the value from the write set of the committed transaction, which is the up-to-date value of the item. Hence, a read-only transaction is serialized after the update transaction that writes to one of the read items and is the last to commit. Since transactions do not decrement read counters until commit time, and since all read operations return the up-to-date value of the item, all transactions maintain a consistent view. As this holds for committed as well as aborted transactions, PermiSTM is opaque.


Next we discuss the progress properties of the algorithm. After an update transaction acquires locks on all items in its data set, it may wait for other transactions reading items in its write set to complete; it may even starve due to a continual stream of readers; thus, our STM is blocking. However, the STM guarantees strong progressiveness, as transactions are forcibly aborted only due to another committed transaction with a read-after-write conflict; since read-only transactions are never forcibly aborted, PermiSTM is MV-permissive. Furthermore, read-only transactions are obstruction-free [13]. A read-only transaction may delay due to contention with concurrent transactions updating the same read counters, but once it is running solo it is guaranteed to commit.

Write, try-commit and abort operations only access the descriptor of the transaction and the items in the data set of the transaction; this may result in contention only with non-disjoint transactions. A read operation, in addition to accessing the read counter of the item, also reads the descriptor of the owning transaction, which may result in contention only with non-disjoint transactions; thus, PermiSTM is strictly disjoint-access parallel. Note that disjoint transactions may concurrently read the descriptor of a transaction owning items the transactions read; however, this does not violate strict disjoint-access parallelism. Furthermore, these disjoint transactions read from the same base object only if they all intersect with the owning transaction; this property is called 2-local contention [1] and it implies (strong) disjoint-access parallelism [2].

4 CAS-Based PermiSTM

The kCSS operation can be implemented in software from CAS [18], without sacrificing the properties of PermiSTM. However, this implementation is intricate and incurs a step complexity that can be avoided in our case. This section outlines the modifications of PermiSTM needed to obtain an STM with similar properties using CAS instead of a kCSS primitive; this results in more costly read operations.

We still wish to guarantee that an update transaction commits only in a "quiescent" configuration, in which no other transaction is reading an item in its write set. If the committing update transaction does not use kCSS, then the responsibility of "notifying" the update transaction that it cannot commit is shifted to the read operations, and they pay the extra cost of preventing update transactions from committing in a non-quiescent configuration.

A transaction commits by changing its status from NULL to COMMITTED; a way to prevent an update transaction from committing is to invalidate its status. For this purpose, we attach a sequence number to the transaction status. Prior to committing, an update transaction reads its status, which now includes the sequence number, and repeats the following for each item in its write set: spin on the item's read counter until the read counter becomes zero, then annotate the zero with a reference to its descriptor and the status sequence number. The transaction changes its status to COMMITTED only if the sequence number of its status has not changed since it read it. Once it completes annotating all zero counters, and unless it is notified by some read operation that one of the counters changed and the configuration is no longer "quiescent", the update transaction can commit, using only a CAS.

A read operation basically increases the read counter, and then reads the current value of the item. The only change is when it encounters a "marked" counter. If the


update transaction annotating the item has already committed, the read operation simply increases the counter. Otherwise, the read operation invalidates the status of the update transaction, by increasing its status sequence number. If more than one transaction is reading an item from the write set of the update transaction, at least one of them prevents the update transaction from committing, by changing its status sequence number.

The changes in the data structures used by the algorithm are as follows: The status of a transaction descriptor now includes the state of the transaction (NULL, COMMITTED, or ABORTED), as well as a sequence number, seq, that is used to invalidate the status; these fields are accessed atomically. The read counter, rcounter, of an item is a tuple including a counter of the number of readers, the owner transaction of the item (holding its lock), and a seq matching the status sequence number of the owner.

We reuse the core implementation of the operations from Pseudocodes 1 and 2. The most crucial modification is in the protocol for incrementing the read counter, which invalidates the status of the owner transaction when increasing the item's read counter. Pseudocode 3 presents the main modifications. In order to commit, an update transaction reads the read counter of every item in its write set (lines 87-88), and when the read counter is 0, the update transaction annotates the 0 with its descriptor and status sequence number, using CAS (line 89). Finally, it sets the status to COMMITTED while increasing the status sequence number, using CAS (line 90). If the status was invalidated and the last CAS fails, the transaction re-reads the status (line 86) and goes over the procedure again. A successful CAS implies that the transaction committed while no other transaction was reading any item in its write set.

Pseudocode 3. Methods for avoiding kCSS

84:  commitTx() {
85:    do
86:      sts <- READ(status)
87:      for each item in ws do
88:        do rc <- READ(item.rcounter)                                       // spin until no readers
89:        while !CAS(item.rcounter, <0, rc.owner, rc.seq>, <0, self, sts.seq>)   // annotated 0
90:    while !CAS(status, <NULL, sts.seq>, <COMMITTED, sts.seq+1>)            // commit in a "quiescent" configuration
91:  }
92:  incrementReadCounter(Item item) {
93:    do
94:      rc <- READ(item.rcounter)
95:      if rc.owner != 0 then                                                // the read counter is "marked"
96:        CAS(rc.owner.status, <NULL, rc.seq>, <NULL, rc.seq+1>)             // invalidate status
97:    while !CAS(item.rcounter, rc, <rc.counter+1, rc.owner, rc.seq>)        // increase counter
98:  }
99:  decrementReadCounters() {
100:   for each item in rs do
101:     do rc <- READ(item.rcounter)
102:     while !CAS(item.rcounter, rc, <rc.counter-1, 0, 0>)                  // clean and decrease counter
103: }

Single-Version STMs Can Be Multi-version Permissive


To ensure that an update transaction commits only in a "quiescent" configuration, a read operation that finds the read counter of the item "marked" (lines 94-95) continues as follows: it uses CAS to invalidate the status of the owner transaction by increasing its sequence number (line 96); if the status sequence number has changed, either the owner has committed or its status was already invalidated. Finally, the reader transaction simply increases the read counter using CAS (line 97). If increasing the read counter fails, the reader repeats the procedure. While decreasing the read counters, the reader transaction cleans each read counter by setting its owner and seq fields to 0 (line 102). In addition, methods such as tryCommit and abort are adjusted to handle the new structure, for example, accessing the read counter and state indicator through the new rcounter and status fields. The resulting algorithm is not strictly disjoint-access parallel. Two transactions, T1 and T2, reading items a and b, respectively, may access the same base object when checking and invalidating the status of a third transaction, T3, updating these items. The algorithm, however, has 2-local contention [1] and is (strongly) disjoint-access parallel, as this memory contention is always due to T3, which intersects both T1 and T2.
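This handshake between a committing updater and an invalidating reader can be simulated with a toy CAS cell. The sketch below is illustrative only: the Atomic class and the single-threaded scenario stand in for hardware CAS and concurrent execution, and are not PermiSTM's implementation.

```python
import threading

class Atomic:
    """A CAS-able cell, standing in for a hardware compare-and-swap word."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()
    def read(self):
        return self._value
    def cas(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

# Status is a (state, seq) pair, as in the modified descriptor.
status = Atomic(("NULL", 0))

# The updater reads its status before annotating the read counters.
state, seq = status.read()

# A concurrent reader finds a "marked" counter and invalidates the status
# by bumping the sequence number (cf. line 96).
status.cas(("NULL", seq), ("NULL", seq + 1))

# The updater's final CAS (cf. line 90) now fails: it must retry.
committed = status.cas(("NULL", seq), ("COMMITTED", seq + 1))
```

The reader's successful CAS bumps the sequence number, so the updater's CAS on the expected pair <NULL, seq> fails, forcing it to re-read its status and go over the commit procedure again.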

5 Discussion

This paper presents PermiSTM, a single-version STM that is both MV-permissive and strongly progressive; it is also disjoint-access parallel. PermiSTM has a simple design, based on read counters and locks, that provides consistency without incremental validation; this also simplifies the correctness argument. The first variant of PermiSTM uses a k-compare-single-swap to commit update transactions. No architecture currently provides kCSS in hardware, but it can be supported by best-effort hardware transactional memory (cf. [4]). In PermiSTM, update transactions are not obstruction-free [13], since they may block due to other conflicting transactions. Indeed, a single-version, obstruction-free STM cannot be strictly disjoint-access parallel [10]. Read-only transactions modify the read counters of all items in their read set. This matches the lower bound for read-only transactions that never abort, for (strongly) disjoint-access parallel STMs [2]. Several design principles of PermiSTM are inspired by TLRW [6], which uses read-write locks. TLRW, however, is not permissive, as read-only transactions may abort due to a timeout while attempting to acquire a lock. We avoid this problem by tracking readers through read counters (somewhat similar to SkySTM [17]) instead of read locks. Our algorithm improves on the multi-versioned UP-MV STM [20], which is not weakly disjoint-access parallel (nor strictly disjoint-access parallel), as it uses a global transaction set holding the descriptors of all completed transactions yet to be collected by the garbage-collection mechanism. UP-MV STM requires that operations execute atomically; its progress properties depend on the precise manner in which this atomicity is guaranteed, which is not detailed. We remark that simply enforcing atomicity with a global lock or a mechanism similar to TL2 locking [5] could make the algorithm blocking.

Acknowledgements. We thank the anonymous referees for helpful comments.


References

1. Afek, Y., Merritt, M., Taubenfeld, G., Touitou, D.: Disentangling multi-object operations. In: PODC 1997, pp. 111-120 (1997)
2. Attiya, H., Hillel, E., Milani, A.: Inherent limitations on disjoint-access parallel implementations of transactional memory. In: SPAA 2009, pp. 69-78 (2009)
3. Aydonat, U., Abdelrahman, T.: Serializability of transactions in software transactional memory. In: TRANSACT 2008 (2008)
4. Dice, D., Lev, Y., Marathe, V.J., Moir, M., Nussbaum, D., Olszewski, M.: Simplifying concurrent algorithms by exploiting hardware transactional memory. In: SPAA 2010, pp. 325-334 (2010)
5. Dice, D., Shalev, O., Shavit, N.: Transactional locking II. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 194-208. Springer, Heidelberg (2006)
6. Dice, D., Shavit, N.: TLRW: Return of the read-write lock. In: SPAA 2010, pp. 284-293 (2010)
7. Ennals, R.: Software transactional memory should not be obstruction-free. Technical Report IRC-TR-06-052, Intel Research Cambridge (2006)
8. Gramoli, V., Harmanci, D., Felber, P.: Towards a theory of input acceptance for transactional memories. In: Baker, T.P., Bui, A., Tixeuil, S. (eds.) OPODIS 2008. LNCS, vol. 5401, pp. 527-533. Springer, Heidelberg (2008)
9. Guerraoui, R., Henzinger, T.A., Singh, V.: Permissiveness in transactional memories. In: Taubenfeld, G. (ed.) DISC 2008. LNCS, vol. 5218, pp. 305-319. Springer, Heidelberg (2008)
10. Guerraoui, R., Kapalka, M.: On obstruction-free transactions. In: SPAA 2008, pp. 304-313 (2008)
11. Guerraoui, R., Kapalka, M.: On the correctness of transactional memory. In: PPoPP 2008, pp. 175-184 (2008)
12. Guerraoui, R., Kapalka, M.: The semantics of progress in lock-based transactional memory. In: POPL 2009, pp. 404-415 (2009)
13. Herlihy, M., Luchangco, V., Moir, M., Scherer III, W.N.: Software transactional memory for dynamic-sized data structures. In: PODC 2003, pp. 92-101 (2003)
14. Israeli, A., Rappoport, L.: Disjoint-access-parallel implementations of strong shared memory primitives. In: PODC 1994, pp. 151-160 (1994)
15. Kapalka, M.: Theory of Transactional Memory. PhD thesis, EPFL (2010)
16. Keidar, I., Perelman, D.: On avoiding spare aborts in transactional memory. In: SPAA 2009, pp. 59-68 (2009)
17. Lev, Y., Luchangco, V., Marathe, V.J., Moir, M., Nussbaum, D., Olszewski, M.: Anatomy of a scalable software transactional memory. In: TRANSACT 2009 (2009)
18. Luchangco, V., Moir, M., Shavit, N.: Nonblocking k-compare-single-swap. In: SPAA 2003, pp. 314-323 (2003)
19. Napper, J., Alvisi, L.: Lock-free serializable transactions. Technical Report TR-05-04, The University of Texas at Austin (2005)
20. Perelman, D., Fan, R., Keidar, I.: On maintaining multiple versions in STM. In: PODC 2010, pp. 16-25 (2010)
21. Perelman, D., Keidar, I.: SMV: Selective Multi-Versioning STM. In: TRANSACT 2010 (2010)
22. Riegel, T., Felber, P., Fetzer, C.: A lazy snapshot algorithm with eager validation. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 284-298. Springer, Heidelberg (2006)
23. Saha, B., Adl-Tabatabai, A.-R., Hudson, R.L., Cao Minh, C., Hertzberg, B.: McRT-STM: a high performance software transactional memory system for a multi-core runtime. In: PPoPP 2006, pp. 187-197 (2006)

Correctness of Concurrent Executions of Closed Nested Transactions in Transactional Memory Systems

Sathya Peri¹ and Krishnamurthy Vidyasankar²

¹ Indian Institute of Technology Patna, India
[email protected]
² Memorial University, St John's, Canada
[email protected]

Abstract. A generally agreed-upon requirement for correctness of concurrent executions in Transactional Memory systems is that all transactions, including the aborted ones, read consistent values. Opacity is a recently proposed correctness criterion that satisfies this requirement. Our first contribution in this paper is to extend the opacity definition to closed nested transactions. Secondly, we define conflicts appropriate for the optimistic executions commonly used in Software Transactional Memory systems. Using these conflicts, we define a restricted, conflict-preserving class of opacity for closed nested transactions whose membership can be tested in polynomial time. As our third contribution, we propose a correctness criterion that defines a class of schedules in which aborted transactions do not affect the consistency of the other transactions. We define a conflict-preserving subclass of this class as well. Both the class definitions and the conflict definition are new for nested transactions.

1 Introduction

In recent years, Software Transactional Memory (STM) has garnered significant interest as an elegant alternative for developing concurrent code. Importantly, transactions provide a very promising approach for composing software components. Composing simple transactions into a larger transaction is an extremely useful property which forms the basis of modular programming. This is achieved through nesting: a transaction is nested if it is invoked by another transaction. STM systems ensure that transactions are executed atomically. That is, each transaction is either executed to completion, in which case it is committed and its effects are visible to other transactions, or aborted, in which case the effects of a partial execution, if any, are rolled back. In a closed nested transaction [2], the commit of a sub-transaction is local; its effects are visible only to its parent. When the top-level transaction (of the nested computation) commits, the effects of the sub-transaction become visible to other top-level transactions. The abort of a sub-transaction is also local; the other sub-transactions and the top-level transaction are not affected by its abort.¹ To achieve atomicity, a commonly used approach for software transactions is optimistic synchronisation (a term used in [6]). In this approach, each transaction has local

This work was done when the author was a Post-doctoral Fellow at Memorial University.
¹ Apart from closed nesting, flat and open nesting [2] are the other means of nesting in STMs.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 95–106, 2011. c Springer-Verlag Berlin Heidelberg 2011


buffers where it records the values read and written in the course of its execution. When the transaction completes, the contents of its buffers are validated. If the values in the buffers form a consistent view of the memory, then the transaction is committed and the values are merged into the memory. If the validation fails, the transaction is aborted and the buffer contents are ignored. The notion of buffers extends naturally to closed nested transactions. When a sub-transaction is invoked, new buffers are created for all the data items it accesses. The contents of the buffers are merged with its parent's buffers when the sub-transaction commits. A commonly accepted correctness requirement for concurrent executions in STM systems is that all transactions, including aborted ones, read consistent values. The values resulting from any serial execution of transactions are assumed to be consistent. Then, for each transaction in a concurrent execution, there should exist a serial execution of some of the transactions giving rise to the values read by that transaction. Guerraoui and Kapalka [5] captured this requirement as opacity. An implementation of opacity has been given in [8]. On the other hand, the recent understanding (Doherty et al. [3], Imbs et al. [7]) is that opacity is too strong a correctness criterion for STMs. Weaker notions have been proposed: (i) the requirement of a single equivalent serial schedule is replaced by allowing possibly different equivalent serial schedules for the committed transactions and for each aborted transaction, and these schedules need not be compatible; and (ii) the effects, namely the read steps, of aborted transactions should not affect the consistency of the transactions executed subsequently. The first point refines the consistency notion for aborted transactions. (All the proposals insist on a single equivalent serial schedule consisting of all committed transactions.)
The second point is a desirable property for transactions in general, and a critical one for nested transactions, where the reads of an aborted sub-transaction may prohibit committing the entire top-level transaction. The above proposals in the literature have been made for non-nested transactions. In this paper, we define two notions of correctness and corresponding classes of schedules: Closed Nested Opacity (CNO) and Abort-Shielded Consistency (ASC). In the first notion, read steps of aborted (sub-)transactions are included in the serialization, as in opacity [5, 8]. In the second, they are discarded. These definitions turn out to be non-trivial due to the fact that an aborted sub-transaction may have some (locally) committed descendants and, similarly, some committed ancestors. Checking opacity, like general serializability (for instance, view-serializability), cannot be done efficiently. Very much like restricted classes of serializability allowing a polynomial membership test and facilitating online scheduling, restricted classes of opacity can also be defined. We define such classes along the lines of conflict-serializability for database transactions: Conflict-Preserving Closed Nested Opacity (CP-CNO) and Conflict-Preserving Abort-Shielded Consistency (CP-ASC). Our conflict notion is tailored for optimistic execution of the sub-transactions and is not just between any two conflicting operations. We give an algorithm for checking membership in CP-CNO, which can be easily modified for CP-ASC as well. The algorithm uses serialization graphs similar to those in [12]. Using this algorithm, an online scheduler implementing these classes can be designed.


We note that all online schedulers (implementing 2PL, timestamp, optimistic approaches, etc.) for database transactions allow only subclasses of conflict-serializable schedules. We believe, similarly, that all STM schedulers can only allow subclasses of conflict-preserving schedules satisfying opacity or any of its variants. Such schedulers are likely to use mechanisms simpler than serialization graphs, as in the database area. An example is the scheduler described by Imbs and Raynal [8]. There have been many implementations of nested transactions in the past few years [2, 10, 1, 9]. However, none of them provides precise correctness criteria for closed nested transactions that can be efficiently verified. In [2], the authors provide correctness criteria for open nested transactions which can be extended to closed nested transactions as well. Their correctness criteria also look for a single equivalent serial schedule of both (read-only) aborted transactions and committed transactions.

Roadmap: In Section 2, we describe our model and background. In Section 3, we define CNO and CP-CNO. In Section 4, we present ASC and CP-ASC. Section 5 concludes this paper.

2 Background and System Model

A transaction is a piece of code in execution. In the course of its execution, a nested transaction performs read and write operations on memory and invokes other transactions (also referred to as sub-transactions). A computation of nested transactions constitutes a computation tree. The operations of the computation are classified as simple-memory operations and transactions. Simple-memory operations are reads or writes on memory. In this document, when we refer to a transaction in general, it could be a top-level transaction or a sub-transaction. Collectively, we refer to transactions and simple-memory operations as nodes (of the computation tree) and denote them as nid. If a transaction tX executes successfully to completion, it terminates with a commit operation, denoted cX. Otherwise it aborts, aX. Abort and commit operations are called terminal operations.² By default, all the simple-memory operations are always considered to be (locally) committed. In our model, transactions can interleave at any level. Hence the child sub-transactions of any transaction can execute in an interleaved manner. To perform a write operation on data item x, a closed-nested transaction tP creates an x-buffer (if it is not already present) and writes to the buffer. A buffer is created for every data item tP accesses. When tP commits, it merges the contents of its local buffers with the buffers of its parent. Any peer (sibling) transaction of tP can read the values written by tP only after tP commits. We assume that there exists a hypothetical root transaction of the computation tree, denoted tR, which invokes all the top-level transactions. On system initialization, we assume that there exists a child transaction tinit of tR, which creates and initializes all the buffers of tR that are written or read by any descendant of tR.
Similarly, we also assume that there exists a child transaction tfin of tR, which reads the contents of tR's buffers when the computation terminates.

² A transaction starts with a begin operation. In our model, we assume that the begin operation is superimposed with the first event of the transaction. Hence, we do not explicitly represent it in our schedules.
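The buffer discipline of closed nesting described above (writes go to a local buffer, reads walk up through ancestor buffers, commits merge into the parent) can be sketched as follows; TxNode and its methods are illustrative names, not the paper's notation:

```python
class TxNode:
    """A transaction in the computation tree, with local data-item buffers."""
    def __init__(self, parent=None):
        self.parent = parent
        self.buffers = {}          # data item -> buffered value

    def write(self, item, value):
        # writes go to the local buffer only
        self.buffers[item] = value

    def read(self, item):
        # read from the closest buffer, starting with this transaction
        # and walking up through its ancestors
        node = self
        while node is not None:
            if item in node.buffers:
                return node.buffers[item]
            node = node.parent
        raise KeyError(item)       # cannot happen once t_init seeds t_R

    def commit(self):
        # closed nesting: merge local buffers into the parent's buffers
        if self.parent is not None:
            self.parent.buffers.update(self.buffers)
```

An abort simply discards the node's buffers, so an aborted sub-transaction never becomes visible to its peers, even though its committed children may already have merged into it.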


Coming to reads, a transaction maintains a read set consisting of all its read operations. We assume that for a transaction to read a data item, say x, it has access (unlike for writes) to the buffers of all its ancestors, apart from its own. To read x, a nested sub-transaction tN starts with its local buffers. If they do not contain an x-buffer, tN continues to read the buffers of its ancestors, starting from its parent, until it encounters a transaction that contains an x-buffer. Since tR's buffers have been initialized by tinit, tN will eventually read a value for x. When the transaction commits, its read set is merged with its parent's read set. We will revisit read operations a few subsections later.

2.1 Schedules

A schedule is a totally ordered sequence (in real-time order) of simple-memory operations and terminal operations of transactions in a computation. These operations are referred to as events of the schedule. A schedule is represented by the tuple <evts, nodes, ord>, where evts is the set of all events in the schedule, nodes is the set of all the nodes (transactions and simple-memory operations) present in the computation, and ord is a function that totally orders all the events in the order of execution. Example 1 shows a schedule, S1. In this schedule, the memory operations r2211(x) and w2212(y) belong to the transaction t221. Transactions t22 and t31 are aborted. All the other transactions are committed. It must be noted that t221 and t222 are committed sub-transactions of the aborted transaction t22.

Example 1
S1: r111(z) w112(y) w12(z) c11 r211(b) r2211(x) w2212(y) c221 w212(y) c21 w13(y) c1 r2221(y) w2222(z) c222 a22 w23(z) r311(y) c2 w312(y) a31 r321(z) w322(z) c32 c3

The events of the schedule are the real-time representation of the leaves of the computation tree. The computation tree for schedule S1 is shown in Figure 1. The order of execution of memory operations is from left to right, as shown in the tree.
The dotted edges represent terminal operations. The terminal operations are not part of the computation tree but are represented here for clarity.

Fig. 1. Computation tree for Example 1


For a closed nested transaction, all its write operations are visible to other transactions only after it commits. In S1, w212(y) occurs before w13(y). When t1 commits, it writes w13(y) onto tR's y-buffer. But t2 commits after t1 commits. When t2 commits, it overwrites tR's y-buffer with w212(y). Thus, when transaction t31 performs the read operation r311(y), it reads the value written by w212(y) and not w13(y), even though w13(y) occurs after w212(y). To model the effects of commits clearly, we augment a schedule with extra write operations. For each transaction that commits, we introduce a commit-write operation for each data item x that the transaction writes to, or that one of its children commit-writes. This writes the latest value in the transaction's x-buffer. The commit-writes are added just before the commit operation and represent the merging of the local buffers with the parent's buffers. Using this representation, the schedule for Figure 1 is:

Example 2
S2: r111(z) w112(y) w12(z) w11^112(y) c11 r211(b) r2211(x) w2212(y) w221^2212(y) c221 w212(y) w21^212(y) c21 w13(y) w1^12(z) w1^13(y) c1 r2221(y) w2222(z) w222^2222(z) c222 a22 w23(z) r311(y) w2^21(y) w2^23(z) c2 w312(y) a31 r321(z) w322(z) w32^322(z) c32 w3^32(z) c3

Here wX^Y(x) denotes the commit-write of transaction tX on data item x, carrying the value merged from its child nY. Some examples of commit-writes in S2 are w11^112(y), w21^212(y), w2^23(z), etc. The commit-write w11^112(y) represents t11's write onto t1's y-buffer with the value written by w112. There are no commit-writes for aborted transactions. Hence the writes of aborted transactions are not visible to their peers. Originally, in the computation tree, only the leaf nodes could write. With this augmentation, even non-leaf nodes (i.e., committed transactions) write, through commit-write operations. For the sake of brevity, we do not represent commit-writes in the computation tree. In the rest of this document, we assume that all the schedules we deal with are augmented with commit-writes.
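The augmentation with commit-writes can be sketched as a single pass over a schedule. The event encoding below, a (kind, node, parent, item) tuple with kind "w" for a simple write, "c" for a commit, and "a" for an abort, is illustrative rather than the paper's notation:

```python
# Sketch: insert one commit-write per written item just before each commit,
# and merge the written items upward, mirroring buffer merging.
def augment(schedule):
    written = {}            # tx -> set of items in its local buffers
    out = []
    for ev in schedule:
        kind, tx, parent, item = ev
        if kind == "w":
            written.setdefault(tx, set()).add(item)
        elif kind == "c":
            for it in sorted(written.get(tx, ())):
                out.append(("cw", tx, parent, it))          # commit-write
                written.setdefault(parent, set()).add(it)   # merge into parent
        elif kind == "a":
            # aborted: local buffers are discarded, no commit-writes
            written.pop(tx, None)
        out.append(ev)
    return out
```

On the path t11 -> t1 -> tR of Example 1, a write in t11 produces one commit-write at t11's commit and another at t1's commit, while an aborted transaction contributes none.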
Generalizing the notion of commit-writes to any node of the tree, the commit-write of a simple-memory write is the write itself. It is nil for a read operation and for aborted transactions. Collectively, we refer to simple-memory operations along with commit-write operations as memory operations. With commit-write operations, we extend the definition of an operation, denoted oX, to represent a transaction, a commit-write operation, or a simple-memory operation. It can be seen that a schedule partially orders all the transactions and simple-memory operations in the computation. This partial order is called the schedule-partial-order and is denoted <S. For a transaction tX in S, we define S.tX.first and S.tX.last as the first and last operations of tX. Thus, S.tX.last denotes the terminal operation of tX. For a simple-memory operation mX, S.mX.first = S.mX.last. For two nodes nX, nY in S: (nX <S nY) ≡ (S.ord(S.nX.last) < S.ord(S.nY.first)).
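The schedule-partial-order can be computed directly from event positions. In this sketch, events are encoded as (node, op) pairs for illustration; the names are not the paper's notation:

```python
# nX precedes nY in the schedule-partial-order iff the last event of nX
# occurs before the first event of nY.
def precedes(schedule, nX, nY):
    last_X = max(i for i, (n, _) in enumerate(schedule) if n == nX)
    first_Y = min(i for i, (n, _) in enumerate(schedule) if n == nY)
    return last_X < first_Y
```

Two interleaved nodes satisfy neither precedes(a, b) nor precedes(b, a), which is why the relation is only a partial order.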

2.2 Function Definitions

For a commit-write operation wX, we define its holder, S.holder(wX), as the transaction tX to which it belongs. Extending this function to a node (a transaction or simple-memory operation), the holder of a node is itself. For any operation oX, we define S.level(oX) as the distance of S.holder(oX) in the tree from the root. From this definition, tR is at level 0. The level of a transaction and all its commit-write operations are the same. For instance, in Example 2, S2.level(w21^212(y)) = S2.level(t21) = 2.


The functions on a tree, namely parent, children, ancestor, descendant, and peer (sibling), can be extended to commit-write operations by defining them for S.holder(oX) over the tree. For instance, in S2 of Example 2, S2.parent(w2^21(y)) = tR and S2.children(w2^21(y)) = {t21, t22, w23(z)}. Thus, transactions and simple-memory operations are children of a transaction. Two commit-writes of the same node are not peers of each other, since they have the same holder. For a transaction tX in a computation, we define its dSet, denoted S.dSet(tX), as the set consisting of tX, tX's commit-writes, tX's begin and terminal operations, and the dSets of tX's descendants (including its children). This set comprises all the operations in the sub-tree of tX. A simple-memory operation's dSet is itself. A commit-write's dSet is its holder transaction's dSet. In Example 2, S2.dSet(t2) = S2.dSet(w2^23(z)) = {t2, t21, t22, w23(z), r211(b), w212(y), w21^212(y), c21, t221, r2211(x), w2212(y), w221^2212(y), c221, t222, r2221(y), w2222(z), w222^2222(z), c222, a22, w2^21(y), w2^23(z), c2}. Next, we define a boolean function optVis on two operations oX, oY in a schedule S, denoted S.optVis(oY, oX). It is true if oY is a peer of oX or a peer of an ancestor of oX, i.e., oY ∈ (S.peers(oX) ∪ S.peers(S.ansc(oX))); otherwise it is false. This definition implies that if oX ∈ S.dSet(oY), then S.optVis(oY, oX) is false. As a result, for any commit-write wY of oY, S.optVis(wY, oX) is false as well. One can see that the optVis function is not symmetric (though not asymmetric either); hence S.optVis(oY, oX) does not imply S.optVis(oX, oY). In S2, S2.optVis(w1^12(z), r211(b)) is true, as w1^12(z) is a peer of t2, which is an ancestor of r211(b). Similarly, S2.optVis(t3, r2221(y)) is true, because t3 is a peer of t2, which is an ancestor of r2221(y). But S2.optVis(r2221(y), t3) is false.
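The optVis test can be encoded directly over an explicit parent map. The map below is a hypothetical fragment of Example 1's tree; the example shows the asymmetry of the relation:

```python
# parent map for part of the computation tree of Example 1 (illustrative)
parent = {
    "t1": "tR", "t2": "tR", "t3": "tR",
    "t21": "t2", "t22": "t2",
    "t221": "t22", "t222": "t22",
    "r2211": "t221", "r2221": "t222",
}

def ancestors(n):
    out = []
    while n in parent:
        n = parent[n]
        out.append(n)
    return out

def peers(n):
    p = parent.get(n)
    return {m for m, q in parent.items() if q == p and m != n}

def opt_vis(oY, oX):
    # oY is optVis to oX if oY is a peer of oX or of one of oX's ancestors
    return any(oY in peers(a) for a in [oX] + ancestors(oX))
```

For instance, t3 is optVis to r2221 because t3 is a peer of the ancestor t2, while r2221 is not optVis to t3.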
We denote by S.schOps(tX) the set of operations in S.dSet(tX) that are also present in S.evts. Formally, S.schOps(tX) = (S.dSet(tX) ∩ S.evts). We define a few notations based on aborted transactions in a schedule S. For a transaction tX, we denote by S.abort(tX) the set of all aborted transactions in tX's dSet; it includes tX as well, if tX is aborted. We define S.prune(tX) as all the events in the schOps of tX after removing the events of all aborted transactions in tX. Formally, S.prune(tX) = S.schOps(tX) − ∪_{tA ∈ S.abort(tX)} S.schOps(tA). If tX has no aborted transaction in its dSet, then S.prune(tX) is the same as S.schOps(tX). If tX itself is an aborted transaction, then its pruned set is nil.

2.3 Writes for Read Operations and Well-Formedness

For a read operation rX(z) belonging to a transaction tP in S, we associate a write wY(z) as its lastWrite³, or S.lastWrite(rX(z)). The read operation retrieves the value written by the lastWrite. We want the lastWrite wY(z) to satisfy the following properties: (1) wY occurs before rX in the schedule; (2) wY is optVis to rX; (3) the value written by wY is in the z-buffer of the ancestor (starting from its parent tP) closest to rX in terms of level; and (4) if multiple writes satisfy the above conditions, then wY is the one closest to rX in the schedule S. The lastWrite definition ensures that all transactions read values only from committed nodes, i.e., a committed transaction or a simple-memory write operation. Having the lastWrite be

³ This term is inspired by [2].


optVis to the read operation ensures that the buffer the lastWrite writes to is accessible by the read operation. In S2, the lastWrites are: (r111(z) : winit(z)), (r211(b) : winit(b)), (r2211(x) : winit(x)), (r2221(y) : w221^2212(y)), (r311(y) : w1^13(y)), (r321(z) : w2^23(z)). Note that the read r2221(y) reads from w221^2212(y) even though w1^13(y) is closer to r2221(y) in the schedule. This is because w221^2212(y) is closer to it in terms of level. For a node nP with a read operation rX in its dSet, the read is said to be an external-read if its lastWrite is not in nP's dSet. Thus, a read operation rX is an external-read of itself. It can be seen that a nested transaction interacts with its peers through external-reads and commit-writes. Thus, a nested transaction can be treated as a non-nested transaction consisting only of its external-reads and commit-writes. The external-reads and commit-writes of a transaction constitute its extOpsSet. A schedule is called well-formed if it satisfies: (1) validity of transaction limits: after a transaction executes a terminal operation, no operation (memory or terminal) belonging to it can execute; and (2) validity of read operations: every read operation reads the value written by its lastWrite operation. We assume that all the schedules we deal with are well-formed.

2.4 Serial Schedules

In the case of non-nested transactions, a serial schedule is a schedule in which all the transactions execute serially (as the name suggests), without any interleaving. For nested transactions, we define a serial schedule SS as one in which, for every transaction tX in SS, its children (both transactions and simple-memory operations) are totally ordered. Formally, ∀tX ∈ SS.trans, ∀{nY, nZ} ⊆ SS.children(tX): (SS.ord(nY.last) < SS.ord(nZ.first)) ∨ (SS.ord(nZ.last) < SS.ord(nY.first)). Thus, in a serial schedule, all the events in the dSet of a transaction appear contiguously.
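The contiguity condition can be checked in a single pass. As a simplification of dSets, each event in this sketch is tagged with the child node it belongs to; the encoding is illustrative:

```python
# A schedule is serial (at one level of the tree) if each child's events
# appear contiguously: once we have moved past a node, it must not reappear.
def is_serial(schedule, owner):
    """schedule: list of events; owner: event -> child node it belongs to."""
    seen_closed = set()
    current = None
    for e in schedule:
        node = owner[e]
        if node != current:
            if node in seen_closed:
                return False          # node's events are interleaved
            if current is not None:
                seen_closed.add(current)
            current = node
    return True
```

Applying the check recursively at every transaction of the tree yields the serial-schedule condition for nested transactions.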

3 Conflict-Preserving Closed Nested Opacity

3.1 Closed Nested Opacity

Guerraoui and Kapalka [5] proposed the notion of opacity as a correctness criterion for software transactions. A schedule, consisting of an execution of transactions, is said to be opaque if there is an equivalent serial schedule that respects the original schedule's real-time ordering of the nodes, and in which the lastWrite of every read operation, including the reads of aborted transactions, is the same as in the original schedule. Opacity ensures that all the reads are consistent. An implementation of opacity for non-nested transactions is given in [8], in which aborted transactions are treated as read-only (with the read steps executed before the abort) when looking for an equivalent serial schedule consisting of all the transactions. In the context of nested transactions, an aborted transaction can have a committed sub-transaction whose values are read by other sub-transactions. For instance, in S2, the aborted transaction t22's sub-transactions t221 and t222 are committed. The read operation r2221(y) of t222 reads from t221. This shows that some writes of aborted (sub-)transactions should also be considered for the correctness of other sub-transactions. On


the other hand, a committed transaction can have aborted sub-transactions whose write values should be omitted. In our characterization of schedules, aborted transactions do not have commit-writes. Thus, an aborted transaction's writes do not affect any of its peers or ancestors. But committed sub-transactions of an aborted transaction can have commit-writes, and other sub-transactions can read from them. Thus, using our representation, opacity can be extended to closed nested transactions. Formally, we define a class of schedules called Closed Nested Opacity, or CNO, as follows. A schedule S belongs to CNO if there exists a serial schedule SS such that: (1) event equivalence: the operations of S and SS are the same; (2) schedule-partial-order equivalence: for any two nodes nY, nZ that are peers in the computation tree represented by S, if nY occurs before nZ in S, then nY occurs before nZ in SS as well; (3) lastWrite equivalence: the lastWrites of all read operations in S and SS are the same. Even though the definition of CNO is similar to opacity, the lastWrite-equivalence condition captures the intricacies of nested transactions. This class ensures that all the transactions, including all the sub-transactions of aborted transactions, read consistent values.

3.2 Conflict Notion: optConf

Checking opacity, like general serializability (for instance, view-serializability), cannot be done efficiently. Restricted classes of serializability (like conflict-serializability) have been defined based on conflicts, which allow a polynomial-time membership test and facilitate online scheduling. Along the same lines, we define a subclass of CNO, CP-CNO. This subclass is defined based on a new conflict notion, optConf, for closed nested transactions. It is tailored for optimistic execution of sub-transactions. This notion is similar to the idea of conflicts presented in [4] for non-nested transactions.
The conflict notion optConf is defined only between memory operations in the extOpsSets (defined in Subsection 2.3) of two peer nodes. As explained earlier, a node (or transaction) interacts with its peer nodes through its extOpsSet. Consider two peer nodes nA, nB. For two memory operations mX, mY on the same data buffer in the extOpsSets of nA, nB, S.isOptConf(mX, mY) is true if mX occurs before mY in S and one of the following conditions holds: (1) r-w conflict: mX is an external-read rX of nA and mY is a commit-write wY of nB, or (2) w-r conflict: mX is a commit-write wX of nA and mY is an external-read rY of nB, or (3) w-w conflict: mX is a commit-write wX of nA and mY is a commit-write wY of nB. Consider a read rX that is in optConf with a write wY, and let rX's lastWrite be wL. By defining the conflicts in this manner we ensure that wL is in w-r conflict with rX, and if wY is also in w-r conflict with rX, then the w-w conflict between wL and wY ensures that wY does not become rX's lastWrite in any optConf-equivalent serial schedule. Similarly, if wY is in r-w conflict with rX then it cannot become rX's lastWrite in the equivalent serial schedule. For S2 in Example 2, we get the set of conflicts as: (r_111(z), w_12(z)), (r_111(z), w_2^23(z)), (w_11^112(y), w_13(y)), (w_1^12(z), w_2^23(z)), (w_1^13(y), w_221^2212(y)), (w_221^2212(y), r_2221(y)), (w_1^13(y), r_311(y)), (r_311(y), w_2^21(y)), (r_311(y), w_312(y)), (w_1^12(z), r_321(z)), (w_2^23(z), r_321(z)), (r_321(z), w_322(z)). It must be noted that there is no optConf between w_1^13(y)

Correctness of Concurrent Executions of Closed Nested Transactions


and r_2221(y), or between w_21^212(y) and r_2221(y), even though w_1^13(y) and w_21^212(y) are optVis to r_2221(y). This is because the level of w_221^2212(y) (which is the lastWrite of r_2221(y)) is greater than that of w_1^13(y) and w_21^212(y). Hence r_2221(y) is not an external-read of any peer of w_1^13(y) or w_21^212(y). Using optConf, we define a class of schedules called Conflict-Preserving Closed Nested Opacity or CP-CNO. It differs from CNO in condition (3) of Subsection 3.1: the lastWrite equivalence is replaced by optConf Implication: if two memory operations are in optConf in S then they are also in optConf in SS. Since optConf implication subsumes lastWrite equivalence, we have:

Theorem 1. If a schedule S is in the class CP-CNO then it is also in CNO.

Benefits of optConf: Traditionally, two memory operations are said to be in conflict if one of them is a (simple) write operation. In STM systems that employ optimistic synchronization, a write of a transaction becomes visible only after the transaction has committed. In this case, for conflicts to be meaningful, two memory operations are said to be in conflict if one of them is a commit-write operation (and not a simple-write). Refining the conflict notion further, we define optConf only between an external-read and a commit-write operation (as well as between two commit-write operations). By defining optConf this way, the class CP-CNO is as unrestrictive as possible and yet does not compromise any desired property.

3.3 Membership Verification Algorithm

We now describe the algorithm for testing membership in the class CP-CNO in polynomial time. Our algorithm is based on the graph construction algorithm of Resende and El Abbadi [12], adapted to optConf. For a schedule S, the algorithm constructs a conflict graph based on optConfs, denoted S.optGraph, and checks the acyclicity of that graph. We call this the optGraphCons algorithm. The graph S.optGraph is constructed as follows: (1) Vertices: the graph comprises all the nodes in the computation tree. The vertex for a node nX is denoted vX. (2) Edges: consider each transaction tX starting from tR. For each pair of children nP, nQ (other than t_init and t_fin) in S.children(tX) we add an edge from vP to vQ as follows: (2.1) Completion edges: if nP <S nQ. (2.2) Conflict edges: for any two memory operations mY, mZ such that mY is in nP's dSet and mZ is in nQ's dSet, if S.isOptConf(mY, mZ) is true. Since the positions of the transactions t_init and t_fin are fixed in the tree and in any schedule, we do not consider them in the graph construction. This gives Theorem 2.
Theorem 2. For a schedule S, the graph S.optGraph is acyclic if and only if S is in CP-CNO.

It must be noted that in our construction all edges are between vertices corresponding to peer nodes. There are no edges between vertices that correspond to nodes of different levels. Thus the constructed graph consists of disjoint subgraphs. If the graph is acyclic, then an equivalent serial schedule can be constructed by performing a topological sort on each of the subgraphs [11]. Using this algorithm, it can be verified that S2 is not in CP-CNO. Further, S2 is also not in CNO.
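The conflict-edge construction and acyclicity test at the heart of optGraphCons can be sketched as follows. The Op encoding, the peer indices, and the fixed-size adjacency matrix are our illustrative simplifications, not the authors' implementation; completion edges are omitted for brevity, and only conflict edges are built.

```c
#include <assert.h>
#include <stdbool.h>

#define MAXV 8

typedef enum { EXT_READ, COMMIT_WRITE } OpKind;
/* An operation in the extOpsSet of peer node `peer`, at schedule
 * position `pos`, on buffer `var`. */
typedef struct { OpKind kind; char var; int pos; int peer; } Op;

/* S.isOptConf(mX, mY): mX precedes mY on the same buffer and the pair
 * forms an r-w, w-r or w-w conflict (at least one commit-write; two
 * external-reads never conflict). */
static bool is_opt_conf(const Op *mx, const Op *my) {
    if (mx->var != my->var || mx->pos >= my->pos) return false;
    return mx->kind == COMMIT_WRITE || my->kind == COMMIT_WRITE;
}

static bool edge[MAXV][MAXV];
static int color[MAXV];                  /* 0 white, 1 grey, 2 black */

static bool dfs_cycle(int u, int n) {
    color[u] = 1;
    for (int v = 0; v < n; v++)
        if (edge[u][v] && (color[v] == 1 ||
                           (color[v] == 0 && dfs_cycle(v, n))))
            return true;                 /* grey successor: back edge */
    color[u] = 2;
    return false;
}

/* Build conflict edges between operations of distinct peers and test
 * whether this component of the optGraph is acyclic. */
bool peers_serializable(const Op ops[], int nops, int npeers) {
    for (int i = 0; i < npeers; i++)
        for (int j = 0; j < npeers; j++) edge[i][j] = false;
    for (int i = 0; i < nops; i++)
        for (int j = 0; j < nops; j++)
            if (ops[i].peer != ops[j].peer && is_opt_conf(&ops[i], &ops[j]))
                edge[ops[i].peer][ops[j].peer] = true;
    for (int i = 0; i < npeers; i++) color[i] = 0;
    for (int i = 0; i < npeers; i++)
        if (color[i] == 0 && dfs_cycle(i, npeers)) return false;
    return true;
}
```

An acyclic component admits a topological order, which yields the serial order of the peers, matching the theorem's statement.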



4 Abort-Shielded Consistency

Shortcoming of CNO: A single serial schedule involving all transactions, as required in CNO (and opacity), allows the reads of an aborted transaction to affect the transactions that follow it. This effect is more pronounced in nested transactions. For instance in S2, transactions t1 and t2 write to the variables y and z. The aborted sub-transaction t31 reads y from t1. The sub-transaction t32 reads z from t2. As a result, there is no equivalent serial schedule having the same lastWrites as in S2, and hence S2 is not in CNO. For that matter, any sub-transaction of t3 invoked after t31's invocation (such as t33, t34, etc.) that reads any variable written by t2 that has also been written by t1 will cause this schedule to be not opaque. In the worst case, all the sub-transactions of t3 invoked after t31 may satisfy this property, and a scheduler (implementing CNO) will abort all of them. This effectively aborts t3. This shows that with CNO, an aborted sub-transaction can cause its top-level transaction to abort. This can be avoided if the read operations of the aborted transactions are ignored, as described below.

4.1 ASC Class Definition

Let tA be an aborted transaction in a schedule S. If tA should not affect the transactions following it, then tA should be dropped while considering the correctness of the remaining transactions. Generalizing this idea to all aborted transactions, we construct a sub-schedule consisting of events only from committed transactions (and committed sub-transactions none of whose ancestors have aborted). Thus, the sub-schedule consists of all the events from S.prune(tR) (prune is defined in Subsection 2.2) and is denoted commitSubSchR. The ordering of the events is the same as in the original schedule. We check for the correctness of commitSubSchR.
The sub-schedule commitSubSchR for S2 is: r_111(z) w_112(y) w_12(z) w_11^112(y) c_11 r_211(b) w_212(y) w_21^212(y) c_21 w_13(y) w_1^12(z) w_1^13(y) c_1 w_23(z) w_2^21(y) w_2^23(z) c_2 r_321(z) w_322(z) w_32^322(z) c_32 w_3^322(z) c_3. As explained in [5], it is necessary that an aborted transaction tA also reads consistent values. To ensure this, we construct another sub-schedule of S, denoted pprefSubSchA (pruned prefix sub-schedule), for tA. We consider the prefix of all the events until tA's abort operation. From this prefix we construct the sub-schedule by removing (1) events from transactions that aborted earlier and (2) events from any aborted sub-transaction of tA. Thus, the sub-schedule consists of events from transactions that committed before tA, events from pruned sub-transactions of tA, and events from live transactions (i.e., transactions that have not yet terminated) that executed until the abort of tA. The ordering among the events is the same as in the original schedule S. Finally, for each live transaction we add a commit operation after tA's abort operation to the sub-schedule. But we do not add the commit-writes for these transactions. Then we look for the correctness of this sub-schedule. In S2, for the aborted transaction t31, pprefSubSch31 is: r_111(z) w_112(y) w_12(z) w_11^112(y) c_11 r_211(b) w_212(y) w_21^212(y) c_21 w_13(y) w_1^12(z) w_1^13(y) c_1 w_23(z) r_311(y) w_2^21(y) w_2^23(z) c_2 w_312(y) a_31 c_3. Similarly, the sub-schedule for every aborted transaction can be constructed. Here all the sub-schedules have events from at most one aborted transaction. One can see that the sub-schedules commitSubSchR and pprefSubSchA for every aborted transaction tA have the property that if any event is in the sub-schedule, then any other



event that is relevant to it is also in the sub-schedule. We call this property causality completeness. Hence the lastWrite of any read operation in a sub-schedule is the same as its lastWrite in the original schedule S. It can also be seen that the events of these sub-schedules form a valid sub-tree of the original computation tree represented by S. We verify the correctness of each of these sub-schedules by looking for an equivalent serial sub-schedule that has the same lastWrite for every read operation. Based on these sub-schedules, Abort-Shielded Consistency or ASC is defined. A schedule S belongs to the class ASC if there exists a set of sub-schedules of S, denoted subSchSet, such that the sub-schedules commitSubSchR and pprefSubSchA, for every aborted transaction tA in S, are in subSchSet, and for every sub-schedule subS in subSchSet there exists a serial sub-schedule ssubS such that: (1) Sub-Schedule Event Equivalence: the operations of subS and ssubS are the same. (2) schedule-partial-order Equivalence: for any two peer nodes nY, nZ in the computation tree represented by subS, if nY occurs before nZ in subS then nY occurs before nZ in ssubS as well. (3) lastWrite Equivalence: for all the read operations in ssubS, the lastWrites are the same as in subS. From this definition we get that CNO is a subset of ASC. The schedule S2 is in ASC. Using optConfs with pprefSubSch, we define a class of schedules, Conflict-Preserving Abort-Shielded Consistency or CP-ASC. It differs from the definition of the class ASC only in condition (3), which becomes optConf Implication: if two memory operations are in optConf in subS then they are also in optConf in ssubS. Using the optGraphCons algorithm we can verify whether there exists an equivalent serial sub-schedule for each sub-schedule in subSchSet. Thus, checking whether a schedule is in CP-ASC can be done in polynomial time [11]. Further, it can also be proved that the class CP-CNO is a subset of CP-ASC.
The schedule S2 is in CP-ASC. Using the optGraphCons algorithm, an elegant online scheduler implementing CP-ASC can be designed [11]. The scheduler can be implemented in a completely distributed manner. The serialization graph has separate components for each (parent) sub-transaction. Each component can be maintained at a different site (the process executing the sub-transaction) autonomously, and the checking can be done in a distributed manner.
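The pruning that underlies commitSubSch_R can be sketched as an ancestor filter: an event survives iff no ancestor of its issuing (sub-)transaction aborted. Naming sub-transactions by their tree path (so "221" is a child of "22") is a hypothetical encoding chosen only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True iff txn itself, or a proper ancestor of it, is in the aborted
 * set. With the path encoding, ancestry is a string-prefix test. */
static bool has_aborted_ancestor(const char *txn,
                                 const char *aborted[], int na) {
    for (int i = 0; i < na; i++) {
        size_t len = strlen(aborted[i]);
        if (strncmp(txn, aborted[i], len) == 0) return true;
    }
    return false;
}

/* Keeps schedule order; copies surviving events into out[] and returns
 * their count (events are named by their issuing transaction here). */
int prune_schedule(const char *events[], int ne,
                   const char *aborted[], int na, const char *out[]) {
    int kept = 0;
    for (int i = 0; i < ne; i++)
        if (!has_aborted_ancestor(events[i], aborted, na))
            out[kept++] = events[i];
    return kept;
}
```

For pprefSubSch_A the same filter would run on the prefix up to tA's abort, with tA's own aborted sub-transactions added to the aborted set.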

5 Conclusion

Concurrent executions of transactions in Transactional Memory are expected to ensure that aborted transactions, like the committed ones, read consistent values. In addition, it is desirable that the aborted transactions do not affect the consistency of the other transactions. Incorporating these simple-sounding criteria has been non-trivial even for non-nested transactions, as can be seen in recent publications [5, 8, 3]. In this paper, we have considered these requirements for closed nested transactions. We have also defined new conflict-preserving classes that allow a polynomial-time membership test, by means of constructing conflict graphs and checking acyclicity. Further, a completely distributed STM scheduler can be designed using these conflict-preserving classes. Our future work includes the study of how the above two properties manifest in executions with open nested transactions and with non-transactional steps.



References

[1] Agrawal, K., Fineman, J.T., Sukha, J.: Nested parallelism in transactional memory. In: PPoPP 2008: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 163–174. ACM, New York (2008)
[2] Agrawal, K., Leiserson, C.E., Sukha, J.: Memory models for open-nested transactions. In: MSPC 2006: Proceedings of the 2006 Workshop on Memory System Performance and Correctness, pp. 70–81. ACM, New York (2006)
[3] Doherty, S., Groves, L., Luchangco, V., Moir, M.: Towards formally specifying and verifying transactional memory. In: REFINE (2009)
[4] Guerraoui, R., Henzinger, T., Singh, V.: Permissiveness in transactional memories. In: Taubenfeld, G. (ed.) DISC 2008. LNCS, vol. 5218, pp. 305–319. Springer, Heidelberg (2008)
[5] Guerraoui, R., Kapalka, M.: On the correctness of transactional memory. In: PPoPP 2008: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 175–184. ACM, New York (2008)
[6] Harris, T., Marlow, S., Peyton-Jones, S., Herlihy, M.: Composable memory transactions. In: PPoPP 2005: Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 48–60. ACM, New York (2005)
[7] Imbs, D., de Mendivil, J.R., Raynal, M.: Brief announcement: virtual world consistency: a new condition for STM systems. In: PODC 2009: Proceedings of the 28th ACM Symposium on Principles of Distributed Computing, pp. 280–281. ACM, New York (2009)
[8] Imbs, D., Raynal, M.: A lock-based STM protocol that satisfies opacity and progressiveness. In: Baker, T.P., Bui, A., Tixeuil, S. (eds.) OPODIS 2008. LNCS, vol. 5401, pp. 226–245. Springer, Heidelberg (2008)
[9] Moss, J.E.B.: Open Nested Transactions: Semantics and Support. In: Workshop on Memory Performance Issues (2006)
[10] Ni, Y., Menon, V.S., Adl-Tabatabai, A.-R., Hosking, A.L., Hudson, R.L., Moss, J.E.B., Saha, B., Shpeisman, T.: Open nesting in software transactional memory. In: PPoPP 2007: Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 68–78. ACM, New York (2007)
[11] Peri, S., Vidyasankar, K.: Correctness criteria for closed nested transactions (in preparation). Technical report, Memorial University of Newfoundland (2010)
[12] Resende, R.F., El Abbadi, A.: On the serializability theorem for nested transactions. Inf. Process. Lett. 50(4), 177–183 (1994)

Locality-Conscious Lock-Free Linked Lists Anastasia Braginsky and Erez Petrank Dept. of Computer Science, Technion - Israel Institute of Technology {anastas,erez}@cs.technion.ac.il

Abstract. We extend state-of-the-art lock-free linked lists by building linked lists with special care for locality of traversals. These linked lists are built of sequences of entries that reside on consecutive chunks of memory. When traversing such lists, subsequent entries typically reside on the same chunk and are thus close to each other, e.g., in the same cache line or on the same virtual memory page. Such cache-conscious implementations of linked lists are frequently used in practice, but making them lock-free requires care. The basic component of this construction is a chunk of entries in the list that maintains a minimum and a maximum number of entries. This basic chunk component is an interesting tool on its own and may be used to build other lock-free data structures as well.

1 Introduction

Lock-free (a.k.a. non-blocking) data structures provide a progress guarantee: if several threads attempt to concurrently apply an operation to the structure, it is guaranteed that one of the threads will make progress in finite time [7]. Many lock-free data structures have been developed since the original notion was presented [11]. Lock-free algorithms are error-prone, and modifying existing algorithms requires care. In this paper we study lock-free linked lists and propose a design for a cache-conscious linked list. The first design of lock-free linked lists was presented by Valois [12]. He maintained auxiliary nodes in between the list's normal nodes in order to resolve interference problems between concurrent operations. A different lock-free implementation of linked lists was given by Harris [6]. His main idea was to mark a node before deleting it, in order to prevent concurrent operations from changing its next-entry pointer. Harris' algorithm is simpler than Valois's, and it generally also performs better in his experiments. Michael [8,10] proposed an extension to Harris' algorithm that does not assume garbage collection but reclaims entries of the list explicitly. To this end, he developed an underlying mechanism of hazard pointers that was later used for explicit reclamation in other data structures as well. An improvement in complexity was achieved by Fomitchev and Ruppert [3]. They use a smart retreat upon CAS failure, rather than the standard restart from scratch. In this paper we further extend Michael's design to allow cache-conscious linked lists. Our implementation partitions the linked list into sub-lists that

Supported by THE ISRAEL SCIENCE FOUNDATION (grant No. 845/06).

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 107–118, 2011. © Springer-Verlag Berlin Heidelberg 2011


A. Braginsky and E. Petrank

reside on consecutive areas in memory, denoted chunks. Each chunk contains several consecutive list entries. For example, setting each chunk to be one virtual page causes list traversals to form a page-oriented memory access pattern. This partition of the list into sub-lists, each residing on a small chunk of memory, is often used in practice (e.g., [1,5]), but there has been no lock-free implementation of such a list. Breaking the list into chunks would be trivial if there were no restriction on the chunk size. In particular, if the size of each chunk may decrease to a single element, then each chunk can trivially reside in a single memory block and Michael's implementation will do, but no locality improvement will be obtained for list traversals. The sub-list chunk that our design provides maintains upper and lower bounds on the number of elements it holds. The upper bound simply follows from the size of the memory block on which the chunk is located, and the lower bound is provided by the user. If a chunk grows too much and can no longer be held in one memory block, it is split (in a lock-free manner) into two chunks, each residing at a separate location. Conversely, if a chunk shrinks below the lower bound, it is merged (in a lock-free manner) with the previous chunk in the list. In order for a split to create acceptable chunks, it is required that the lower bound (on the number of entries in a chunk) not exceed half of the maximum number of entries in the chunk; otherwise, a split would create two chunks that violate the lower bound. A natural optimization when searching such a list is to jump quickly to the next chunk (without traversing all its entries) if the desired key is not within the key range of this chunk. This gives an additional performance improvement, since the search progresses in skips whose size is at least the chunk's minimal bound.
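The chunk-skipping search can be sketched as follows; the first_key array standing in for the chunk list is our simplification of the design, chosen only for this illustration.

```c
#include <assert.h>

/* first_key[i] is the lowest possible key of chunk i (chunks hold
 * disjoint, ascending key ranges). The search advances one whole chunk
 * per step instead of one entry per step. */
int find_chunk(const int first_key[], int nchunks, int key) {
    int c = 0;
    while (c + 1 < nchunks && key >= first_key[c + 1])
        c++;                        /* key belongs further right: skip */
    return c;                       /* chunk whose range covers key */
}
```

Only the chunk returned here needs an entry-by-entry traversal, which is what makes the skip at least the chunk's minimal bound.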
Furthermore, the retreat upon CAS failure is, in the majority of cases, done by returning to the beginning of the chunk, rather than the standard restart from the beginning of the list. To summarize, the contribution of this paper is the presentation of a lock-free linked list, based on single-word CAS instructions, where the keys are unique and ordered. The algorithm does not assume a (lock-free) garbage collector. The list design is locality-conscious. The design poses a restriction on the key and data lengths: for a 64-bit architecture the key is limited to 31 bits and the data to 32 bits. Organization. In Section 2 we specify the underlying structure used to implement the chunked linked list. In Section 3 we introduce the freeze mechanism that serves the split and join operations. In Section 4 we provide the implementation of the linked list functions. A closer look at the details of the freezing mechanism appears in Section 5, and we conclude in Section 6. More detailed explanations and pseudo-code can be found in the full version of this article [2].

2 Preliminaries and Data Structure

A linked list is a data structure that consists of a sequence of data records. Each data record contains a key by which the linked list is ordered. We call each data record an entry. We think of the linked list as representing a set of keys,



Fig. 1. The entry structure

each associated with a data part. Following previous work [4,6], a key cannot appear twice in the list. Thus, an attempt to insert a key that already exists in the list fails. Each entry holds the key and the data associated with it. Generally, this data is a pointer, or a mapping from the key to a larger piece of data associated with it. Next, we present the underlying data structure employed in the construction. We assume a 64-bit platform in this description. A 32-bit implementation can be derived by cutting each field in half, or by keeping the same structure but using a wide compare-and-swap, which writes atomically to two consecutive words. The structure of an entry. A list entry consists of key and data fields, and a next pointer (pointing to the next entry). These fields are arranged in two words: the key and data reside in the first word and the next pointer in the second. Three more bits are embedded in these two words. First, we embed the delete bit in the least significant bit of the next pointer, following Harris [6]. The delete bit is set to mark the logical deletion of the entry. The freeze bits are new in this design. They take a bit from each of the entry's words, and their purpose is to indicate that the entire chunk holding the entry is about to be retired. These three flags consume one bit of the key and two bits of the next pointer. Notice that the three LSBs of a pointer do not really hold information on a 64-bit architecture. The entry structure is depicted in Figure 1. In what follows, we refer to the first word as the keyData word and to the second word as the nextEntry word. We further reserve one key value, denoted ⊥, to signify that the entry is currently not allocated. This value is not allowed as a key in the data structure. As will be discussed in Section 4, an entry is available for allocation if its key is ⊥ and its other fields are zeroed. The structure of a chunk.
The main support for locality stems from the fact that consecutive entries are kept on a chunk, so that traversals of the list exhibit better locality. In order to keep a substantial number of entries on each chunk, the linked list makes sure that the number of entries in a chunk is always between the parameters min and max. The main part of a chunk is an array that holds the entries of the chunk and may hold up to max entries of the linked list. In addition, the chunk holds some fields that help manage the chunk. First, we keep one special entry that serves as a dummy header entry, whose next pointer points to the first entry in the chunk. The dummy header is not a must, but it simplifies the algorithm's code. To identify chunks that are too sparse, each chunk has a counter of the number of entries currently allocated in it. In the presence of concurrent mutations, this counter will not always be accurate, but it will always hold a lower bound on the number of allocated



[Figure omitted: a chunk holding a counter, an entriesArray[MAX] of two-word (64-bit) entries (key, delete bit, next pointer; keys ⊥ mark free slots), a dummy head entry, and the new, mergeBuddy, freezeState and nextChunk fields; flags occupy the 3 LSBs of pointers.]

Fig. 2. The chunk structure

entries in the chunk. When an attempt is made to insert too many entries into a chunk, the chunk is split. When it becomes too small due to deletions, it is merged with a neighboring chunk. We require max > 2·min+1, since splitting a large chunk must create two well-formed new chunks. In practice, max will be substantially larger than 2·min, to avoid frequent splits and merges. Additional fields (new, mergeBuddy and freezeState) are needed for running the splits and merges and are discussed in Section 5. The chunk structure is depicted in Figure 2. The structure of the entire list. The entire list consists of a list of chunks. Initially, a head pointer points to an empty first chunk. We let the first chunk's min bound be 0, to allow small lists. The list grows and shrinks due to the splitting and merging of the chunks. Every chunk has a pointer nextChunk to the next chunk, or null if it is the last chunk of the list. The keys of the entries in the chunks never overlap, i.e., each chunk contains a consecutive subset of the keys in the set, and a pointer to the next chunk, containing the next subset (with strictly higher keys) in the set. The entire list structure is depicted in Figure 3. We set the first key in a chunk as its lowest possible key. Any smaller key is inserted in the previous chunk (except for the first chunk, which can also get keys smaller than its first one). Hazard pointers. Whole chunks and entries inside a chunk are reclaimed manually. Note that garbage collectors do not typically reclaim entries inside an array. To allow safe (and lock-free) manual reclamation of entries, we employ Michael's hazard pointers methodology [8,10]. While a thread is processing an entry (and a concurrent reclamation of this entry could foil its actions), the thread registers the location of this entry in a special pointer called a hazard pointer. Reclamation of entries that have hazard pointers referencing them is avoided.
Following Michael's list implementation [10], each thread has two hazard pointers, denoted hp0 and hp1, that aid the processing of entries in a chunk. We further add four more hazard pointers, hp2, hp3, hp4, and hp5, to handle the operations on the chunk list. Each thread updates only its own hazard pointers, though it can read the other threads' hazard pointers.
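The two-word entry layout described above can be sketched with explicit bit masks. The exact bit positions chosen here (a 31-bit key in the top bits of the keyData word, one freeze bit, 32-bit data, and the delete bit in the LSB of the nextEntry word) are one workable layout consistent with the stated key/data limits, not necessarily the authors' exact encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DELETE_BIT  ((uint64_t)1)        /* LSB of the nextEntry word  */
#define FREEZE_BIT  ((uint64_t)1 << 32)  /* assumed position, keyData  */

/* keyData word: | key (31 bits) | freeze (1 bit) | data (32 bits) | */
static uint64_t pack_key_data(uint32_t key31, uint32_t data) {
    return ((uint64_t)key31 << 33) | data;
}
static uint32_t unpack_key(uint64_t kd)  { return (uint32_t)(kd >> 33); }
static uint32_t unpack_data(uint64_t kd) { return (uint32_t)kd; }

/* Logical deletion a la Harris: flag the next pointer's spare LSB. */
static uint64_t mark_deleted(uint64_t next) { return next | DELETE_BIT; }
static bool     is_deleted(uint64_t next)   { return (next & DELETE_BIT) != 0; }
```

Because pointers are 8-byte aligned, the flagged LSBs can be masked off before the next pointer is dereferenced.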

[Figure omitted: a HEAD pointer to Chunk 1 (chunk's head, counter: 6, entries with keys 5, 26 and 90, plus the new, mergeBuddy and freezeState fields) whose nextChunk points to Chunk 2 (counter: 10, entries with keys 100, 123 and 159).]

Fig. 3. The list structure

3 Using a Freeze to Retire a Chunk

In order to maintain the minimum and maximum number of entries in a chunk, we devised a mechanism for splitting dense chunks and for merging a sparse chunk with its predecessor. The main idea in the design of the lock-free split and merge mechanisms is the freezing of chunks. When a chunk needs to be split or merged, it is first frozen. No insertions or deletions can be executed on a frozen chunk. To split a frozen chunk, two new chunks are created and the entries of the frozen chunk are copied into them. To merge a frozen chunk with a neighbor, the neighbor is first frozen, and then one or two new chunks are allocated and the relevant entries from the two merging chunks are copied into them. Details of the freezing mechanism appear in Section 5. We now review this mechanism in order to allow the presentation of the list operations. The freezing of a chunk comprises three phases: Initiate Freeze. When a thread decides that a chunk should be frozen, it starts setting the freeze bits in all the chunk's entries, one by one. During the time it takes to set all these bits, other threads may still modify entries not yet marked as frozen. During this phase only part of the chunk is marked as frozen, but the freezing procedure cannot be reversed, and frozen entries cannot be reused. Stabilizing. Once all entries in a chunk are frozen, allocations and deletions can no longer be executed. At this point, we link the non-deleted entries into a list. This includes entries that were allocated but not yet connected to the list. All entries that are marked as deleted are disconnected from the list. Recovery. The number of entries in the stabilized list is counted, and a decision is made whether to split this chunk or merge it with a neighbor. Sometimes, due to changes that happened during the first phase, the frozen chunk turns out to be a good one that requires neither a split nor a merge. Nevertheless, the retired chunk is never resurrected.
We always allocate a new chunk to replace it and copy the appropriate values to the new chunk. Whatever action is decided upon (split, merge, or copy chunk) must be carried through. Any thread that fails to insert or delete a key due to the progress of a freeze joins in helping the freezing of the chunk. However, threads that perform a search continue to search in frozen chunks with no interference.
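The initiate-freeze phase can be sketched as a CAS loop per entry word: whatever value the word currently holds is republished with its freeze bit set, so a concurrent mutation simply forces a retry. The single FREEZE_MASK position below is illustrative (the design embeds one freeze bit in each of the entry's two words).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FREEZE_MASK ((uint64_t)1 << 32)   /* assumed freeze-bit position */

/* Set the freeze bit of one entry word, tolerating concurrent updates. */
void freeze_word(_Atomic uint64_t *word) {
    uint64_t old = atomic_load(word);
    while (!(old & FREEZE_MASK) &&
           !atomic_compare_exchange_weak(word, &old, old | FREEZE_MASK))
        ;   /* failed CAS refreshed `old`; retry with the new value */
}

/* Phase 1 of the freeze: mark every entry of the chunk, one by one. */
void freeze_chunk(_Atomic uint64_t entries[], int n) {
    for (int i = 0; i < n; i++)
        freeze_word(&entries[i]);
}
```

Once every word carries the bit, no CAS that assumes an unfrozen value can succeed, which is what makes the stabilizing phase safe.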


4 The List Operations: Search, Insert and Delete

We now turn to describe the basic linked list operations. The high-level code for an insertion, deletion, or search of a key is very simple. Each of these operations starts by invoking the FindChunk method to find the relevant chunk. Then it calls SearchInChunk, InsertToChunk, or DeleteInChunk according to the desired operation; finally, the hazard pointers hp2, hp3, hp4, and hp5 are nullified to release the hazard pointers set by the FindChunk method and allow future reclamation. The main challenge is in the work inside the chunk and the handling of the freeze process, on which we elaborate below. More details appear in [2]. Turning to the operations inside the chunks, the delete and search methods are close to the previous design [10], except for the special treatment of the chunk bounds and the freeze status. For lack of space they are not specified in this short paper; the details appear in [2]. However, the insert method is quite different, because it must allocate an entry in shared memory (on the chunk), whereas previously it was assumed that the insert allocates local space for a new entry and privately prepares it for insertion into the list. For the purpose of handling the entry list in the chunk, we maintain five variables that are global and appear in all the code below. These variables are global to each thread's code, but are not shared between threads; all of them follow Michael's design [10]. The first three per-thread variables are (entry** prev), (entry* cur), and (entry* next). The other two are the pointers (entry** hp0) and (entry** hp1) that point to the two hazard pointers of the thread. All other variables are local to the method that mentions them.

4.1 The Insert Operation

The InsertToChunk method inserts a key with its associated data into a chunk. It first attempts to find an available entry and allocate it with the given key. If no available entry exists, a split is executed and the operation is retried. If an entry is obtained, the InsertEntry method is invoked to insert the entry into the list. The insertion fails if the key already exists in the chunk; in this case InsertToChunk clears the entry to free it for future allocations. The InsertToChunk code is presented in Algorithm 1. It starts with an attempt to find an available entry for allocation. A failure occurs when all entries are in use; in this case a freeze is initiated. The Freeze method gets the key and data as input, and also an input indicating that it is invoked by an insertion operation. This allows the Freeze method to try to insert the key into the newly created chunk. When successful, it returns a null pointer to indicate the completion of the insertion. It also sets a local variable result to indicate whether the completed insertion actually inserted the key or completed by finding that the key already existed in the list (which is also a legitimate completion of the insert operation). If the insertion is not completed by the Freeze method, it returns a pointer to the chunk on which the insertion should be retried. Connecting the entry to the list is done by InsertEntry. If the entry gets allocated and linked to the list, then the chunk counter is incremented only by

Locality-Conscious Lock-Free Linked Lists

113

Algorithm 1. Insert a key and its associated data into a chunk Bool InsertToChunk (chunk* chunk, key, data) { 1: current = AllocateEntry(chunk, key, data); // Find an available entry 2: while ( current == null ) { // No available entry. Freeze and try again 3: chunk = Freeze(chunk, key, data, insert, &result); 4: if ( chunk == null ) return result; // Freeze completed the insertion. 5: current = AllocateEntry(chunk, key, data); // Otherwise, retry allocation 6: } 7: returnCode = InsertEntry(chunk, current, key); 8: switch ( returnCode ) { 9: case success this: 10: IncCount(chunk); result = true; break; // Increase the chunk’s counter 11: case success other: // Entry was inserted by other thread 12: result = true; break; // due to help in freeze 13: case existed: // This key exists in the list. Reclaim entry 14: if ( ClearEntry(chunk, current) ) // Attempt to clear the entry 15: result = false; 16: else // Failure to clear the entry implies that a freeze thread 17: result = true; // eventually inserts the entry 18: break; 19: } // end of switch 20: *hp0 = *hp1 = null return result; // Clear all hazard pointers and return }

the thread that linked the entry itself. If the key already existed in the list, then ClearEntry attempts to clear the entry for future reuse. However, a rare scenario may foil clearing of the entry. This happens when the other occurrence of the key (which existed previously in the list) gets deleted before our entry gets cleared. Furthermore, a freeze occurs, in which the semi-allocated entry gets linked by other threads into the new chunk’s list. At this point, clearing this entry is avoided, and ClearEntry returns false. In such a scenario, clearing the entry fails and the insert operation succeeds. At the end of InsertToChunk, all hazard pointers are cleared and we return with a code specifying if the insert was successful, or the key previously existed in the list. The allocation of an available entry is executed using the AllocateEntry method (depicted in [2]). An available entry contains ⊥ as a key and zeros otherwise. An available entry is allocated by assigning the key and data values in the keyData word in a single atomic compare-and-swap (CAS) that assumes this word has the ⊥ symbol and zeros in it. An entry whose keyData has the freeze bit set cannot be allocated as it is not properly zeroed. Note also that once an entry is allocated, all the information required for linking it to the list is available to all threads. Thus, if a freeze starts, then all threads may create a stabilized list of the allocated entries in a chunk. The AllocateEntry method searches for an available entry. If no free entry can be found, null is returned.
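The single-CAS allocation on the keyData word can be sketched as follows. This is an illustrative assumption, not the paper's code: we pack key and data into one 64-bit word, let the all-zero word play the role of ⊥-with-zeros, and reserve bit 0 of the word as the freeze bit, so a frozen empty entry fails the CAS automatically.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

/* Assumed layout: key in the high 32 bits, data shifted into the remaining
 * bits, bit 0 reserved as the freeze bit. Names are hypothetical. */
#define FREEZE_BIT ((uint64_t)1)
#define AVAILABLE  ((uint64_t)0)          /* the ⊥ key and zeroed data */

#define ENTRIES_IN_CHUNK 4

typedef struct chunk {
    _Atomic uint64_t keyData[ENTRIES_IN_CHUNK];
} chunk;

static uint64_t pack(uint32_t key, uint32_t data) {
    return ((uint64_t)key << 32) | ((uint64_t)data << 1); /* bit 0 stays clear */
}

/* Return the index of the allocated entry, or -1 if no entry is available
 * (the caller then initiates a freeze, as in Algorithm 1). */
static int allocate_entry(chunk* c, uint32_t key, uint32_t data) {
    for (int i = 0; i < ENTRIES_IN_CHUNK; i++) {
        uint64_t expected = AVAILABLE;    /* the word must hold ⊥ and zeros */
        /* A frozen empty entry has FREEZE_BIT set, so this CAS fails on it. */
        if (atomic_compare_exchange_strong(&c->keyData[i], &expected,
                                           pack(key, data)))
            return i;
    }
    return -1;
}
```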

114

A. Braginsky and E. Petrank

Algorithm 2. Connecting an allocated entry into the list

returnCode InsertEntry (chunk* chunk, entry* entry, key) {
 1: while ( true ) {
 2:   savedNext = entry→next;
 3:   // Find insert location and pointers to previous and current entries (prev, cur)
 4:   if ( Find(chunk, key) )                     // This key existed in the list
 5:     if ( entry == cur ) return success_other; else return existed;
 6:   // If neighborhood is frozen, keep it frozen
 7:   if ( isFrozen(savedNext) ) markFrozen(cur); // cur will replace savedNext
 8:   if ( isFrozen(cur) ) markFrozen(entry);     // entry will replace cur
 9:   // Attempt linking into the list. First attempt setting next field
10:   if ( !CAS(&(entry→next), savedNext, cur) ) continue;
11:   if ( !CAS(prev, cur, entry) ) continue;     // Attempt linking
12:   return success_this;                        // Both CASes were successful
13: }
}

Next comes the InsertEntry method, which takes an allocated entry and attempts to link it into the linked list. The InsertEntry code is presented in Algorithm 2. The input parameter entry is a pointer to an entry that should be inserted; it is already allocated and initialized with the key and data. Before searching for the location to which to connect this entry, we record this entry's next pointer. Normally this should be null, but in the presence of concurrent executions of InsertEntry (which may happen during a freeze), we must make sure later that the entry's next pointer was not modified before we atomically write it in Line 10. After saving the current next pointer, we search for the entry's location via the Find method. If the key already exists in the list, InsertEntry checks whether the returned entry is the same as the one it is trying to insert (by address comparison). The result determines the return code: either the key existed and we failed, or the key was inserted, but not by the current thread. (This can happen during a freeze, when all threads attempt to stabilize the frozen list.) Otherwise, the key does not exist, and Find sets the global variable cur to a pointer to the entry that should follow our entry in the list, and the global variable prev to the pointer that should reference our entry. The Find method protects the entries referenced by prev and cur with the hazard pointers hp1 and hp0, respectively. There is no need to protect the newly allocated entry because it cannot be reclaimed by a different thread. If any to-be-modified pointer is marked as frozen, we make sure that its replacement is marked as frozen as well. An allocation of an entry can never occur on a frozen entry. However, once the allocation is successful, the new entry may freeze, and InsertEntry should still connect it to the list. Finally, two CASes are used to link the entry into the list. Whenever a CAS fails, the insertion starts from scratch.
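The isFrozen/markFrozen helpers of Algorithm 2 can be realized by tagging pointers, under the common (assumed here, not stated in the paper) convention that the freeze mark lives in the least-significant bit of a pointer to an aligned entry, alongside Harris-style deletion marks:

```c
#include <stdint.h>
#include <assert.h>

#define FROZEN_MASK ((uintptr_t)1)

typedef struct entry { struct entry* next; long key; } entry;

/* Test whether a pointer value carries the freeze mark. */
static int isFrozen(entry* p) {
    return ((uintptr_t)p & FROZEN_MASK) != 0;
}

/* Return the same pointer with the freeze mark set, so a CAS that installs
 * it keeps the neighborhood frozen (Lines 7-8 of Algorithm 2). */
static entry* markFrozen(entry* p) {
    return (entry*)((uintptr_t)p | FROZEN_MASK);
}

/* Strip the mark to recover the real entry address for dereferencing. */
static entry* clearMark(entry* p) {
    return (entry*)((uintptr_t)p & ~FROZEN_MASK);
}
```

Since entries are word-aligned, bit 0 of a genuine entry address is always zero, so the mark never collides with a real address.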


Algorithm 3. The main freeze method

chunk* Freeze(chunk* chunk, key, data, triggerType tgr, Bool* res) {
 1: CAS(&(chunk→freezeState), no_freeze, internal_freeze);
 2: // At this point, the freeze state is either internal_freeze or external_freeze
 3: MarkChunkFrozen(chunk);
 4: StabilizeChunk(chunk);
 5: if ( chunk→freezeState == external_freeze ) {
 6:   // This chunk was marked external_freeze before Line 1 executed
 7:   master = chunk→mergeBuddy;                 // Get the master chunk
 8:   // Fix the buddy's mergeBuddy pointer
 9:   masterOldBuddy = combine(null, internal_freeze);
10:   masterNewBuddy = combine(chunk, internal_freeze);
11:   CAS(&(master→mergeBuddy), masterOldBuddy, masterNewBuddy);
12:   return FreezeRecovery(chunk→mergeBuddy, key, data, merge, chunk, tgr, res);
13: }
14: decision = FreezeDecision(chunk);            // The freeze state is internal_freeze
15: if ( decision == merge ) mergePartner = FindMergeSlave(chunk);
16: return FreezeRecovery(chunk, key, data, decision, mergePartner, tgr, res);
}

5 The Freeze Procedure

We now provide more details about the freeze procedure; the full description is presented in [2]. The freezing process occurs when the number of entries in a chunk violates its bounds. At this point, splitting or merging happens by copying the relevant keys (and data) into a newly allocated chunk (or chunks). This process comprises three phases: initiation, stabilization, and recovery.

The code for the Freeze method is presented in Algorithm 3. The input parameters are the chunk that needs to be frozen, the key, the data, and the event that triggered the freeze: insert, delete, enslave (if the freeze was called to prepare the chunk for a merge with a neighboring chunk), or none (if the freeze is called while clearing an entry). The freeze will attempt to execute the insertion, deletion, or enslaving, and will return a null pointer when successful. It will also set an input boolean flag to indicate the return code of the relevant operation. When unsuccessful, it will return a pointer to the new chunk on which the operation should be retried.

The Freeze method starts with an attempt to atomically change the freeze state from no freeze to internal freeze. The freeze state of the chunk is normally no freeze, and is switched to internal freeze when a freeze process of this chunk begins. But it can also be external freeze, when a neighbor requested a freeze on this chunk to allow a merge between the two. Thus, an external freeze can start even when no size violation is detected in this chunk. Whether or not the modification succeeds, we know that the freeze state can no longer be no freeze; it is either internal freeze or external freeze. The Freeze method then calls MarkChunkFrozen to mark each


entry in the chunk as frozen, and StabilizeChunk to finish stabilizing the entries list in the chunk. At this point, the entries in the chunk cannot be modified anymore. Freeze then checks whether the freeze is external or internal. An external freeze can occur when a freeze is concurrently executed on the next chunk, and it has already enslaved the current chunk as its merge buddy. In this case, we cooperate with the joint freeze and joint recovery. When the state of the freeze is external, the current chunk must have its mergeBuddy pointer already pointing to the chunk that initiated the merge, denoted the master. To finish this freeze, we make sure that the master has its merge buddy properly pointing back at the current chunk. The master chunk's mergeBuddy pointer must be either null or already pointing to the buddy we found; thus it is enough to use one CAS command to verify that it is not null. Finally, we execute the recovery phase on the master chunk and return its output. We do not need to check the decision about the freeze of the buddy: it must be a merge. If the freeze is internal, we invoke FreezeDecision to see what should be done next (Line 14). If the decision is to merge, we find the previous chunk and "enslave" it for a joint merge using the FindMergeSlave method. Finally, the FreezeRecovery method is called to complete the freeze process. Next, we explain each of the stages. The full details, including pseudo-code, appear in [2].

Marking the chunk as frozen. The MarkChunkFrozen method simply goes over the entries one by one and marks each one as frozen. The setting of the freeze flags is atomic, and it is retried repeatedly until successful. By the end of this process, all entries (including the free ones) are marked as frozen.

Stabilizing the chunk. After all the entries in the chunk are marked as frozen, new entries cannot be allocated and existing entries cannot be marked as deleted.
However, the frozen chunk may contain allocated entries that were not yet linked, and entries that were marked as deleted but have not yet been disconnected. The StabilizeChunk method disconnects all deleted entries and links all allocated ones. It uses the Find method to disconnect all entries that are marked as deleted. Such entries do not need to be reclaimed (when marked as frozen), but they should not be copied to the new chunk. Next, StabilizeChunk attempts to connect entries. It goes over all entries and searches for ones that are disconnected, but neither reclaimed nor deleted. Each such entry is linked to the list by invoking InsertEntry, which fails only if the key already exists in a different entry in the chunk's list; in this case, this entry should indeed not be connected to the stabilized list.

Reaching a decision. After stabilizing the chunk, everything is frozen, the list is completely connected, and nothing changes in the chunk anymore. At this point, we need to decide whether splitting or merging is required. To that end, a count is performed and a decision is made by comparison to min and max. It may happen that the resulting count is higher than min and lower than max, in which case no operation is required. Nevertheless, the frozen chunk is never resurrected; instead, we copy the chunk to a new chunk in the (upcoming) recovery stage.
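The decision stage reduces to a comparison of the live-entry count against min and max, and can be sketched as below. The enum and function names are illustrative, not the paper's exact identifiers.

```c
#include <assert.h>

/* Possible outcomes of the decision stage, matching Cases I-III of the
 * recovery discussion: copy into one fresh chunk, merge with the previous
 * chunk, or split into two chunks. */
typedef enum { DECISION_COPY, DECISION_MERGE, DECISION_SPLIT } decision_t;

static decision_t freeze_decision(int count, int min, int max) {
    if (count == min) return DECISION_MERGE;  /* too few entries: merge   */
    if (count == max) return DECISION_SPLIT;  /* chunk is full: split     */
    return DECISION_COPY;  /* min < count < max: copy into a fresh chunk  */
}
```

Note that even in the copy case the frozen chunk is abandoned rather than resurrected, as the text explains.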


Making the recovery. Once a decision is reached, a recovery starts. The recovery procedure allocates a chunk (or two) and copies the relevant information into the new chunk (or chunks). If a merge is involved, the previous chunk in the list is first frozen (under an external freeze) and both chunks bring entries for the merge. Several threads may perform the freeze procedure concurrently, but all of them will make the same recovery decision about the freeze, as the frozen stabilized chunk looks the same to all threads. A thread that performs the recovery creates a local chunk (or chunks) into which it copies the relevant entries. At this point all threads create the same new chunk (or chunks). But now, each thread performs the operation with which it initiated the freeze on the new chunks: an insert, delete, or enslave. Performing the operation is easy, because the new chunks are local to this thread and no race can occur. (Enslaving a chunk is done simply by modifying its freeze state from no freeze to external freeze and registering the merge buddy.) But the success of making the local operation visible in the data structure is determined by whether the thread succeeds in creating a link to its new chunks in the frozen chunk, as explained next. After creating the new chunks locally and executing the original operation on them, there is an attempt to atomically insert the address of the local chunk into a dedicated pointer (new) in the frozen chunk. When two chunks are created, the second one is locally linked to the first one by the nextChunk field. If the insertion is successful, then this thread has also completed the operation it was performing (insert, delete, or enslave). If the insertion is unsuccessful, then a different thread has already completed the installation of new chunks, and this thread's local new chunks will not be used (i.e., can be reclaimed). In this case, the thread must try its operation again from scratch.
According to the number of (live) entries in the frozen chunk, there are three ways to recover from the freeze.

Case I: min < count < max. In this case, the required action is to allocate a new chunk and copy all of the entries from the frozen chunk to the new chunk. Next we perform the insert, delete, or enslave operation on the local new chunk and attempt to link it to the frozen one.

Case II: count == min. In this case we need to merge the frozen chunk with its previous chunk. We assume that the previous chunk has already been frozen by an external freeze before the recovery is executed, and that the freeze states in both chunks are properly set so that no thread can interfere with the freeze process. We start by checking the overall number of entries in these two chunks, to decide whether the merged entries will fit into one or two chunks. We then allocate a second new chunk, if needed, and perform the (local) copy to the new chunk or chunks. When copying into two new chunks, we split the entries evenly, and return the smallest key in the second chunk as the separating key. As before, we perform the original operation that started the freeze and try to create a link from the old chunk to the new chunk or chunks.


Case III: count == max. In this case we need to split the old chunk into two new chunks. The basic operations of this case resemble those of the previous cases. We allocate two new chunks, perform the split locally, perform the original operation, and attempt to link the new chunks to the old one.
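The even split used in Cases II and III, including the choice of the separating key, can be sketched as below. This is a hypothetical helper working over the sorted live keys of the frozen chunk, not the paper's actual code.

```c
#include <string.h>
#include <assert.h>

/* Copy the lower half of the sorted live keys into the first new chunk and
 * the upper half into the second, returning the smallest key of the second
 * chunk as the separating key (as described for the two-chunk recovery). */
static long split_evenly(const long* sorted_keys, int count,
                         long* first, int* nfirst,
                         long* second, int* nsecond) {
    int half = count / 2;
    *nfirst = half;
    *nsecond = count - half;
    memcpy(first, sorted_keys, (size_t)half * sizeof(long));
    memcpy(second, sorted_keys + half, (size_t)(count - half) * sizeof(long));
    return second[0];   /* separating key between the two new chunks */
}
```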

6 Conclusion

We have presented chunking and freezing mechanisms that build a cache-conscious lock-free linked list. Our list consists of chunks, each containing consecutive list entries. Thus, a traversal of the list stays mostly within a chunk's boundary (a virtual page or a cache line), and therefore the traversal incurs fewer page faults (or cache misses) than a traversal of randomly allocated nodes, each containing a single entry. Maintaining a linked list in chunks is often used in practice (e.g., [1,5]), but a lock-free implementation of a cache-conscious linked list has not been available heretofore. We believe that the building blocks of this list, i.e., the chunks and the freeze operation, can be used for building additional data structures, such as lock-free hash tables, among others.

References

1. Unrolled Linked Lists, http://blogs.msdn.com/devdev/archive/2005/08/22/454887.aspx
2. Full Version of Locality-Conscious Lock-Free Linked Lists, http://www.cs.technion.ac.il/~erez/Papers/lf-linked-list-full.pdf
3. Fomitchev, M., Ruppert, E.: Lock-free linked lists and skip lists. In: Proc. PODC (2004)
4. Fraser, K.: Practical Lock-Freedom. Technical Report UCAM-CL-TR-579, University of Cambridge, Computer Laboratory (February 2004)
5. Frias, L., Petit, J., Roura, S.: Lists revisited: Cache-conscious STL lists. J. Exp. Algorithmics 14, 3.5-3.27 (2009)
6. Harris, T.L.: A pragmatic implementation of non-blocking linked-lists. In: Proc. PODC (2001)
7. Herlihy, M.: Wait-free synchronization. TOPLAS (1991)
8. Michael, M.M.: High performance dynamic lock-free hash tables and list-based sets. In: Proc. SPAA (2002)
9. Michael, M.M.: Safe memory reclamation for dynamic lock-free objects using atomic reads and writes. In: Proc. PODC (2002)
10. Michael, M.M.: Hazard pointers: Safe memory reclamation for lock-free objects. TPDS (June 2004)
11. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann, San Francisco (2008)
12. Valois, J.D.: Lock-free linked lists using compare-and-swap. In: Proc. PODC (1995)
13. Treiber, R.K.: Systems Programming: Coping with Parallelism. Research Report RJ 5118, IBM Almaden Research Center (1986)

Specification and Constant RMR Algorithm for Phase-Fair Reader-Writer Lock

Vibhor Bhatt and Prasad Jayanti

Department of Computer Science, Dartmouth College, NH, USA

Abstract. Brandenburg and Anderson [1,2] recently introduced a phase-fair readers/writers lock, where read and write phases alternate: when the writer leaves the CS, any waiting reader will enter the CS before the next writer enters the CS; similarly, if a reader is in the CS and a writer is waiting, any new reader that now enters the Try section will not enter the CS before some writer enters the CS. Thus, neither class of processes–readers or writers–has priority over the other, and no process starves. Brandenburg and Anderson [1,2] informally specify a phase-fair lock and present an algorithm to implement it with O(n) remote memory reference (RMR) complexity, where n is the number of processes in the system. In this work we give a rigorous specification of a phase-fair lock and present an algorithm that implements it with O(1) RMR complexity.

1 Introduction

Mutual exclusion [3] is a well-studied, fundamental problem in distributed computing. Here processes repeatedly cycle through four sections of code–Remainder Section, Try Section, Critical Section (CS), and Exit Section–in that order, and the problem consists of designing the code for the Try and Exit sections so that the mutual exclusion property–at most one process is in the CS at any time–is satisfied. Readers/Writers Exclusion [4] is a well-known variant of mutual exclusion, commonly used in operating systems and in parallel applications to implement shared data structures. In Readers/Writers Exclusion, processes are divided into two classes–readers and writers–and the exclusion property is relaxed to allow for more concurrency: multiple readers can be in the CS at the same time, although no process may be in the CS at the same time as a writer. Starting from the earliest paper [4], most works on Readers/Writers Exclusion studied the problem in three natural variants–one in which readers have priority over writers, one in which the writers have priority, and one in which neither class of processes–readers or writers–has priority over the other. This work deals with the third variant.

When neither class has priority, Brandenburg and Anderson [1,2] suggested a desirable property, which they called phase-fairness, that requires read and write phases to alternate: when the writer leaves the CS, any waiting reader will enter the CS before the next writer enters the CS; similarly, if a reader is in the CS and a writer is waiting, any new reader that now enters the Try section will not enter the CS before some writer enters the CS. Their algorithm to realize this property has O(n) remote memory reference complexity (RMR complexity), where n is the number of processes in the system. (A memory reference to a shared variable X by a processor p is considered remote in Cache Coherent (CC) machines if X is not in p's cache; it is considered remote in Distributed Shared Memory (DSM) machines if X is at a memory module of a different processor. The RMR complexity of an algorithm is the worst-case number of remote memory references that a process makes in order to execute the Try and Exit sections once.)

Our paper makes two contributions. Brandenburg and Anderson stated the phase-fairness property only informally, and did not include all of the elements that one would intuitively associate with phase-fairness. Our first contribution is a more comprehensive and rigorous specification of the phase-fairness property. Our second contribution is an algorithm that achieves this property with O(1) RMR complexity on CC machines,1 in contrast to their O(n) RMR complexity algorithm.

The rest of the paper is organized as follows. Section 2 describes the model and states the basic properties required by the Reader-Writer Exclusion problem, followed by a rigorous formulation of the phase-fairness properties. Section 3 describes related work. The last two sections contain the algorithms. Section 4 describes the single-writer algorithm satisfying the phase-fairness properties. This algorithm and its description are taken almost verbatim from [5]. Section 5 shows how to transform the single-writer algorithm into a multi-writer algorithm satisfying the basic and phase-fairness properties. Proofs are omitted due to space constraints.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 119-130, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Model and Specification of the Reader-Writer Problem

The system consists of a set of n asynchronous processes {p0, ..., pn−1}, communicating with each other through shared variables that support the atomic operations read, write, and fetch&add (F&A). Each process is labeled a reader or a writer,2 and its program is a loop that consists of two sections of code–Try section and Exit section. We say a process is in the Remainder section if its program counter is at the first statement of the Try section, and it is in the Critical Section (CS) if its program counter is at the first statement of the Exit section. The Try section, in turn, consists of two code fragments–a doorway, followed by a waiting room–with the requirement that the doorway is bounded "straight line" code [6]. Intuitively, a process "registers" its request for the CS by executing the doorway, and then busy-waits in the waiting room until it has the "permission" to enter the CS. Initially, all processes are in their Remainder section.

Each execution of the Try and Exit sections by a process is called an attempt; it is a read attempt (respectively, write attempt) if the process is a reader (respectively, writer). An attempt by a process p spans from a time t when p executes the first statement of its Try section to the earliest time t′ > t when p completes the Exit section.

1 For DSM machines, Danek and Hadzilacos' lower bound proof for 2-Session Group Mutual Exclusion implies that there is no O(1) RMR complexity algorithm for Readers/Writers Exclusion.
2 Our algorithms work even when this labeling is not static, i.e., when the same process performs read attempts sometimes and write attempts at other times. We assume static labeling only for simplicity.

An attempt A


is active in a configuration C of a run if A starts before C and does not complete before C. The following definitions of "precedence" and "enabled" will be useful for defining fairness properties in the next section. If A is an attempt by a process p, henceforth we write "A completes the doorway at time t" as a shorthand for "p completes the doorway at time t during the attempt A."

Definition 1. If A and A′ are any two attempts in a run (possibly by different processes), A doorway precedes A′ if A completes the doorway before A′ begins the doorway. A and A′ are doorway concurrent if neither doorway precedes the other.

Definition 2. A process p is enabled to enter the CS in configuration C if p is in the Try section in C and there is an integer b such that, in all runs from C, p enters the CS in at most b of its own steps.

The Phase-Fair Reader-Writer Problem is to design the code for the Try and the Exit sections so that properties P1 through P5 stated below and the properties PF1 and PF2 of Subsection 2.2 hold in all runs of the algorithm.

– (P1). Mutual Exclusion: If a writer is in the CS at any time, then no other process is in the CS at that time.
– (P2). Bounded Exit: There is an integer b such that in every run, every process completes the Exit section in at most b of its steps.
– (P3). First-Come-First-Served (FCFS) among writers: If w and w′ are any two write attempts in a run and w doorway precedes w′, then w′ does not enter the CS before w.
– (P4). First-In-First-Enabled (FIFE) among readers: Let r and r′ be any two read attempts in a run such that r doorway precedes r′. If r′ enters the CS before r, then r is enabled to enter the CS at the time r′ enters the CS.
– (P5). Concurrent Entering: Informally, if all writers are in the remainder section, readers should be able to enter the CS in a bounded number of steps.
More precisely: there is an integer b such that, if σ is any run from a reachable configuration such that all writers are in the remainder section in every configuration of σ, then every read attempt in σ executes at most b steps of the Try section before entering the CS.

Finally, we state the liveness property. When one class of processes (e.g., readers) has priority over the other class (e.g., writers), the starvation of processes belonging to the lower-priority class is unavoidable. Therefore, instead of starvation-freedom, we require the weaker livelock-freedom property, which is appropriate in all three cases of reader-priority, writer-priority, and no-priority. Livelock-freedom guarantees that, under the standard assumption that no process crashes in the middle of the Try, CS, or Exit section, some process in the Try section will eventually enter the CS and some process in the Exit section will eventually enter the Remainder section.

– (P6). Livelock-freedom: If no process crashes in an infinite run, then infinitely many attempts complete in that run.


2.1 Reader-Priority and Writer-Priority Formulations

In a recent paper [5] we presented the reader- and writer-priority formulations and presented constant RMR algorithms for those cases. This submission studies the no-priority formulation, described next.

2.2 Phase-Fairness Properties

When neither readers nor writers have priority over the other, the most natural additional property to require is starvation-freedom–that no reader or writer gets stuck forever in the Try or Exit section. However, Brandenburg and Anderson [1,2] have pointed out that we could demand more–that readers and writers take turns fairly, while still allowing for concurrency (by enabling multiple readers to cohabit the CS). Specifically, if readers are waiting when a writer leaves the CS, then all such waiting readers should be allowed to enter the CS before the next writer may enter the CS, i.e., the "session" should switch from being a "write session" to a "read session." Likewise, if a read session is in progress and one or more writers are waiting, then no new readers should be allowed into the CS. These "fair switching" properties were stated informally in Brandenburg and Anderson's work, and we formulate them rigorously below.

– (PF1). Fair switch from writer to readers: If at some time in a run a write attempt w is in the CS and a read attempt r is in the waiting room, then r enters the CS before any write attempt w′ ≠ w enters the CS in the future.
– (PF2). Fair switch from readers to writer: If at time t a read attempt is in the CS and a write attempt is in the waiting room, then some write attempt enters the CS in the future before any read attempt initiated after t enters the CS.

Our quest is to identify properties that are desirable in any algorithm that aims to ensure fairness between readers and writers. In this quest, the two properties stated above may be considered necessary, but they are surely not sufficiently strong.
To see this, consider a scenario where a writer w is in the CS while a set W of writers and a set R of readers are in the waiting room. When w leaves the CS, the first property blocks writers from entering the CS until all readers in R enter the CS, but it makes no guarantee about how quickly these waiting readers enter the CS. In particular, even after w completes the Exit section and goes back to the Remainder section, the writers in W may temporarily block the readers in R from entering the CS without violating the above properties. So we state a stronger property below that guarantees that, once w's writing session is over, every reader in R will be able to enter the CS in a bounded number of its own steps. We consider w's writing session to be over as soon as either of the following two events happens: (i) w goes back to the Remainder section, or (ii) some reader or writer enters the CS after w leaves the CS.

– (PF3). Fast switch from writer to readers: Suppose that at some time t a write attempt w is in the CS and a read attempt r is in the waiting room, and t′ > t is the earliest time when w is completed or some attempt a ≠ w is in the CS. Then, at time t′, either r is in the CS or r is enabled to enter the CS.


The next lemma states that this property is stronger than the "Fair switch from writer to readers" property stated earlier.

Lemma 1. If an algorithm satisfies Mutual Exclusion and Fast switch from writer to readers, then it satisfies Fair switch from writer to readers.

3 Related Work

Courtois et al. first posed and solved the Readers/Writers problem [4]. Mellor-Crummey and Scott's algorithms [7] and their variants [8] are queue-based; they have constant RMR complexity, but do not satisfy Concurrent Entering. Anderson's algorithm [1] and Danek and Hadzilacos' algorithm [9] satisfy Concurrent Entering, but they have O(n) and O(log n) RMR complexity, respectively, where n is the number of processes. The first O(1) RMR reader-writer lock algorithm satisfying Concurrent Entering (P5) was designed by Bhatt and Jayanti [5]. In that work they studied all three variations of the problem–the reader-priority, writer-priority, and starvation-free cases.

Brandenburg and Anderson's recent work [1,2], which is most closely related to this paper, noted that Mellor-Crummey and Scott's queue-based algorithm [7] limits concurrency because of its strict adherence to a FIFO order among all waiting readers and writers. For instance, if the queue contains a writer between two readers, then the two readers cannot be in the CS together. To overcome this drawback, Brandenburg and Anderson proposed "phase-fairness," which requires readers and writers to take turns in a fair manner. The fair and fast switch properties (PF1-PF3) stated in Section 2 are intended to rigorously capture the informal requirements stated in their work. Their algorithm in [1] has O(n) RMR complexity and satisfies PF3 (and hence also PF1), but not PF2. Their phase-fair queue-based algorithm in [2] has constant RMR complexity and satisfies PF1, but not PF2 or PF3; it also fails to satisfy Concurrent Entering (P5). An algorithm in [5] has constant RMR complexity and satisfies all of PF1, PF2, and PF3 (and all of P1-P5), but it supports only a single writer. We use this algorithm as a building block to design a constant RMR algorithm that supports multiple writers and readers and satisfies all the properties from Section 2.
As with Brandenburg and Anderson's algorithm in [1], our algorithm also uses fetch&add primitives.³

4 Single Writer Algorithm Satisfying Phase-Fair Properties

In this section, we present an algorithm that supports only a single writer and multiple readers, and satisfies the phase-fair properties ((P1)-(P6), (PF2), and (PF3)) and an additional property called writer priority, defined as follows.

(WP1). Writer Priority: If a write attempt w doorway-precedes a read attempt r, then r does not enter the CS before w.⁴

³ Without the use of synchronization instructions like fetch&add, it is well known that constant RMR complexity is not possible even for the standard mutual exclusion problem [10,11,12].
⁴ This property by itself implies PF2.

V. Bhatt and P. Jayanti

This algorithm is very similar to the algorithm presented in Section 5 of [5], and the description given in this section is almost verbatim from Section 5.1 of [5]. We encourage a full reading of Section 5 of [5] to understand the final algorithm given in Section 5 of this paper. The overall idea is as follows. The writer can enter the CS from two sides, 0 and 1. It never changes its side during one attempt of the CS, and it toggles its side for every new attempt. To enter from a certain side, say 1, the writer sets the shared variable D to 1. Then it waits for the readers from the previous side (in this case side 0) to exit the critical section and the Exit section. The last reader to exit from side 0 lets the writer into the CS. Once the writer is done with the CS from side 1, it lets the readers waiting from side 1 into the CS, using the variable Gate described later. The readers in their Try section set their side d equal to D. Then they increment their count in side d and attempt the CS from side d. In order to enter the CS from side d, they busy wait on the variable Gate[d] until it is true. When the readers are exiting, they decrement their count from side d and the last exiting reader wakes up the writer. Now we describe the shared variables used in the algorithm.

procedure Write-lock()
    REMAINDER SECTION
 1. prevD ← D
 2. currD ← ¬prevD
 3. D ← currD
 4. Permit[prevD] ← false
 5. if (F&A(C[prevD], [1, 0]) ≠ [0, 0])
 6.     wait till Permit[prevD]
 7. F&A(C[prevD], [−1, 0])
 8. Gate[prevD] ← false
 9. ExitPermit ← false
10. if (F&A(EC, [1, 0]) ≠ [0, 0])
11.     wait till ExitPermit
12. F&A(EC, [−1, 0])
    CRITICAL SECTION
13. Gate[currD] ← true

procedure Read-lock()
    REMAINDER SECTION
14. d ← D
15. F&A(C[d], [0, 1])
16. d′ ← D
17. if (d ≠ d′)
18.     F&A(C[d′], [0, 1])
19.     d ← D
20.     if (F&A(C[¬d], [0, −1]) = [1, 1])
21.         Permit[¬d] ← true
22. wait till Gate[d]
    CRITICAL SECTION
23. F&A(EC, [0, 1])
24. if (F&A(C[d], [0, −1]) = [1, 1])
25.     Permit[d] ← true
26. if (F&A(EC, [0, −1]) = [1, 1])
27.     ExitPermit ← true
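The tests of the form F&A(C[d], [0, −1]) = [1, 1] in the listing identify the last exiting reader: fetch&add returns the old value, so seeing [1, 1] means the writer's waiting flag is set and exactly one reader (this one) was still registered. A toy sequential illustration of this check (the helper `fa` and the list representation are our own, not the paper's):

```python
# Toy illustration of the "last reader wakes the writer" check.
# C[d] holds [writer-waiting, reader-count]; fa simulates fetch&add and
# returns the OLD value, as the hardware primitive does.

def fa(cell, d_waiting, d_count):
    old = tuple(cell)
    cell[0] += d_waiting
    cell[1] += d_count
    return old

C = [0, 2]                       # no writer waiting, two readers registered
fa(C, 1, 0)                      # writer executes F&A(C[d], [1, 0])
assert fa(C, 0, -1) != (1, 1)    # first reader leaves: not the last one
assert fa(C, 0, -1) == (1, 1)    # last reader leaves: it must set Permit[d]
```

Only the reader that observes [1, 1] grants Permit[d]; every other exiting reader does nothing, which is what keeps the handshake between the last reader and the waiting writer race-free.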

Fig. 1. Single-Writer Multi-Reader Algorithm satisfying Starvation Freedom and Writer-Priority. The doorway of Write-lock comprises Lines 1-3. The doorway of Read-lock comprises Lines 14-21.

4.1 Shared Variables and Their Purpose

All shared variable names start with an upper-case letter and all local variable names with a lower-case letter.

D: A single-bit read/write variable written only by the writer. This variable denotes the side from which the writer wants to attempt the CS.

Specification and Constant RMR Algorithm

Gate[d]⁵: Boolean read/write variable written only by the writer. Gate[d] denotes whether side d is open for the readers to enter the CS. Before entering the CS, a reader has to wait till Gate[d] = true (the gate of side d is open), where d is the side from which the reader is attempting the CS.

Permit[d]: Boolean read/write variable written and read by both readers and the writer. The writer busy waits on Permit[d] to get permission from the readers (of side d) to enter the CS. The idea is that the last reader to exit side d wakes up the writer using Permit[d].

ExitPermit: Boolean read/write variable written and read by both readers and the writer. It is similar to Permit, with the difference that the writer uses it to wait for all the readers to leave the Exit section.

C[d], d ∈ {0, 1}: A fetch&add variable read and updated by both the writer and the readers. C[d] has two components [writer-waiting, reader-count].⁶ writer-waiting ∈ {0, 1} denotes whether the writer is waiting for the readers of side d to leave the CS; it is updated only by the writer. reader-count denotes the number of readers currently registered in side d.

EC: A fetch&add variable read and updated by both the writer and the readers. Like C[d], it has two components [writer-waiting, reader-count]. writer-waiting ∈ {0, 1} denotes whether the writer is waiting for the readers to complete the Exit section. reader-count denotes the number of readers currently in the Exit section.

The following theorem summarizes the properties of this algorithm.

Theorem 1 (Single-Writer Multi-Reader Phase-fair lock). The algorithm in Figure 1 implements a Single-Writer Multi-Reader lock satisfying the properties (P1)-(P6) and (PF2), (PF3). The RMR complexity of the algorithm in the CC model is O(1). The algorithm uses O(1) shared variables that support read, write, and fetch&add.
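Footnote 6 says both components of C[d] (and EC) are stored in a single word, which is what lets one fetch&add update the pair atomically. A minimal sketch of one possible layout (the field widths and helper names below are our own assumptions, not the paper's):

```python
# Hypothetical packing of [writer-waiting, reader-count] into one word:
# low 32 bits hold reader-count, bit 32 holds the writer-waiting flag.

READER_UNIT = 1          # adding this is F&A(C[d], [0, 1])
WRITER_UNIT = 1 << 32    # adding this is F&A(C[d], [1, 0])

def unpack(word):
    """Return (writer_waiting, reader_count) from a packed word."""
    return (word >> 32) & 1, word & 0xFFFFFFFF

def fetch_and_add(cell, delta):
    """Simulate F&A on a one-element list holding the packed word."""
    old = cell[0]
    cell[0] = old + delta
    return old

C = [0]                                  # C[d] = [0, 0]
fetch_and_add(C, READER_UNIT)            # a reader registers
old = fetch_and_add(C, WRITER_UNIT)      # the writer announces itself
assert unpack(old) == (0, 1)             # it saw one registered reader
```

Because the two fields share one word, a single atomic addition can simultaneously set the writer's flag and read the exact reader count, which is what the [1, 1] tests in Figure 1 rely on.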

5 Multi-Writer Multi-Reader Phase-Fair Algorithm

In this section we describe how we construct our multi-writer multi-reader phase-fair lock using the single-writer writer-priority lock given in Figure 1. We denote this single-writer lock by SW in the rest of this section. In all the algorithms we discuss in this section, the code of the Read-lock() procedure is identical to that of SW. The writers, on the other hand, use a Mutual Exclusion lock to ensure that only one writer accesses the underlying SW. More precisely, a writer first enters the CS of the Mutual Exclusion lock and then competes with the readers in the single-writer protocol from Figure 1. The Mutual Exclusion lock we use was designed by T. Anderson [13]. It is a constant RMR Mutual Exclusion lock satisfying P3 and P6. We use the procedures acquire(M) and release(M) to denote the Try and the Exit section of this lock.

⁵ The algorithm given in Section 5 of [5] had only one Gate ∈ {0, 1} variable; the value of Gate at any time denoted the side opened for the readers. The change to two Gate variables is required to make the final algorithm in Section 5 of this paper (which transforms this single-writer algorithm into a multi-writer algorithm) work correctly.
⁶ Both components of C[d] (and EC) are stored in a single word.

Notations used in this section: We write SW-Write-try (respectively, SW-Read-try) for the Try section code of the writer (respectively, reader) in the single-writer algorithm given in Figure 1. Similarly, we write SW-Write-exit (respectively, SW-Read-exit) for the Exit section code of the writer (respectively, reader).

We first present a simple but incorrect multi-writer algorithm in Figure 2. This algorithm is exactly the same as the multi-writer starvation-free algorithm from [5]. The readers simply execute SW-Read-try followed by SW-Read-exit. The writers first obtain a mutual exclusion lock M, then execute SW-Write-try followed by SW-Write-exit, and finally exit M. As far as the underlying single-writer protocol is concerned, there is only one writer executing at any time, and it executes exactly the same steps as in the multi-writer version. Hence one can easily see that the algorithm satisfies (P1)-(P6). In fact it also satisfies the fast switch from writer to readers (PF3): say a writer is in the CS (say from side d); all the readers in the waiting room are waiting for Gate[d] to open, so when the writer opens Gate[d] in SW-Write-exit(), all the waiting readers get enabled.

procedure Write-lock()
    REMAINDER SECTION
 1. acquire(M)
 2. SW-Write-try()
    CRITICAL SECTION
 3. SW-Write-exit()
 4. release(M)

procedure Read-lock()
    REMAINDER SECTION
 5. SW-Read-try()
    CRITICAL SECTION
 6. SW-Read-exit()

Fig. 2. Simple but incorrect Phase-Fair Multi-Writer Multi-Reader algorithm

But this algorithm does not satisfy Fair switch from readers to writer (PF2). To see this, consider the following scenario. Say a reader is in the CS and no other processes are active. Then a writer w enters the Try section and executes the doorway of the lock M; hence w is in the waiting room of the multi-writer algorithm. At this point all the writers are still in the Remainder section of SW, because the only active writer w has not even started SW-Write-lock. This means that any new reader who begins its Try section now can go past w into the CS, thus violating (PF2). As mentioned in the previous section, SW satisfies the (WP1) property: if the writer completes the doorway before a reader starts its doorway, then the reader does not enter the CS before the writer. Also note that in the doorway of SW, the writer just flips the direction variable D, and at that point Gate[D] is closed. So a tempting idea to overcome the troubling scenario described above is to make the incoming writer execute the doorway (toggle D) before it enters the waiting room of the multi-writer algorithm (essentially the waiting room in acquire(M)). One obvious problem with this approach is that the direction variable D will keep flipping as new writers enter the waiting room, thus violating the invariants and correctness of SW. Another idea is that the exiting writer, say w, before exiting SW (which just comprises opening Gate[currD]), flips the direction variable D. This way only the readers currently waiting are enabled, but any new readers starting their Try section will be blocked. This idea works in the cases where there is some writer in the Try section when w is exiting. If w is the only active writer in the system, and it flips the direction

variable D in the Exit section, then Gate[D] will stay closed until the next writer flips D again. Hence, starvation freedom and concurrent entry might be in danger. One way to prevent this might be for w, before opening Gate[d], to check whether there are any writers in the Try section, and flip the direction only if it sees one. One inherent difficulty with this idea is that if w does not see any writers in the Try section and is poised to open Gate[d] for the waiting readers, and just then a bunch of writers enter the waiting room, and w (having seen none) opens the Gate for the waiting readers, the property PF2 might be in jeopardy. In this case, one might be tempted to think that one of these writers in the Try section should flip the direction. But which of these writers should flip the direction? Should the writer with the best priority (the one with the smallest token in the lock M), say w∗, flip the direction? But if w∗ sleeps before flipping the direction and many other writers enter the waiting room while some reader is in the CS, again (PF2) is in danger.

From the discussion above we can see the challenges in designing a multi-writer algorithm satisfying all of the phase-fairness properties. In particular, the (PF2) property seems hard to achieve. Before we describe our correct phase-fair multi-writer algorithm presented in Figure 3, we lay down some essential invariants necessary for any phase-fair algorithm in which the readers simply execute the Read-lock procedure of SW, and the writers obtain a mutual exclusion lock M before executing the Write-lock procedure of SW.

i. If no writers are active, then Gate[D] = true. (Required for starvation freedom and concurrent entry.)
ii. If a reader is in the CS and a writer is in the waiting room, then Gate[D] = false. (Required for PF2.)

Now we are ready to describe our correct phase-fair multi-writer algorithm given in Figure 3. First we give the overall idea.
5.1 Informal Description of the Algorithm in Figure 3

The algorithm given in Figure 3 follows the lines of the discussion above. Both readers and writers use the underlying single-writer protocol SW. The Read-lock procedure is the same as in SW. The writers first obtain a Mutual Exclusion lock M and then execute the Write-lock procedure of SW. We make one slight but crucial change in the way processes execute SW: we replace the direction variable D with a fetch&add variable Y. When a process wants to know the current direction, it calls the procedure GetD(), which in turn reads Y to determine the appropriate direction. The crux of the algorithm lies in the use of Y, which we explain later. We denote Lines 4-12 of SW, i.e., the waiting room of Write-lock, by SW-w-waiting. Similarly, SW-r-exit corresponds to the Exit section of the reader in SW, i.e., Lines 23-27 of SW.

5.2 Shared Variables and Their Purpose

The shared variables Gate, C, EC, Permit, ExitPermit and the writer's local variables currD and prevD have the same type and purpose as in SW. As mentioned earlier, the direction

Shared Variables:
Y is a fetch&add variable with two components [dr ∈ {0, 1}, wcount ∈ N], initialized to [0, 0]
ExitPermit is a Boolean read/write variable
∀d ∈ {0, 1}, Permit[d] is a Boolean read/write variable
∀d ∈ {0, 1}, Gate[d] is a Boolean read/write variable; initially Gate[0] is true and Gate[1] is false
EC is a fetch&add variable with two components [writer-waiting ∈ {0, 1}, reader-count ∈ N], initially EC = [0, 0]
∀d ∈ {0, 1}, C[d] is a fetch&add variable with two components [writer-waiting ∈ {0, 1}, reader-count ∈ N], initialized to [0, 0]

procedure Write-lock()
    REMAINDER SECTION
 1. F&A(Y, [0, 1])
 2. acquire(M)
 3. currD ← GetD()
 4. prevD ← ¬currD
 5. SW-w-waiting()
    CRITICAL SECTION
 6. F&A(Y, [1, −1])
 7. Gate[currD] ← true
 8. release(M)

procedure Read-lock()
    REMAINDER SECTION
 9. d ← GetD()
10. F&A(C[d], [0, 1])
11. d′ ← GetD()
12. if (d ≠ d′)
13.     F&A(C[d′], [0, 1])
14.     d ← GetD()
15.     if (F&A(C[¬d], [0, −1]) = [1, 1])
16.         Permit[¬d] ← true
17. wait till Gate[d]
    CRITICAL SECTION
18. SW-r-exit()

procedure GetD()
19. (dr, wc) ← Y
20. if (wc = 0)
21.     return dr
22. return ¬dr

Fig. 3. Phase-Fair Multi-Writer Multi-Reader Algorithm. The doorway of Write-lock comprises Line 1 and the doorway of acquire(M). The doorway of Read-lock comprises Lines 9-16.

variable D from SW has been replaced by a new fetch&add variable Y. We now describe this variable in detail.

Y: fetch&add variable updated only by the writers and read by both the readers and the writers. Y has two components [dr, wcount]. The component dr ∈ {0, 1} is used to indicate the direction of the writers (corresponding to the replaced direction variable D). Intuitively, dr is the direction of the last writer to be in the CS. The wcount component keeps a count of the writers in the Try section and the CS. A writer increments the wcount component at the beginning of its Try section, and in its Exit section it decrements the wcount component and flips the dr component. We assume the dr bit is the most significant bit of the word, hence the writer only has to add 1 at the most significant bit to flip it.

How is Y able to give the appropriate direction? When a writer flips the dr component in the Exit section, there are two possibilities: either no writer is in the Try section or some writer is present in the Try section. In the former case, this flipping of the direction should not change the "real" direction, and in the latter case it should. Here is where the component wcount comes into play. If no writer is present and some reader r reads Y to determine the direction (Lines 9, 11 or 14), r will notice that the wcount component of Y is zero, hence it will infer the direction to be dr, i.e., the direction of the last writer to be in the CS. On the other hand, if some writer is present in the Try section or CS,

then wcount ≠ 0, so r can infer that some writer started an attempt since the last writer exited, hence it will take the direction to be the complement of dr. This mapping of the value of Y to the appropriate direction of the writer is done by the procedure GetD() (Lines 19-22). Now we explain the algorithm in detail, line by line.

5.3 Line by Line Commentary

The Read-lock procedure is exactly the same as in SW, with the only difference that instead of reading D at Lines 9, 11 or 14, the reader takes the return value of the procedure GetD(), which in turn extracts the direction from the value of Y as described above. The doorway of the reader comprises Lines 9-16.

Now we explain the Write-lock procedure in detail. In the Write-lock procedure, a writer w first increments the wcount component of Y (Line 1). Note that if wcount of Y was zero just before w executed Line 1, then w has implicitly flipped the direction, and if it was already non-zero then the direction is unchanged. Now w tries to acquire the lock M (Line 2) and proceeds to Line 3 when it is in the CS of lock M. Also note that the configuration of SW at this point is as if w had already executed its doorway (of SW); we will get into this detail when we explain the code of the Exit section of Write-lock (Lines 6-8). w sets its local variables currD and prevD appropriately (Lines 3-4). Then w executes the waiting room of SW (Lines 4-12 of Figure 1) to compete with the readers (Line 5). Once out of this waiting room, w enters the CS, as it is assured that no process (reader or writer) is present in the CS. Before we describe the Exit section, note that all the readers currently in the waiting room are waiting on Gate[currD], and at this point both Gate[0] and Gate[1] are closed.
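The mapping from Y to the current direction, as just described, can be modeled in a few lines (a sequential toy model of ours; Y is a plain list here rather than a hardware fetch&add word):

```python
# Toy model of the direction logic carried by Y = [dr, wcount]:
# dr is the direction of the last writer to be in the CS, wcount counts
# writers in the Try section and CS.

Y = [0, 0]  # [dr, wcount]

def get_d():
    """GetD(): infer the current direction from Y (Lines 19-22)."""
    dr, wc = Y
    return dr if wc == 0 else 1 - dr   # complement of dr if a writer is active

def writer_enter():
    Y[1] += 1                 # Line 1: F&A(Y, [0, 1])

def writer_exit():
    Y[0] = 1 - Y[0]           # Line 6: flip dr ...
    Y[1] -= 1                 # ... and decrement wcount in one atomic step

assert get_d() == 0           # no writer active: last writer's side
writer_enter()                # a writer arrives: direction flips implicitly
assert get_d() == 1
writer_exit()                 # it leaves with no writer behind it:
assert get_d() == 1           # dr flipped AND wcount hit 0, so the
                              # observable direction does not change
```

The last two assertions show exactly the invariant the text argues for: when the exiting writer is the last active one, flipping dr and zeroing wcount cancel out, so readers see an unchanged direction.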
In the first statement of the Exit section, the writer flips the dr component of Y and at the same time decrements the wcount component (Line 6).⁷ Note that at this point (when PCw = 7), if there are no active writers other than w in the system, wcount is zero and the direction is exactly the same as currD. Hence invariant i. mentioned in the previous subsection holds. Similarly, if there is some writer present in the Try section (Y.wcount > 0), then w has flipped the direction to ¬currD; essentially, w has executed the doorway of SW for the next writer. At Line 7, w enables the waiting readers by opening Gate[currD]. Note that at this point, if there is a writer in the waiting room, the direction is equal to ¬currD and Gate[¬currD] is closed, hence invariant ii. from the previous subsection holds. Finally, w releases the lock M (Line 8).

The following theorem summarizes the properties of this algorithm.

Theorem 2 (Multi-Writer Multi-Reader Phase-fair lock). The algorithm in Figure 3 implements a Multi-Writer Multi-Reader lock satisfying the properties (P1)-(P6) and (PF2), (PF3), using the lock M from [13] and the algorithm in Figure 1. The RMR complexity of the algorithm in the CC model is O(1). The algorithm uses O(m) shared variables that support read, write, and fetch&add, where m is the number of writers in the system.

⁷ We assume that the dr bit is the most significant bit of the word storing Y, and any overflow bit is simply dropped. Hence w only has to fetch&add [1, −1] to Y to atomically decrement wcount and flip dr.

References

1. Brandenburg, B.B., Anderson, J.H.: Reader-writer synchronization for shared-memory multiprocessor real-time systems. In: ECRTS 2009: Proceedings of the 21st Euromicro Conference on Real-Time Systems, pp. 184–193. IEEE Computer Society, Washington, DC (2009)
2. Brandenburg, B.B., Anderson, J.H.: Spin-based reader-writer synchronization for multiprocessor real-time systems. Submitted to Real-Time Systems (December 2009), http://www.cs.unc.edu/˜anderson/papers/rtj09-for-web.pdf
3. Dijkstra, E.W.: Solution of a problem in concurrent programming control. Commun. ACM 8(9), 569 (1965)
4. Courtois, P.J., Heymans, F., Parnas, D.L.: Concurrent control with "readers" and "writers". Commun. ACM 14(10), 667–668 (1971)
5. Bhatt, V., Jayanti, P.: Constant RMR solutions to reader writer synchronization. In: PODC 2010: Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, pp. 468–477. ACM, New York (2010)
6. Lamport, L.: A new solution of Dijkstra's concurrent programming problem. Commun. ACM 17(8), 453–455 (1974)
7. Mellor-Crummey, J.M., Scott, M.L.: Scalable reader-writer synchronization for shared-memory multiprocessors. In: PPOPP 1991: Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 106–113. ACM, New York (1991)
8. Lev, Y., Luchangco, V., Olszewski, M.: Scalable reader-writer locks. In: SPAA 2009: Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, pp. 101–110. ACM, New York (2009)
9. Hadzilacos, V., Danek, R.: Local-spin group mutual exclusion algorithms. In: Liu, H. (ed.) DISC 2004. LNCS, vol. 3274, pp. 71–85. Springer, Heidelberg (2004)
10. Cypher, R.: The communication requirements of mutual exclusion. In: SPAA 1995: Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 147–156. ACM, New York (1995)
11. Attiya, H., Hendler, D., Woelfel, P.: Tight RMR lower bounds for mutual exclusion and other problems. In: STOC 2008: Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pp. 217–226. ACM, New York (2008)
12. Anderson, J.H., Kim, Y.-J.: An improved lower bound for the time complexity of mutual exclusion. Distrib. Comput. 15(4), 221–253 (2002)
13. Anderson, T.E.: The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst. 1(1), 6–16 (1990)

On the Performance of Distributed Lock-Based Synchronization Yuval Lubowich and Gadi Taubenfeld The Interdisciplinary Center, P.O. Box 167, Herzliya 46150, Israel {yuval,tgadi}@idc.ac.il

Abstract. We study the relation between two classical types of distributed locking mechanisms, called token-based locking and permission-based locking, and several distributed data structures which use locking for synchronization. We have proposed, implemented and tested several lock-based distributed data structures, namely, two different types of counters called find&increment and increment&publish, a queue, a stack and a linked list. For each of them we have determined the preferred type of lock to use as the underlying locking mechanism. Furthermore, we have determined which of the two proposed counters is better suited for use either as a stand-alone data structure or as a building block for implementing other high-level data structures.

Keywords: Locking, synchronization, distributed mutual exclusion, distributed data structures, message passing, performance analysis.

1 Introduction

1.1 Motivation and Objectives

Simultaneous access to a data structure shared among several processes, in a distributed message passing system, must be synchronized in order to avoid interference between conflicting operations. Distributed mutual exclusion locks are the de facto mechanism for concurrency control on distributed data structures. A process accesses the data structure only while holding the lock, and hence the process is guaranteed exclusive access. Over the years a variety of techniques have been proposed for implementing distributed mutual exclusion locks. These locks can be grouped into two main classes: token-based locks and permission-based locks. In token-based locks, a single token is shared by all the processes, and a process acquires the lock (i.e., is allowed to enter its critical section) only when it possesses the token. Permission-based locks are based on the principle that a process acquires the lock only after having received "enough" permissions from other processes. Our first objective is:

Objective one. To determine which of the two locking techniques – token-based locking or permission-based locking – is more efficient.

Our strategy to achieve this objective is to implement one classical token-based lock (Suzuki-Kasami's lock [28]) and two classical permission-based locks (Maekawa's lock [10] and Ricart-Agrawala's lock [22]), and to compare their performance.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 131–142, 2011. © Springer-Verlag Berlin Heidelberg 2011

It is possible to trivially implement a lock by letting a single pre-defined process (machine) act as an "arbiter", or even by letting all the data structures reside in the local memory of a single process and letting this process impose a definite order between concurrent operations. Such a centralized solution might be preferred in some situations, although it limits the degree of concurrency, imposes an extra load on the arbiter, and is less robust. In this work, we focus on fully distributed implementations of locks. Locks are just a tool used when implementing various distributed applications; thus, our second objective has to do with implementing lock-based data structures.

Objective two. To propose, implement and test several distributed data structures, namely, two different types of counters, a queue, a stack and a linked list; and for each of the data structures to determine the preferred mutual exclusion lock to use as the underlying locking mechanism.

Furthermore, the worst-case message complexity of one of the permission-based locks (i.e., Maekawa's lock) is better than the worst-case message complexity of the other two locks. It would be interesting to find out whether this theoretical result is reflected in our performance analysis results.

In a shared memory implementation of a data structure, the shared data is usually stored in the shared memory. But who should hold the data in a distributed message passing system? One process? All of them? In particular, when implementing a distributed counter, who should hold the current value of the counter? To address this question, for the case of a shared counter, we have implemented and compared two types of shared counters: a find&increment counter, where only the last process to update the counter needs to know its value; and an increment&publish counter, where everybody should know the value of the counter each time the counter is updated.
We notice that in the find&increment counter, once the counter is updated there is no need to "tell" its new value to everybody, but in order to update such a counter one has to find its current value first. In the increment&publish counter the situation is the other way around.

Objective three. To determine which of the two proposed counters is better suited for use either as a stand-alone data structure or as a building block for implementing other high-level data structures (such as a queue, a stack or a linked list).

We point out that, in our implementations of a queue, a stack, and a linked list, the shared data is distributed among all the processes; that is, all the items inserted by a process are kept in the local memory of that process.

1.2 Experimental Framework and Performance Analysis

In order to measure and analyze the performance of the proposed five data structures and of the three locks, we have implemented and run each data structure with each of the three implemented distributed mutual exclusion locks as the underlying locking mechanism. We have measured each data structure's performance, when using each of the locks, on a network with one, five, ten, fifteen and twenty processes, where
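The trade-off between the two counters can be sketched with a crude message-count model (entirely our own simplification: one query message per find, n − 1 publish messages per broadcast; the paper's implementations of course run over real message passing under a lock):

```python
# Schematic cost model of the two counter styles described above.

class FindAndIncrement:
    """Only the last updater knows the value: each increment must first
    find the current value (modeled as one query message)."""
    def __init__(self, n):
        self.n, self.value, self.messages = n, 0, 0

    def increment(self):
        self.messages += 1          # query whoever updated last
        self.value += 1

class IncrementAndPublish:
    """Everybody knows the value: an increment is local, but the new
    value is then published to all n - 1 other processes."""
    def __init__(self, n):
        self.n, self.value, self.messages = n, 0, 0

    def increment(self):
        self.value += 1
        self.messages += self.n - 1  # broadcast the new value

n = 10
a, b = FindAndIncrement(n), IncrementAndPublish(n)
for _ in range(100):
    a.increment()
    b.increment()
assert a.messages < b.messages   # pulling on demand is cheaper here
```

Even this toy model points in the direction of the paper's third finding: when every process increments, paying one query per update beats broadcasting to everyone.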

each process runs on a different node. A typical simulation scenario of a shared counter looked like this: use 15 processes to count up to 15 million by using a find&increment counter that employs Maekawa's lock as its underlying locking mechanism. The queue, stack, and linked list were also tested using each of the two counters as a building block, in order to determine which of the two counters performs better when used as a building block for implementing other high-level data structures. Special care was taken to make the experiments more realistic by preventing runs which would display an overly optimistic performance; for example, preventing runs where a process completes several operations while acquiring and holding the lock once. Our testing environment consisted of 20 Intel Xeon 2.4 GHz machines running the Windows XP OS with 2 GB of RAM and using JRE version 1.4.2_08. All the machines were located inside the same LAN and were connected using a 20-port Cisco switch.

1.3 Our Findings

The experiments, as reported in Section 5, lead to the following conclusions:

1. Permission-based locking always outperforms token-based locking. That is, each of the two permission-based locks always outperforms the token-based lock. This result about locks may suggest that, in general, when implementing distributed data structures it is better to take the initiative and search for information (i.e., ask for permissions) when needed, instead of waiting for your turn.

2. Maekawa's permission-based lock always outperforms the Ricart-Agrawala and Suzuki-Kasami locks when used as the underlying locking mechanism for implementing the find&increment counter, the increment&publish counter, a queue, a stack, and a linked list. Put another way, for each of the five data structures, the preferred lock to use as the underlying locking mechanism is always Maekawa's lock.
The worst-case message complexity of Maekawa's lock is better than the worst-case message complexity of both the Ricart-Agrawala and Suzuki-Kasami locks; thus, the performance analysis supports and confirms the theoretical analysis.

3. The find&increment counter always outperforms the increment&publish counter, either as a stand-alone data structure or when used as a building block for implementing a distributed queue and a distributed stack. This result about counters may suggest that, in general, when implementing distributed data structures it is more efficient to actively search for information only when it is needed, instead of distributing it in advance.

As expected, all the data structures exhibit performance degradation as the number of processes grows.
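The theoretical gap the findings refer to can be checked numerically using the worst-case counts given in Section 2 (n for Suzuki-Kasami, 2(n − 1) for Ricart-Agrawala, c√n with 3 ≤ c ≤ 5 for Maekawa):

```python
import math

# Worst-case messages per critical-section invocation, from Section 2.
def suzuki_kasami(n):
    return n                      # (n - 1) REQUESTs + one PRIVILEGE

def ricart_agrawala(n):
    return 2 * (n - 1)            # (n - 1) REQUESTs + (n - 1) REPLYs

def maekawa(n, c=5):
    return c * math.sqrt(n)       # c between 3 (light) and 5 (heavy traffic)

# Even with the pessimistic c = 5, Maekawa wins for moderately large n.
n = 400
assert maekawa(n) < suzuki_kasami(n) < ricart_agrawala(n)
```

For small n the asymptotic advantage can vanish (5√n exceeds n for n < 25), which is one reason it was worth testing whether the theory shows up in practice on a 20-node cluster.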

2 Distributed Mutual Exclusion Algorithms

We consider a system which is made up of n reliable processes, denoted p1, ..., pn, which communicate via message passing. We assume that the reader is familiar with the definition of the mutual exclusion problem [4, 29]. The three mutual exclusion algorithms (i.e., locks) implemented in this work satisfy the mutual exclusion and starvation freedom requirements.

The first published distributed mutual exclusion algorithm, due to Lamport [8], is based on the notion of logical clocks. Over the years a variety of techniques have been proposed for implementing distributed mutual exclusion locking algorithms [19, 20, 25, 26]. These algorithms can be grouped into two main classes: token-based algorithms [13, 18, 28, 27] and permission-based algorithms [3, 8, 10, 12, 22, 24].

– In token-based algorithms, a single token is shared by all the processes, and a process is allowed to enter its critical section only when it possesses the token. A process continues to hold the token until its execution of the critical section is over, and then it may pass the token to some other process.

– Permission-based algorithms are based on the principle that a process may enter its critical section only after having received "enough" permissions from other processes. Some permission-based algorithms require that a process receive permissions from all of the other processes, whereas other, more efficient algorithms require a process to receive permissions from a smaller group.

Below we describe the basic principles of the three known distributed mutual exclusion algorithms that we have implemented.

2.1 Suzuki-Kasami's Token-Based Algorithm

In Suzuki and Kasami's algorithm [28], the privilege to enter a critical section is granted to the process that holds the PRIVILEGE token (which is always held by exactly one process). Initially process p1 has the privilege. A process requesting the privilege sends a REQUEST message to all other processes. A process receiving a PRIVILEGE message (i.e., the token) is allowed to enter its critical section repeatedly until it passes the PRIVILEGE to some other process. A REQUEST message of process pj has the form REQUEST(j, m), where j is the process identifier and m is a sequence number which indicates that pj is requesting its (m + 1)'th critical section invocation.
Each process has an array RN of size n, where n is the number of processes. This array is used to record the largest sequence number ever received from each one of the other processes. When a REQUEST(j, m) message is received by pi, the process updates RN by executing RN[j] = max(RN[j], m). A PRIVILEGE message has the form PRIVILEGE(Q, LN), where Q is a queue of requesting processes and LN is an array of size n such that LN[j] is the sequence number of the request of pj granted most recently. When pi finishes executing its critical section, the array LN, contained in the last PRIVILEGE message received by pi, is updated by executing LN[i] = RN[i], indicating that the current request of pi has been granted. Next, every process pj such that RN[j] = LN[j] + 1 is appended to Q, provided that pj is not already in Q. When these updates are completed, if Q is not empty then PRIVILEGE(tail(Q), LN) is sent to the process at the head of Q. If Q is empty then pi retains the privilege until some process requests it. The algorithm requires, in the worst case, n message exchanges per mutual exclusion invocation: (n − 1) REQUEST messages and one PRIVILEGE message. In the best case, when the process requesting to enter its critical section already holds the privilege token, the algorithm requires no messages at all.
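The token bookkeeping described above can be sketched as follows. This is a single-node simulation in Python; the message passing is elided, and the class and field names are our own illustration, not taken from the paper's Java implementation:

```python
class SuzukiKasamiNode:
    """Sketch of the Suzuki-Kasami bookkeeping: RN[j] records the largest
    sequence number received from p_j; the PRIVILEGE token carries (Q, LN),
    where LN[j] is the most recently granted request of p_j."""
    def __init__(self, pid, n, has_token=False):
        self.pid, self.n = pid, n
        self.RN = [0] * n
        self.token = ([], [0] * n) if has_token else None  # (Q, LN)

    def on_request(self, j, m):
        # REQUEST(j, m): remember the largest sequence number seen from p_j.
        self.RN[j] = max(self.RN[j], m)

    def release(self):
        # Called on exit from the critical section, while holding the token.
        Q, LN = self.token
        LN[self.pid] = self.RN[self.pid]       # current request granted
        for j in range(self.n):                # append fresh requesters
            if j not in Q and self.RN[j] == LN[j] + 1:
                Q.append(j)
        if Q:
            head = Q.pop(0)
            self.token = None                  # PRIVILEGE(tail(Q), LN) sent to head
            return head
        return None                            # keep the token until requested

node = SuzukiKasamiNode(0, 3, has_token=True)
node.RN[0] = 1                 # p0 issued REQUEST(0, 1) and entered its CS
node.on_request(1, 1)          # REQUEST(1, 1) arrives
node.on_request(2, 1)          # REQUEST(2, 1) arrives
assert node.release() == 1     # token passed to p1; p2 remains queued
```

Note how in the best case (no pending requests), `release` returns `None` and the node keeps the token, matching the zero-message case described above.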

On the Performance of Distributed Lock-Based Synchronization

135

2.2 Ricart-Agrawala's Permission-Based Algorithm

The first permission-based algorithm, due to Lamport [8], has 3(n − 1) message complexity. Ricart and Agrawala modified Lamport's algorithm and were able to achieve 2(n − 1) message complexity [22]. In this algorithm, when a process, say pi, wants to enter its critical section, it sends a REQUEST(m, i) message to all other processes. This message contains a sequence number m and the process' identifier i, which are then used to define a priority among requests. Process pj, upon receipt of a REQUEST message from process pi, sends an immediate REPLY message to pi if either pj itself has not requested to enter its critical section, or pj's request has a lower priority than that of pi. Otherwise, process pj defers its REPLY (to pi) until its own (higher priority) request is granted. Process pi enters its critical section when it receives REPLY messages from all the other n − 1 processes. When pi releases the critical section, it sends a REPLY message to all deferred requests. Thus, a REPLY message from process pi implies that pi has finished executing its critical section. This algorithm requires only 2(n − 1) messages per critical section invocation: n − 1 REQUEST messages and n − 1 REPLY messages.

2.3 Maekawa's Permission-Based Algorithm

In Maekawa's algorithm [10], process pi acquires permission to enter its critical section from a set of processes, denoted Si, which consists of approximately √n processes that act as arbiters. The algorithm uses only c√n messages per critical section invocation, where c is a constant between 3 for light traffic and 5 for heavy traffic. Each process can issue a request at any time. In order to arbitrate requests, any two requests from different processes must be known to at least one arbiter process.
Since process pi must obtain permission to enter its critical section from every process in Si, the intersection of every pair of sets Si and Sj must not be empty, so that processes in Si ∩ Sj can serve as arbiters between conflicting requests of pi and pj. There are many efficient constructions of the sets S1, ..., Sn (see for example [7, 23, 15]). The construction used in our implementation is as follows. Assume that √n is a positive integer (if not, a few dummy processes can be added). Consider a matrix of size √n × √n, where the value of entry (i, j) in the matrix is (i − 1) × √n + j. Clearly, for every k ∈ {1, ..., n} there is exactly one entry, denoted (ik, jk), whose value is k, namely (⌈k/√n⌉, ((k − 1) mod √n) + 1). For each k ∈ {1, ..., n}, the subset Sk is defined to be the set of values on the row and the column passing through (ik, jk). Clearly, Si ∩ Sj ≠ ∅ for all pairs i and j (and the size of each set is 2√n − 1). Thus, whenever two processes pi and pj try to enter their critical sections, the arbiter processes in Si ∩ Sj will grant access to only one of them at a time, and thus the mutual exclusion property is satisfied. By carefully designing the algorithm, deadlock is also avoided.
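The grid construction above can be sketched directly (a Python illustration with names of our own choosing; the assertions check the two properties the construction guarantees, namely pairwise intersection and quorum size 2√n − 1):

```python
import math

def maekawa_quorums(n):
    """Build the row-plus-column quorums S_1, ..., S_n over a sqrt(n) x sqrt(n)
    grid whose entry (i, j) holds the value (i - 1) * sqrt(n) + j.
    Assumes sqrt(n) is a positive integer (otherwise pad with dummies)."""
    r = math.isqrt(n)
    assert r * r == n, "sqrt(n) must be an integer"
    quorums = []
    for k in range(1, n + 1):
        i = (k - 1) // r              # 0-based row of the entry with value k
        j = (k - 1) % r               # 0-based column of that entry
        row = {i * r + c + 1 for c in range(r)}   # values on the row
        col = {rr * r + j + 1 for rr in range(r)} # values on the column
        quorums.append(row | col)
    return quorums

qs = maekawa_quorums(9)
assert all(len(q) == 5 for q in qs)   # each quorum has 2*sqrt(9) - 1 = 5 members
# any two quorums intersect, so one arbiter always sees both requests
assert all(qs[a] & qs[b] for a in range(9) for b in range(9))
```

The intersection property holds because the row of Si always meets the column of Sj in exactly one grid cell.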

3 Distributed Data Structures

We have proposed and implemented five distributed data structures: two types of counters, a queue, a stack and a linked list. Each of these data structure implementations


makes use of an underlying locking mechanism. As already mentioned, we have implemented the three mutual exclusion algorithms described in the previous section, and for each of the five data structures determined the preferred mutual exclusion algorithm to use for locking. Various lock-based data structures have been proposed in the literature, mainly for use in databases; see for example [2, 5, 9]. A distributed dictionary structure is studied in [16]. Below we describe the five distributed data structures that we have studied. All the data structures are linearizable. Linearizability means that, although several processes may concurrently invoke operations on a linearizable data structure, each operation appears to take place instantaneously at some point in time, and the relative order of non-concurrent operations is preserved [6].

Two counters. A shared counter is a linearizable data structure that supports the single operation of incrementing the value of the counter by one and returning its previous value. We have implemented and compared two types of shared counters:
1. A find&increment counter. In this type of counter only the last process to update the counter needs to know its value. In the implementation, a single lock is used, and only the last process to increment the counter knows its current value. A process p that tries to increment the shared counter first acquires the lock. Then p sends a FIND message to all other processes. When the process that knows the value of the counter receives a FIND message, it replies by sending a message with the value of the counter to p. When p receives the message, it increments the counter and releases the lock. (We note that p can keep on incrementing the counter's value until it gets a FIND message.)
2. An increment&publish counter. In this counter every process should know the value of the counter each time it is updated. In the implementation, a single lock is used.
A process that tries to increment the shared counter first acquires the lock. Then, it raises the counter value, sends messages to all other processes informing them of the new counter value, gets acknowledgements, and releases the lock. A queue. A distributed queue is a linearizable data structure that supports enqueue and dequeue operations, by several processes, with the usual queue semantics. We have implemented a distributed queue which consists of local queues residing in the individual processes participating in the distributed queue. A single lock and a shared counter are used for the implementation. Each element in a queue has a timestamp that is generated using the shared counter. An ENQUEUE operation is carried out by raising the counter’s value by one and enqueuing an element in the local queue along with the counter’s value. A DEQUEUE operation is carried out by first acquiring the lock, locating the process that holds the element with the lowest timestamp, removing this element from this process’ local queue, and releasing the lock. A stack. A distributed stack is a linearizable data structure that supports push and pop operations, by several processes, with the usual stack semantics. We have implemented a distributed stack which is similar to the distributed queue. A single lock and a shared counter are used for the implementation. It consists of local stacks residing in the


individual processes participating in the distributed stack. Each element in the stack has a timestamp that is generated by the shared counter. A PUSH operation is carried out by incrementing the counter value by one and pushing the element onto the local stack along with the counter's value. A POP operation is carried out by acquiring the lock, locating the process that contains the element with the highest timestamp, removing this element from its local stack, and releasing the lock.

A linked list. A distributed linked list is a linearizable data structure that supports insertion and deletion of elements at any point in the list. We have implemented a list which consists of a sequence of elements, each containing a data field and two references ("links") pointing to the next and previous elements. Each element can reside in any process. The distributed list also supports the operations "traverse list" and "size of list". The list maintains head and tail "pointers" that can be sent to requesting processes. Each pointer holds a reference to a certain process and a pointer to a real element stored in that process. Manipulating the list requires that a lock be acquired. A process that needs to insert an element at the head of the list acquires the lock, and sends a request for the "head pointer" to the rest of the processes. Whenever the process that holds the "head pointer" receives the message, it immediately replies by sending the pointer to the requesting process. Once the requesting process has the "head pointer", inserting the new element is purely a matter of storing it locally and modifying the "head pointer" to point to the new element (the new element, of course, now points to the element previously pointed to by the "head pointer"). Deleting an element from the head of the list is done in much the same way. Inserting or deleting elements elsewhere in the list requires a process to acquire the (single) lock, traverse the list and manipulate the list's elements.
A process is able to measure the size of the list by acquiring the lock and then querying the other processes about the size of their local lists.
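The timestamp-ordering idea behind the queue (and, symmetrically, the stack) can be sketched as a single-process simulation. This is our own illustration, not the paper's Java code: `threading.Lock` stands in for the distributed lock and an in-process counter stands in for the shared counter:

```python
import itertools
import threading

class DistributedQueue:
    """Sketch of the timestamped distributed queue: each 'node' keeps a
    local queue; dequeue removes the globally oldest element (lowest
    timestamp) under a single global lock."""
    def __init__(self, n_nodes):
        self.lock = threading.Lock()        # stands in for the distributed lock
        self.counter = itertools.count(1)   # stands in for the shared counter
        self.local = [[] for _ in range(n_nodes)]

    def enqueue(self, node, item):
        ts = next(self.counter)             # raise the counter's value by one
        self.local[node].append((ts, item)) # enqueue locally with the timestamp
        return ts

    def dequeue(self):
        with self.lock:                     # acquire the (single) lock
            candidates = [(q[0], i) for i, q in enumerate(self.local) if q]
            if not candidates:
                return None
            (ts, item), node = min(candidates)  # lowest timestamp wins
            self.local[node].pop(0)
            return item

q = DistributedQueue(3)
q.enqueue(2, "a"); q.enqueue(0, "b"); q.enqueue(1, "c")
assert [q.dequeue(), q.dequeue(), q.dequeue()] == ["a", "b", "c"]
```

A stack would differ only in taking `max` instead of `min` and popping from the tail of the local structure.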

4 The Experimental Framework

Our testing environment consisted of 20 Intel Xeon 2.4 GHz machines with 2 GB of RAM, running the Windows XP OS and using JRE version 1.4.2_08. All the machines were located inside the same LAN and were connected using a 20-port Cisco switch. We measured each data structure's performance, using each one of the distributed mutual exclusion algorithms, by running each data structure on a network with one, five, ten, fifteen and twenty processes, where each process runs on a different node of the network. For example, a typical simulation scenario of a shared counter looked like this: use 15 processes to count up to 15 million by using a find&increment counter that employs Maekawa's algorithm as its underlying locking mechanism. All tests were implemented using Coridan's messaging middleware technology, called MantaRay. MantaRay has a fully distributed, server-less architecture in which processes running in the network are aware of one another and, as a result, are able to send messages back and forth directly. We have tested each of the implementations in hours-long, and sometimes days-long, executions on various numbers of processes (machines).


5 Performance Analysis and Results

All the experiments done on the data structures we have implemented start with an initially empty data structure (queue, stack, etc.) on which processes perform a series of operations. For example, in the case of a queue, the processes performed a series of enqueue/dequeue operations. Each process enqueued an element, did "something else", and repeated this a million times. After that, the process dequeued an element, did "something else", and repeated this a million times again. The "something else" consisted of approximately 30 milliseconds of doing nothing and waiting. As with the tests done on the locking algorithms, this served to make the experiments more realistic by preventing long runs by the same process, which would display an overly optimistic performance, as a process may complete several operations while holding the lock. The time a process took to complete the "something else" is not reported in our figures. The experiments, as reported below, lead to the following conclusions:
– Maekawa's permission-based algorithm always outperforms the Ricart-Agrawala and Suzuki-Kasami algorithms when used as the underlying locking mechanism for implementing the find&increment counter, the increment&publish counter, a queue, a stack, and a linked list;
– The find&increment counter always outperforms the increment&publish counter, either as a stand-alone data structure or when used as a building block for implementing a distributed queue and a distributed stack.
As expected, the data structures exhibit performance degradation as the number of processes grows.

5.1 Counters

The two graphs in Figure 1 show the time one process spends performing a single count-up operation, averaged over one million operations per process, using each of the three locking algorithms implemented. As can be seen, the counters perform worst when using the Ricart-Agrawala algorithm and best when using Maekawa's algorithm.
As for comparing the two counters, it is clear that the find&increment counter behaves and scales better than the increment&publish counter as the number of processes grows. The observation that the find&increment counter is better than the increment&publish counter will also become clear when examining the results for the queue and stack implementations that make use of shared counters as building blocks.

5.2 A Queue

The two graphs in Figure 2 show the time one process spends performing a single enqueue operation, averaged over one million operations per process, using each of the three locks. Similar to the performance analysis of the two counters, the queue performs worst when using the Ricart-Agrawala algorithm and best when using Maekawa's algorithm. It is clear that the queue performs better when using the find&increment counter than when using the increment&publish counter.

Fig. 1. The time one process spends performing a single count up operation averaged over one million operations per process, in the find&increment counter and in the increment&publish counter

Fig. 2. The time one process spends performing an enqueue operation averaged over one million operations per process, in a queue employing the find&increment counter and in a queue employing the increment&publish counter

The dequeue operation does not make use of a shared counter. Figure 3 shows the time one process spends performing a single dequeue operation, averaged over one million operations per process, using each of the three locks. Similar to the performance analysis of the enqueue operation, the dequeue operation is the slowest when using the Ricart-Agrawala algorithm, and the fastest when using Maekawa's algorithm.

5.3 A Stack

As expected, the graphs of the performance analysis results for a stack are almost the same as those presented in the previous subsection for a queue, and hence are omitted from this abstract. As in all previous examples, the stack performs worst when using the Ricart-Agrawala algorithm and best when using Maekawa's algorithm. As for comparing the two counters, it is clear that the stack performs better when using the find&increment counter than when using the increment&publish counter.

Fig. 3. The time one process spends performing a dequeue operation averaged over one million operations per process

5.4 A Linked List

The linked list we have implemented does not make use of a shared counter. Rather, it uses the locking algorithm directly to acquire a lock before manipulating the list itself. The graphs in Figure 4 show the time one process spends performing a single insert operation or a single delete operation, averaged over one million operations per process, using each of the three locking algorithms implemented. As in all previous examples, the linked list performs worst when using the Ricart-Agrawala algorithm and best when using Maekawa's algorithm as the underlying locking mechanism.

Fig. 4. The time one process spends performing an insert operation or delete operation averaged over one million operations per process in a linked list

6 Discussion

Data structures such as shared counters, queues and stacks are ubiquitous in programming concurrent and distributed systems, and hence their performance is a matter of concern. While the subject of data structures has been a very active research topic in recent years


in the context of concurrent (shared-memory) systems, this is not the case for distributed (message-passing) systems. In this work, we have studied the relation between classical locks and specific distributed data structures which use locking for synchronization. The experiments consistently revealed that the implementation of Maekawa's lock is more efficient than that of the other two locks, and that the implementation of the find&increment counter is consistently more efficient than that of the increment&publish counter. The fact that Maekawa's lock performs better is, in part, due to the fact that its worst-case message complexity is better. The results suggest that, in general, it is more efficient to actively search for information (or ask for permissions) only when it is needed, instead of distributing it to everybody in advance. Thus, we expect to find similar results for different experimental setups. For our investigation, it is important to implement and use the locks as completely independent building blocks, so that we can compare their performance. In practice, various optimizations are possible. For example, when implementing the find&increment counter using a lock, a simple optimization would be to store the value of the counter along with the lock. Thus, when a process requests and obtains the lock, it obtains the current value of the counter along with the lock, thereby eliminating the need for any FIND messages. Future work would entail implementing and evaluating other locking algorithms [3, 14], and fault-tolerant locking algorithms that do not assume an error-free network [1, 11, 18, 21, 17]. It would also be interesting to consider additional distributed lock-based data structures, and different experimental setups. When using locks, the granularity of synchronization is important. Our implementations are examples of coarse-grained synchronization, as they allow only one process at a time to access the data structure.
It would be interesting to consider data structures which use fine-grained synchronization, in which it is possible to lock "small pieces" of a data structure, allowing several processes with non-interfering operations to access it concurrently. Coarse-grained synchronization is easier to program, but is less efficient and less fault-tolerant than fine-grained synchronization.
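The "store the counter value along with the lock" optimization mentioned in the discussion can be sketched as follows (a hypothetical helper of our own, not part of the paper's implementation; a local `threading.Lock` stands in for the distributed lock token):

```python
import threading

class CountingLock:
    """Sketch of the optimization: the counter value travels with the lock
    token, so an acquirer learns the current value without any FIND round."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0              # piggybacked on the lock token

    def acquire(self):
        self._lock.acquire()
        return self._value           # value comes along with the lock

    def release(self, new_value):
        self._value = new_value      # store the updated value with the lock
        self._lock.release()

def find_and_increment(lock):
    v = lock.acquire()               # no FIND messages needed here
    lock.release(v + 1)
    return v                         # previous value, as the counter spec requires

lock = CountingLock()
assert [find_and_increment(lock) for _ in range(3)] == [0, 1, 2]
```

In the distributed setting, this corresponds to piggybacking the counter value on the lock-granting message, trading lock-protocol payload for the elimination of a full FIND round trip.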

References

1. Agrawal, D., El-Abbadi, A.: An efficient and fault-tolerant solution for distributed mutual exclusion. ACM Transactions on Computer Systems 9(1), 1–20 (1991)
2. Bayer, R., Schkolnick, M.: Concurrency of operations on B-trees. Acta Informatica 9(1), 1–21 (1977)
3. Carvalho, O.S.F., Roucairol, G.: On mutual exclusion in computer networks. Communications of the ACM 26(2), 146–147 (1983)
4. Dijkstra, E.W.: Solution of a problem in concurrent programming control. Communications of the ACM 8(9), 569 (1965)
5. Ellis, C.S.: Distributed data structures: A case study. IEEE Transactions on Computers C-34(12), 1178–1185 (1985)
6. Herlihy, M.P., Wing, J.M.: Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems 12(3), 463–492 (1990)
7. Ibaraki, T., Kameda, T.: A theory of coteries: Mutual exclusion in distributed systems. IEEE Transactions on Parallel and Distributed Systems 4(7), 779–794 (1993)

142

Y. Lubowich and G. Taubenfeld

8. Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21(7), 558–565 (1978)
9. Lehman, P.L., Yao, S.B.: Efficient locking for concurrent operations on B-trees. ACM Transactions on Database Systems 6(4), 650–670 (1981)
10. Maekawa, M.: A √N algorithm for mutual exclusion in decentralized systems. ACM Transactions on Computer Systems 3(2), 145–159 (1985)
11. Mishra, S., Srimani, P.K.: Fault-tolerant mutual exclusion algorithms. Journal of Systems and Software 11(2), 111–129 (1990)
12. Mizuno, M., Nesterenko, M., Kakugawa, H.: Lock-based self-stabilizing distributed mutual exclusion algorithm. In: Proc. 16th Inter. Conf. on Distributed Computing Systems, pp. 708–716 (1996)
13. Naimi, M., Trehel, M.: An improvement of the log n distributed algorithm for mutual exclusion. In: Proc. 7th Inter. Conf. on Distributed Computing Systems, pp. 371–375 (1987)
14. Neilsen, M.L., Mizuno, M.: A DAG-based algorithm for distributed mutual exclusion. In: Proc. 11th Inter. Conf. on Distributed Computing Systems, pp. 354–360 (1991)
15. Neilsen, M.L., Mizuno, M.: Coterie join algorithm. IEEE Transactions on Parallel and Distributed Systems 3(5), 582–590 (1992)
16. Peleg, D.: Distributed data structures: A complexity-oriented view. In: van Leeuwen, J., Santoro, N. (eds.) WDAG 1990. LNCS, vol. 486, pp. 71–89. Springer, Heidelberg (1991)
17. Rangarajan, S., Tripathi, S.K.: A robust distributed mutual exclusion algorithm. In: Toueg, S., Kirousis, L.M., Spirakis, P.G. (eds.) WDAG 1991. LNCS, vol. 579, pp. 295–308. Springer, Heidelberg (1992)
18. Raymond, K.: A tree-based algorithm for distributed mutual exclusion. ACM Transactions on Computer Systems 7(1), 61–77 (1989)
19. Raynal, M.: Algorithms for Mutual Exclusion. The MIT Press, Cambridge (1986); translation of: Algorithmique du parallélisme (1984)
20. Raynal, M.: Simple taxonomy for distributed mutual exclusion algorithms. Operating Systems Review (ACM) 25(2), 47–50 (1991)
21. Reddy, R.L.N., Gupta, B., Srimani, P.K.: New fault-tolerant distributed mutual exclusion algorithm. In: Proc. of the ACM/SIGAPP Symposium on Applied Computing, pp. 831–839 (1992)
22. Ricart, G., Agrawala, A.K.: An optimal algorithm for mutual exclusion in computer networks. Communications of the ACM 24(1), 9–17 (1981); corrigendum in CACM 24(9), 578 (1981)
23. Shou, D., Wang, S.D.: A new transformation method for nondominated coterie design. Information Sciences 74(3), 223–246 (1993)
24. Singhal, M.: A dynamic information-structure mutual exclusion algorithm for distributed systems. IEEE Transactions on Parallel and Distributed Systems 3(1), 121–125 (1992)
25. Singhal, M.: A taxonomy of distributed mutual exclusion. Journal of Parallel and Distributed Computing 18(1), 94–101 (1993)
26. Singhal, M., Shivaratri, N.G.: Advanced Concepts in Operating Systems: Distributed, Database and Multiprocessor Operating Systems. McGraw-Hill, Inc., New York (1994)
27. van de Snepscheut, J.L.A.: Fair mutual exclusion on a graph of processes. Distributed Computing 2, 113–115 (1987)
28. Suzuki, I., Kasami, T.: A distributed mutual exclusion algorithm. ACM Transactions on Computer Systems 3(4), 344–349 (1985)
29. Taubenfeld, G.: Synchronization Algorithms and Concurrent Programming, 423 pages. Pearson/Prentice-Hall (2006) ISBN 0-131-97259-6

Distributed Generalized Dynamic Barrier Synchronization

Shivali Agarwal¹, Saurabh Joshi², and Rudrapatna K. Shyamasundar³

¹ IBM Research, India
² Indian Institute of Technology, Kanpur
³ Tata Institute of Fundamental Research, Mumbai

Abstract. Barrier synchronization is widely used in shared-memory parallel programs to synchronize between phases of data-parallel algorithms. With the proliferation of many-core processors, barrier synchronization has been adapted for higher-level language abstractions in new languages such as X10, wherein the processes participating in barrier synchronization are not known a priori, and the processes in distinct "places" do not share memory. Thus, the challenge here is not only to achieve barrier synchronization in a distributed setting without any centralized controller, but also to deal with the dynamic nature of such synchronization, as processes are free to join and drop out at any synchronization phase. In this paper, we describe a solution for generalized distributed barrier synchronization wherein processes can dynamically join or drop out of barrier synchronization; that is, the participating processes are not known a priori. Using the policy of permitting a process to join only at the beginning of a phase, we arrive at a solution that ensures (i) Progress: a process executing phase k will enter phase k + 1 unless it wants to drop out of synchronization (assuming the phase executions of the processes terminate), and (ii) Starvation Freedom: a new process that wants to join a phase synchronization group that has already started does so in a finite number of phases. The above protocol is further generalized to multiple groups of processes (possibly non-disjoint) engaged in barrier synchronization.

1 Introduction

Synchronization and coordination play an important role in parallel computation. Language constructs for efficient coordination of computation on shared-memory multi-processors and multi-core processors are of growing interest. There is a plethora of language constructs used for realizing mutual exclusion, point-to-point synchronization, termination detection, collective barrier synchronization, etc. A barrier [8] is one of the important busy-wait primitives used to ensure that none of the processes proceed beyond a particular point in a computation until all have arrived at that point. A software implementation of the barrier using shared variables is also referred to as phase synchronization [1,7]. The issue of remote references while realizing barriers has been treated exhaustively in the seminal work [3]. Barrier synchronization protocols, either centralized or

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 143–154, 2011.
© Springer-Verlag Berlin Heidelberg 2011

144

S. Agarwal, S. Joshi, and R.K. Shyamasundar

distributed, have been proposed earlier for the case when the processes that have to synchronize are given a priori [7][5][13][6][14]. With the proliferation of many-core processors, barrier synchronization has been adapted for higher-level language abstractions in new distributed shared-memory based languages such as X10 [16], wherein the processes participating in barrier synchronization are not known a priori. Some of the recent works that address a dynamic number of processes for barrier synchronization are [20][18][21]. More details on existing work on barrier synchronization can be found in Section 6. Surprisingly, a distributed solution to the phase synchronization problem in such dynamic environments has not yet been proposed. In this paper, we describe a distributed solution to the problem of barrier synchronization used as an underlying synchronization mechanism for achieving phase synchronization where processes are dynamically created (in the context of nested parallelism). The challenge lies in arriving at the common knowledge of the processes that want to participate in phase synchronization for every phase in a decentralized manner, such that there are some guarantees on the progress and starvation freedom properties of the processes in addition to the basic correctness property.

1.1 Phase Synchronization Problem

The problem of phase synchronization [7] is described below. Consider a set of asynchronous processes where each process executes a sequence of phases; a process begins its next phase only upon completion of its previous phase (for the moment let us ignore the constitution of a phase). The problem is to design a synchronization scheme which guarantees the following properties:
1. No process begins its (k + 1)-th phase until all processes have completed their k-th phase, k ≥ 0.
2. No process will be permanently blocked from executing its (k + 1)-th phase if all processes have completed their k-th phase, k ≥ 0.
The set of processes that have to synchronize can either be given a priori and remain unchanged, or be a dynamic set in which new processes may join as and when they want to phase-synchronize and existing processes may drop out of phase synchronization. In this paper, we describe a distributed solution for dynamic barrier synchronization in the context of phase synchronization, wherein processes can dynamically join or drop out of phase synchronization. Using the policy of permitting a process to join in a phase subsequent to the phase of registration, we arrive at a solution that ensures (i) Progress: a process executing phase k will enter phase k + 1 unless it wants to drop out of synchronization (assuming the phase executions of the processes terminate), and


(ii) Starvation Freedom: a new process that wants to join a phase synchronization group that has already started does so in a finite number of phases. Our protocol establishes a bound of at most two phases from the phase in which it registered its intention to join¹ the phase synchronization. The lower bound is one phase. The correctness of the solution is formally established. The dynamic barrier synchronization algorithm is further generalized to cater to groups of barrier-synchronizing processes.

2 Barrier Synchronization with Dynamic Set of Processes

We consider a distributed system which gets initialized with a non-empty set of processes. New processes can join the system at will, and existing processes may drop out of the system when they are done with their work. They carry out individual computations in phases and synchronize with each other at the end of each phase. Since it is a distributed system with no centralized control and no a priori knowledge of the number of processes in the system, each process has to dynamically discover the new processes that have joined the system in such a manner that a new process can start synchronizing with them in a finite amount of time. The distributed barrier synchronization protocol described below deals with this issue of including new processes in the ongoing phase synchronization in a manner that ensures progress of existing as well as newly joined processes. It also handles the processes that drop out of the system, so that existing processes know that they do not have to wait on these for commencing the next phase. Note that there is no a priori limit on the number of processes. The abstract linguistic constructs for registration and synchronization of processes are described in the following.

2.1 Abstract Language

We base our abstract language for the barrier synchronization protocol on X10. The relevant syntax is shown in figure 1 and explained below:

<Program>    ::= <async Proc> || <async Proc>
<Proc>       ::= <clockDec>; <stmtseq> | clocked <clock-id> <stmtseq>
<clockDec>   ::= new clock c1, c2, ...
<stmtseq>    ::= <basic-stmt> | <basic-stmt> <stmtseq>
<basic-stmt> ::= async Proc | atomic stmt | seq stmt | c.register | c.drop | next
clock-id     ::= c1, c2, ...

Fig. 1. Abstract Clock Language for Barrier Synchronization

¹ Starvation Freedom is guaranteed only for processes that are registered.


S. Agarwal, S. Joshi, and R.K. Shyamasundar

• Asynchronous activities: The keyword to denote asynchronous processes is async. The async is used with an optional place expression and a mandatory code block. A process that creates another process is said to be the parent of the process it creates.

• Clock synchronization: Special variables of type clock are used for barrier synchronization of processes. A clock corresponds to the notion of a barrier. A set of processes registered with a clock synchronize with each other w.r.t. that clock. A barrier synchronization point in a process is denoted by next. If a process is registered on multiple clocks, then next denotes synchronization on all of them. This makes the barrier synchronization deadlock-free. The abstraction of phase synchronization through clocks enables the formation of groups of processes such that groups can merge or disjoin dynamically for synchronization.

Some important points regarding the dynamic joining rule for phase synchronization are:
– A process registered on clock c can create a child process synchronizing on c via async clocked c {body}. The child process joins the phase synchronization in phase k + 1 if the parent is in phase k while executing async.
– A process can register with a clock c using c.register. It will join the phase synchronization from phase k + 1 or k + 2 if the clock is in phase k at the time of registration.

Some important points regarding dropping out of phase synchronization are:
– A process that drops in phase k is dropped out in the same phase and is not allowed to create child processes that want to join phase synchronization in that phase. Note that this does not restrict the expressiveness of the language in any way and ensures a clean way of dropping out.
– A process that registers after dropping loses the information about its parent and is treated as a process whose parent is not clocked on c.
– An implicit c.drop is assumed when a process registered on clock c terminates.
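The joining and dropping rules can be mimicked by a small centralized sketch (Python; illustrative only, the paper's protocol is distributed and has no central manager). A process registered in phase k is admitted to the member set at the next phase boundary, so it synchronizes from phase k+1 onward; dropping removes it immediately.

```python
import threading

class DynamicBarrier:
    """Centralized sketch of the joining/dropping semantics, NOT the paper's
    distributed protocol: a registrant becomes a member at the next phase
    boundary, so it synchronizes from the phase after registration."""

    def __init__(self, initial):
        self.cond = threading.Condition()
        self.phase = 0
        self.members = set(initial)  # processes synchronizing in the current phase
        self.arrived = set()
        self.joining = set()         # registered; admitted from the next phase

    def _advance_if_complete(self):
        # advance when every current member has arrived (or none remain)
        if self.members <= self.arrived and (self.members or self.joining):
            self.members |= self.joining
            self.joining = set()
            self.arrived = set()
            self.phase += 1
            self.cond.notify_all()

    def register(self, pid):         # like c.register
        with self.cond:
            self.joining.add(pid)
            self._advance_if_complete()   # covers the all-members-gone corner
            self.cond.wait_for(lambda: pid in self.members)
            return self.phase        # first phase pid participates in

    def next(self, pid):             # barrier point, like the language's next
        with self.cond:
            self.arrived.add(pid)
            my_phase = self.phase
            self._advance_if_complete()
            self.cond.wait_for(lambda: self.phase > my_phase)

    def drop(self, pid):             # like c.drop
        with self.cond:
            self.members.discard(pid)
            self.arrived.discard(pid)
            self._advance_if_complete()

first_phase = {}

def initial_worker(b, pid, phases):
    for _ in range(phases):
        b.next(pid)
    b.drop(pid)

def late_worker(b, pid, phases):
    first_phase[pid] = b.register(pid)
    for _ in range(phases):
        b.next(pid)
    b.drop(pid)

b = DynamicBarrier({"A", "B"})
threads = [threading.Thread(target=initial_worker, args=(b, p, 4)) for p in ("A", "B")]
threads.append(threading.Thread(target=late_worker, args=(b, "C", 2)))
for t in threads: t.start()
for t in threads: t.join()
assert first_phase["C"] >= 1  # C joins strictly after the phase it registered in
```

The point of the sketch is only the joining semantics: because admission happens at a phase boundary, every process knows the exact member set of the phase it is in. Sections 2 and 3 achieve the same guarantee without any central object.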
We now provide a solution, in the form of a protocol, for the distributed dynamic barrier synchronization problem that provably obeys the above-mentioned dynamic joining rules.

3 Distributed Barrier Synchronization Solution

The distributed barrier synchronization protocol for a single clock is given in figure 2. The figure describes the protocol for the barrier operations: initialization, synchronization and drop. The notation used in the solution is described below.

Notation:
– We denote the ith process by Ai (also referred to as process i).
– Phases are tracked in terms of k − 1, k, k + 1, · · · .
– We use the guarded command notation [2] for describing our algorithm, as it captures interleaved execution and termination in a structured manner.


Assumption: The processes do not have random failures and will always call c.drop if they want to leave the phase synchronization.

3.1 Correspondence between Protocol Steps and Clock Operations

The correspondence of the clock operations with the protocol operations is given below:

• new clock c: Creation of a clock (barrier) corresponds to the creation of a special process Ac that executes as follows, where the code blocks INIT_c, SYNC and CONSOLIDATE are shown in figure 2:

INIT_c;
while (true) { SYNC; CONSOLIDATE; }

Note on Ac: It is a special process that exists till the program terminates. It acts as a point of contact for processes in case of explicit registration through c.register, as seen below, without introducing any centralized control.

• next: A process Ai already in phase synchronization performs a next for barrier synchronization. A next corresponds to:

SYNC; CONSOLIDATE;

• Registration through clocked: A process Ai can register through clocked at the time of creation, in which case it gets into the list Aj.registered of its parent process Aj. In this case, Ai joins from the next phase. The specific code that gets executed in the parent for a clocked process is:

INIT_i;
A_j.registered := A_j.registered + A_i;

The code that gets executed in Ai is: while (!A_i.proceed);

• Registration through c.register: If Ai registers like this, then it may join phase synchronization within at most the next two phases. The following code gets executed in Ai:

INIT_i;
A_c.registered := A_c.registered + A_i;
while (!A_i.proceed);

• c.drop: Process Ai drops out of phase synchronization through c.drop (see fig. 2). The code that gets executed is: DROP;

Note: 1) Though we assume Ac to exist throughout the program execution, our algorithm is robust with respect to graceful termination of Ac, that is, Ac may terminate after completing CONSOLIDATE when there are no processes in Ac.registered upon consolidation. The only impact on phase synchronization is that no new processes can register through c.register. 2) The assignments are done atomically.


3.2 How the Protocol Works

The solution achieves phase synchronization by ensuring that the set of processes that enter a phase is common knowledge to all the processes. Attaining common knowledge of the existence of new processes and the non-existence of dropped processes in every phase is the non-trivial part of the phase synchronization protocol in a dynamic environment. The machinery built to solve this problem is shown in figure 2 and described below in detail.

Protocol Variables:
Ac – the special clock process for clock c.
Ai.executing – the current phase that process i is executing.
Ai.proceed – used for allowing new processes to join the active phase.
Ai.next – the next phase that process i wants to execute.
Ai.Iconcurrent – the set of processes executing the phase Ai.executing.
Ai.newIconcurrent – the set of new processes that will be part of the next phase.
Ai.registered – the set of new processes that want to enter phase synchronization with Ai.
Ai.newsynchproc – the set of new processes registered with process i that will synchronize from the next phase.
Ai.drop – when a process wants to drop (or terminates), it sets Ai.drop to true and exits.
Ai.checklist – the subset of Ai.Iconcurrent carried to the next phase for synchronization.

Ai.Iconcurrent denotes the set of processes that Ai is synchronizing with in a phase.

This set may shrink or expand after each phase, depending on whether an existing process drops or a new process joins, respectively. The variable Ai.newsynchproc is assigned the set of processes that Ai wants the other processes to include for synchronization from the next phase onwards. The variable Ai.newIconcurrent is used to accumulate the processes that will form Ai.Iconcurrent in the next phase. Ai.executing denotes the current phase of Ai and Ai.next denotes the next phase that the process will move to.

INIT_c: This initializes the special clock process Ac that is started at the creation of a clock c. Note that the clock process is initialized to itself for the set of initial processes that it has to synchronize with. Ac.proceed is set to true to start the synchronization.

INIT_i: When a process registers with a clock, Ai.proceed is set to false. The newly registered process waits for Ai.proceed to be made true, which is done in the CONSOLIDATE block of the process that contains Ai in its registered set. The rest of the variables are also set in this CONSOLIDATE block.

In the following, we explain the protocol for SYNC and CONSOLIDATE.

SYNC: This is the barrier synchronization stage of a process and performs the following main functions: 1) checks if all the processes in the phase are ready to move to the next phase; Ai.next is used to denote the completion of a phase and to check for others in the phase; 2) informs the other processes about the new processes that have to join from the next phase; 3) establishes that the processes have exchanged the relevant information, so that each can consolidate the information required for the execution of the next phase.


The new processes that are registered with Ai form the set Ai.newsynchproc. This step is required to capture a local snapshot. Note that for processes other than the clock process, Ai.registered will be the same as Ai.newsynchproc. However, for the special clock process Ac, Ac.registered may keep changing during the SYNC execution. Therefore, we need to take a snapshot so that a consistent set of processes that have to be included from the next phase can be conveyed to the other processes present in the synchronization. The increment of Ai.next denotes that the process has effectively completed the phase and is preparing to move to the next phase. Note that after this operation the difference between Ai.next and Ai.executing becomes 2, denoting the transition.

The second part of SYNC is a do-od loop that forms the crux of barrier synchronization. There are three guarded commands in this loop, which are explained below.
1. The first guarded command checks if there exists a process j in Ai.Iconcurrent that has also reached barrier synchronization. If the value of Aj.next is greater than or equal to Ai.next, then Aj has also reached the barrier point. If this guard evaluates to true, then that process is removed from Ai.Iconcurrent and the new processes that registered with Aj are added to the set Ai.newIconcurrent.
2. The second guard checks if any process in Ai.Iconcurrent has dropped out of synchronization, and accordingly the set Ai.Iconcurrent is updated.
3. The third guard is true if process j has not yet reached the barrier synchronization point. The statement associated with this guard is a no-op. It is this statement that forms the waiting part of barrier synchronization.

By the end of this loop, Ai.Iconcurrent contains only Ai. The current phase, denoted by Ai.executing, is then incremented to denote that the process can start the next phase.
However, to ensure that the local snapshot captured in Ai.newsynchproc is properly conveyed to the other processes participating in phase synchronization, another do-od loop is executed that checks whether the processes have indeed moved to the next phase by incrementing Ai.executing.

CONSOLIDATE: After ensuring that Ai has synchronized on the barrier, a final round of consolidation is performed to prepare Ai for executing the next phase. This consolidation is described under the label CONSOLIDATE. The set of processes that Ai needs to phase synchronize with is in Ai.newIconcurrent; therefore, Ai.Iconcurrent is assigned Ai.newIconcurrent. All the new processes that will join from phase Ai.executing are signalled to proceed after being initialized properly. The set Ai.registered is updated to ensure that it contains only those new processes that got registered after the value of Ai.registered was last read in SYNC. This is possible because of the explicit registration that is allowed through the special clock process.

DROP: Ai.drop is set to true so that the corresponding guarded command in SYNC can become true. The restriction posed on a drop command ensures that Ai.registered will be empty, and thus the starvation freedom guarantee is preserved.
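The SYNC;CONSOLIDATE steps can be replayed deterministically in a sequential sketch (Python; our own illustration, with dictionaries standing in for the shared variables and a schedule in which every admitted process has already reached the barrier, so all guards are enabled). It shows how a process registered with the clock in phase k becomes known to everyone during phase k's SYNC and executes together with them from the next phase.

```python
def make_proc(pid):
    # state of a process already admitted to the synchronization (cf. INIT_c)
    return {"id": pid, "executing": 0, "next": 1, "Iconcurrent": {pid},
            "registered": set(), "newsynchproc": set(), "newIconcurrent": set(),
            "checklist": set(), "proceed": True, "drop": False}

def sync_and_consolidate(states):
    """Replay one SYNC;CONSOLIDATE round for every admitted process, under the
    schedule where everyone has reached the barrier."""
    live = [s for s in states.values() if s["proceed"] and not s["drop"]]
    # SYNC, part 1: snapshot registrations, signal completion by bumping `next`
    for s in live:
        s["newsynchproc"] = set(s["registered"])
        s["newIconcurrent"] = s["newsynchproc"] | {s["id"]}
        s["next"] += 1
        s["checklist"] = set()
    # SYNC, part 2: the first do-od loop
    for s in live:
        for j in list(s["Iconcurrent"]):
            if j == s["id"]:
                continue
            if states[j]["drop"]:                  # second guard
                s["Iconcurrent"].discard(j)
            elif states[j]["next"] >= s["next"]:   # first guard
                s["Iconcurrent"].discard(j)
                s["newIconcurrent"] |= states[j]["newsynchproc"] | {j}
                s["checklist"].add(j)
        assert s["Iconcurrent"] == {s["id"]}       # loop postcondition
        s["executing"] += 1
    # the second do-od loop completes trivially here: everyone bumped `executing`
    # CONSOLIDATE
    for s in live:
        s["Iconcurrent"] = set(s["newIconcurrent"])
        s["registered"] -= s["newsynchproc"]
        for j in s["newsynchproc"]:
            states[j].update(executing=s["executing"], next=s["next"],
                             Iconcurrent=set(s["Iconcurrent"]),
                             registered=set(), drop=False, proceed=True)

# clock process "c" and processes 1, 2; process 3 registers with c in phase 0
states = {p: make_proc(p) for p in ("c", 1, 2)}
for s in states.values():
    s["Iconcurrent"] = {"c", 1, 2}
states[3] = make_proc(3)
states[3].update(proceed=False, Iconcurrent=set())
states["c"]["registered"].add(3)

sync_and_consolidate(states)  # phase 0 -> 1: the others learn about 3 via c
sync_and_consolidate(states)  # phase 1 -> 2: 3 executes the phase with everyone
assert states[3]["executing"] == states[1]["executing"] == 2
assert all(3 in states[p]["Iconcurrent"] for p in ("c", 1, 2, 3))
```

The replay is only a sanity check of the data flow; the actual protocol runs the two do-od loops concurrently in each process, with the waiting encoded in the guards.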


Protocol for Process i

INIT_c: (* Initialization of clock process *)
  Ac.executing, Ac.next, Ac.Iconcurrent, Ac.registered, Ac.proceed, Ac.drop := 0, 1, {Ac}, ∅, true, false;

INIT_i: (* Initialization of Ai that performs a registration *)
  Ai.proceed := false;

SYNC: (* CHECK COMPLETION of Ai.executing by other members *)
  Ai.newsynchproc := Ai.registered;
  Ai.newIconcurrent := Ai.newsynchproc + {Ai};
  Ai.next := Ai.next + 1;
  Ai.checklist := ∅;
  do Ai.Iconcurrent ≠ ∅ ∧ Aj ∈ Ai.Iconcurrent ∧ i ≠ j ∧ Ai.next ≤ Aj.next →
       Ai.Iconcurrent := Ai.Iconcurrent − {Aj};
       Ai.newIconcurrent := Ai.newIconcurrent + Aj.newsynchproc + {Aj};
       Ai.checklist := Ai.checklist + {Aj}
  [] Ai.Iconcurrent ≠ ∅ ∧ Aj ∈ Ai.Iconcurrent ∧ Aj.drop →
       Ai.Iconcurrent := Ai.Iconcurrent − {Aj};
  [] Ai.Iconcurrent ≠ ∅ ∧ Aj ∈ Ai.Iconcurrent ∧ Ai.next > Aj.next (* no need to check i ≠ j *) →
       skip;
  od;
  Ai.executing := Ai.executing + 1; (* Set the current phase *)
  do (* Check for completion of phase in other processes *)
     Ai.checklist ≠ ∅ ∧ Aj ∈ Ai.checklist ∧ Ai.executing = Aj.executing →
       Ai.checklist := Ai.checklist − {Aj}
  od;

CONSOLIDATE: (* CONSOLIDATE processes for the next phase *)
  Ai.Iconcurrent := Ai.newIconcurrent;
  Ai.registered := Ai.registered − Ai.newsynchproc;
  for all Aj ∈ Ai.newsynchproc do
    Aj.executing, Aj.next, Aj.Iconcurrent, Aj.registered, Aj.drop := Ai.executing, Ai.next, Ai.Iconcurrent, ∅, false;
    Aj.proceed := true;

DROP: (* Code when process Ai calls c.drop *)
  Ai.proceed := false;
  Ai.drop := true;

Fig. 2. Action of processes in phase synchronization

4 Correctness of the Solution

The proof obligations for synchronization and progress are given below. We have provided the proof in a semi-formal way in the style of [1]; it is available in the longer version of the paper [22]. The symbol '→' denotes leads-to.

– Synchronization: We need to show that the postcondition of SYNC; CONSOLIDATE; (corresponding to barrier synchronization) for processes that have proceed set to true is:

{∀i, j ((Ai.proceed = true ∧ Aj.proceed = true) ⇒ Ai.executing = Aj.executing)}


– Progress

Property 1: The progress for processes already in phase synchronization is given by the following property (k denotes the current phase), which says that if all the processes have completed phase k, then each process moves to a phase greater than k if it does not drop out.

P1: {∀i (Ai.drop = false ∧ ∀j (Aj.drop = false ⇒ Aj.executing ≥ k) → (Ai.executing ≥ k + 1))}

Property 2: The progress for new processes that want to join the phase synchronization is given by the following property, which says that a process that gets registered with a process involved in phase synchronization will also join the phase synchronization.

P2: {∃i ((Ai.proceed = false ∧ ∃j (i ∈ Aj.registered)) → Ai.proceed = true)}

Complexity Analysis: The protocol in its simplest form has a remote message complexity of O(n²), where n is the upper bound on the number of processes that can participate in the barrier synchronization in any phase. This bound can be improved in practice by optimizing how each participating process tests for the completion of a phase. The optimization is briefly explained in the following. When a process Ai checks for completion of the phase in another process, say Aj, and finds that Ai.executing < Aj.executing, it can come out of the do-od loop by copying Aj.newIconcurrent, which has the complete information about the processes participating in the next phase. This optimization has a best case of O(n) messages and is straightforward to embed in the proposed protocol. Note that in any case, at least n messages are always required to propagate the information.
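As a rough back-of-the-envelope count (our own illustration, not from the paper): in the unoptimized protocol each of the n processes inspects the state of every other process at the barrier, giving n(n−1) remote reads per phase, while in the optimized best case one process completes the full check and the rest copy its newIconcurrent, for on the order of n messages.

```python
def naive_remote_reads(n):
    # every process polls the next/executing fields of every other process
    return n * (n - 1)

def optimized_best_case(n):
    # one process completes the check; the others copy its newIconcurrent
    return n

assert naive_remote_reads(10) == 90
assert optimized_best_case(10) == 10
```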

5 Generalization: Multi-clock Phase Synchronization

In this section, we outline the generalization of distributed dynamic barrier synchronization to multiple clocks.
1) There is a special clock process for each clock.
2) The processes maintain protocol variables for each of the clocks that they register with.
3) A process can register with multiple clocks through C.register, where C denotes a set of clocks, as the first operation before starting its phase synchronized computation. The notation C.register denotes that for all clocks c such that c ∈ C, perform c.register. The corresponding code is:

for each c in C
    INIT_i_c;
    A_c.registered := A_c.registered + A_i;
    while (!A_i_c.proceed);

Some important restrictions to avoid deadlock scenarios are: i) C.register, when C contains more than one clock, can only be done by the process that creates the clocks contained in C.


ii) If a process wants to register with a single clock c that is in use for phase synchronization by other processes, it will have to drop all its clocks, and then it can use c.register to synchronize on the desired clock. Note that the clock c need not be re-created.
iii) Subsequent child processes should use clocked to register with any subset of the multiple clocks that the parent is registered with. Combined with (iv) below, this avoids deadlock scenarios of the kind seen with mobile barriers [20].
iv) For synchronization, a process increments the value of Ai.next for each registered clock and then executes the guarded loop for each of the clocks before it can move to the CONSOLIDATE stage.

The SYNC part of the protocol for multiple clocks is very similar to the single-clock case, except for an extra loop that runs the guarded command loop for each of the clocks. A process clocked on multiple clocks causes synchronization of all the processes that are registered with these clocks. This is evident from the second do-od loop in the SYNC part of the barrier synchronization protocol. For example, if a process A1 is synchronizing on c1, A2 on c1 and c2, and A3 on c2, then A1 and A3 also get synchronized as long as A2 does not drop one or both of the clocks. These clocks can thus be thought of as forming a group, and A2 can be thought of as a pivot process. In the following, we state the synchronization guarantees provided by the protocol:
1) A process that synchronizes on multiple clocks can move to the next phase only when all the processes in the group formed by the clocks have also completed their current phase.
2) Two clock groups that do not have a common pivot process but have a common clock may differ in phase by at most one. Note that the difference cannot exceed one, because that would imply improper synchronization between processes clocked on the same clock, which has been shown above to be impossible in our protocol.
3) A new process registered with multiple clocks starts in the next phase (from the perspective of a local observer) w.r.t. each of the clocks individually.
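The group structure induced by shared clocks can be made concrete with a small sketch (our own illustration, not part of the protocol): treat clocks as union-find elements and merge the clocks held by each process; processes whose clocks land in the same component form one synchronization group, linked through pivot processes.

```python
def clock_groups(registrations):
    """registrations: process -> set of clocks. Returns process -> group id."""
    parent = {}

    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    def union(a, b):
        parent[find(a)] = find(b)

    for clocks in registrations.values():
        clocks = list(clocks)
        for c in clocks[1:]:               # a multi-clock process merges its clocks
            union(clocks[0], c)
    return {p: find(next(iter(cs))) for p, cs in registrations.items()}

# The paper's example: A1 on c1, A2 on c1 and c2 (the pivot), A3 on c2.
groups = clock_groups({"A1": {"c1"}, "A2": {"c1", "c2"}, "A3": {"c2"}})
assert groups["A1"] == groups["A2"] == groups["A3"]
```

If A2 drops c2, the merge disappears and A1 and A3 fall into separate groups, matching the informal description above.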

6 Comparing with Other Dynamic Barrier Schemes

The clock syntax resembles that of X10 but differs in the joining policy. An X10 activity that registers with a clock in some phase starts the synchronization from that same phase. The advantage of our dynamic joining policy (starting from the next phase) is that when a process starts a phase, it knows exactly which processes it is synchronizing with in that phase. This makes it simpler to detect the completion of a phase in a distributed set-up. Whether to join in the same phase or the next phase is more a matter of semantics than of expressiveness. If there is a centralized manager process to manage phase synchronization, then the semantics of starting a newly registered activity in the same phase is feasible. However, for a distributed phase synchronization protocol with dynamically joining processes, the semantics of starting from the next phase is more efficient.


The other clock-related works [19], [18] are directed more towards efficient implementations of X10-like clocks than towards synchronization in a distributed setting. Barriers in JCSP [21] and occam-pi [20] do allow processes to dynamically join and resign from barrier synchronization. Because the synchronization is barrier-specific in both JCSP (using barrier.sync()) and occam-pi (using SYNC barrier), it is a burden on the programmer to write a deadlock-free program, which is not the case here, as the use of next achieves synchronization over all registered clocks. JCSP and occam-pi barriers achieve linear-time synchronization due to centralized control of barriers, which is also possible in the optimized version of our protocol. Previous work on barrier implementation has focused on algorithms that work on pre-specified numbers of processes or processors. The butterfly barrier algorithm [9], the dissemination algorithm [10], [5], and the tournament algorithm [5], [4] are some of the earlier algorithms. Most of them emphasized how to reduce the number of messages that need to be exchanged in order to know that all the processes have reached the barrier. Some of the more recent work on barrier algorithms in software is described in [6], [11], [12], [15], [14], [3]. In contrast to this literature, our focus has been on developing algorithms for barrier synchronization where processes dynamically join and drop out; thus, the processes that can be in a barrier synchronization need not be known a priori.

7 Conclusions

In this paper, we have described a solution for distributed dynamic phase synchronization that is shown to satisfy the properties of progress and starvation freedom. To our knowledge, this is the first dynamic distributed multi-processor synchronization algorithm for which the properties of progress and starvation freedom have been established, and for which the dependence of progress on the entry strategies (captured through process registration) has been shown. A future direction is to consider fault tolerance in the context of distributed barrier synchronization for a dynamic number of processes.

References
1. Chandy, K.M., Misra, J.: Parallel Program Design: A Foundation. Addison-Wesley, Reading (1988)
2. Dijkstra, E.W.: Guarded commands, non-determinacy and formal derivation of programs. Communications of the ACM 18(8) (August 1975)
3. Mellor-Crummey, J.M., Scott, M.L.: Algorithms for scalable synchronization on shared-memory multiprocessors. ACM TOCS 9(1), 21–65 (1991)
4. Feldmann, A., Gross, T., O'Hallaron, D., Stricker, T.M.: Subset barrier synchronization on a private-memory parallel system. In: SPAA (1992)
5. Hensgen, D., Finkel, R., Manber, U.: Two algorithms for barrier synchronization. International Journal of Parallel Programming 17(1), 1–17 (1988)
6. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann, San Francisco (2008)


7. Misra, J.: Phase Synchronization. Notes on Unity, 12-90, Univ. of Texas at Austin (1990)
8. Tang, P., Yew, P.C.: Processor self-scheduling for multiple-nested parallel loops. In: Proc. ICPP, pp. 528–535 (August 1986)
9. Brooks III, E.D.: The butterfly barrier. International Journal of Parallel Programming 15(4) (1986)
10. Han, Y., Finkel, R.: An optimal scheme for disseminating information. In: Proc. of the 17th International Conference on Parallel Processing (1988)
11. Scott, M.L., Michael, M.M.: The topological barrier: A synchronization abstraction for regularly-structured parallel applications. Tech. Report TR605, Univ. of Rochester (1996)
12. Gupta, R., Hill, C.R.: A scalable implementation of barrier synchronization using an adaptive combining tree. International Journal of Parallel Programming 18(3) (1990)
13. Livesey, M.: A network model of barrier synchronization algorithms. International Journal of Parallel Programming 20(1) (February 1991)
14. Xu, H., McKinley, P.K., Ni, L.M.: Efficient implementation of barrier synchronization in wormhole-routed hypercube multicomputers. In: Proceedings of the 12th International Conference on …, vol. 9(12) (June 1992)
15. Yang, J.-S., King, C.-T.: Designing tree-based barrier synchronization on 2D mesh networks. IEEE Transactions on Parallel and Distributed Systems 9(6) (1998)
16. Saraswat, V., Jagadeesan, R.: Concurrent clustered programming. In: Abadi, M., de Alfaro, L. (eds.) CONCUR 2005. LNCS, vol. 3653, pp. 353–367. Springer, Heidelberg (2005)
17. Unified Parallel C Language, http://www.gwu.edu/~upc/
18. Shirako, J., Peixotto, D.M., Sarkar, V., Scherer, W.N.: Phasers: a unified deadlock-free construct for collective and point-to-point synchronization. In: ICS, pp. 277–288 (2008)
19. Vasudevan, N., Tardieu, O., Dolby, J., Edwards, S.A.: Compile-time analysis and specialization of clocks in concurrent programs. In: CC, pp. 48–62 (2009)
20. Welch, P., Barnes, F.: Mobile barriers for occam-pi: Semantics, implementation and application. In: Communicating Process Architectures (2005)
21. Welch, P., Brown, N., Moores, J., Chalmers, K., Sputh, B.H.C.: Integrating and extending JCSP. In: CPA, pp. 349–370 (2007)
22. Agarwal, S., Joshi, S., Shyamasundar, R.K.: Distributed Generalized Dynamic Barrier Synchronization. Longer version, http://www.tcs.tifr.res.in/~shyam/Papers/dynamicbarrier.pdf

A High-Level Framework for Distributed Processing of Large-Scale Graphs

Elzbieta Krepska, Thilo Kielmann, Wan Fokkink, and Henri Bal
VU University Amsterdam
{ekr,kielmann,wanf,bal}@cs.vu.nl

Abstract. Distributed processing of real-world graphs is challenging due to their size and the inherent irregular structure of graph computations. We present HipG, a distributed framework that facilitates high-level programming of parallel graph algorithms by expressing them as a hierarchy of distributed computations executed independently and managed by the user. HipG programs are in general short and elegant; they achieve good portability, memory utilization and performance.

1 Introduction

We live in a world of graphs. Some graphs exist physically, for example transportation networks or power grids. Many exist solely in electronic form, for instance the state space of a computer program, the network of Wikipedia entries, or social networks. Graphs such as protein interaction networks in bioinformatics or airplane triangulations in engineering are created by scientists to represent real-world objects and phenomena. With the increasing abundance of large graphs, there is a need for a parallel graph processing language that is easy to use, high-level, and memory- and computation-efficient.

Real-world graphs reach billions of nodes and keep growing: the World Wide Web expands, new proteins are being discovered, and more complex programs need to be verified. Consequently, graphs need to be partitioned between the memories of multiple machines and processed in parallel in such a distributed environment. Real-world graphs tend to be sparse, as, for instance, the number of links in a web page is small compared to the size of the network. This allows for efficient storage of edges with their source nodes. Because of their size, partitioning graphs into balanced fragments with a small number of edges spanning different fragments is hard [1, 2].

Parallelizing graph algorithms is challenging. The computation is typically driven by a node-edge relation in an unstructured graph. Although the degree of parallelism is often considerable, the amount of computation per graph node is generally very small, and the communication overhead immense, especially when many edges span different graph chunks. Given the lack of structure of the computation, the computation is hard to partition and locality suffers [3]. In addition, on a distributed-memory machine good load balancing is hard to obtain, because in general work cannot be migrated (part of the graph would have to be migrated and all workers informed).
M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 155–166, 2011. © Springer-Verlag Berlin Heidelberg 2011

While a few graph libraries exist for sequential graph algorithms, notably the Boost Graph Library [4], no standards have been established for parallel graph algorithms. The current state of the art amongst users wanting to implement parallel graph algorithms


is to either use the generic C++ Parallel Boost Graph Library (PBGL) [5, 6] or, most often, to create ad-hoc implementations, which are usually structured around their communication scheme. Not only does the ad-hoc coding effort have to be repeated for each new algorithm, but it also obscures the original elegant concept. The programmer spends considerable time tuning the communication, which is prone to errors. While this may result in a highly-optimized, problem-tailored implementation, the code can only be maintained or modified with substantial effort.

In this paper we propose HipG, a distributed framework aimed at facilitating implementations of HIerarchical Parallel Graph algorithms that operate on large-scale graphs. HipG offers an interface to perform structure-driven distributed graph computations. Distributed computations are organized into a hierarchy and coordinated by logical objects called synchronizers. The HipG model supports, but is not limited to, creating divide-and-conquer graph algorithms. A HipG parallel program is composed automatically from the sequential-like components provided by the user.

The computational model of HipG, and how it can be used to program graph algorithms, is explained in Section 2, where we present three graph algorithms in increasing order of complexity: reachability search, finding single-source shortest paths, and strongly connected components decomposition. These are well-known algorithms, explained for example in [7]. Although the user must be aware that a HipG program runs in a distributed environment, the code is high-level: explicit communication is not exposed by the API. Parallel composition is done in a way that does not allow race conditions, so that no locks or thread synchronization code are necessary from the user's point of view.
These facts, coupled with the use of an object-oriented language, make for an easy-to-use but expressive language to code hierarchical parallel graph algorithms. We have implemented HipG in the Java language. We discuss this choice as well as details of the implementation in Section 3. Using HipG we have implemented the algorithms presented in Section 2, and we evaluate their performance in Section 4. We processed graphs on the order of 10^9 nodes on our cluster and obtained good performance. The HipG code of the most complex example discussed in this paper, the strongly connected components decomposition, is an order of magnitude shorter than the hand-written C/MPI version of this program and three times shorter than the corresponding implementation in PBGL. See Section 5 for a discussion of the related work in the field of distributed graph processing. HipG's current limitations and future work are discussed in the concluding Section 6.

2 The HipG Model and Programming Interface

The input to a HipG program is a directed graph. HipG partitions the graph into a number of equal-size chunks and divides the chunks between workers that are made responsible for processing the nodes they own. A chunk consists of a number of nodes uniquely identified by pairs (chunk, index). HipG uses the object-oriented paradigm of programming: namely, nodes are objects. Each node has arbitrary data and a number of outgoing edges associated and co-located with it. The target node of an edge is called a neighbor. In the current setup, the graph cannot be modified at runtime, but new graphs can be created.
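A minimal sketch of the (chunk, index) addressing (Python, with hypothetical helper names; the paper does not specify HipG's actual partitioning function): global node ids are divided into equal-size blocks, one chunk per worker.

```python
def to_chunk_index(global_id, num_chunks, num_nodes):
    # block partitioning: equal-size chunks, one per worker (illustrative only)
    chunk_size = (num_nodes + num_chunks - 1) // num_chunks
    return global_id // chunk_size, global_id % chunk_size

def owner(global_id, num_chunks, num_nodes):
    # the worker responsible for processing this node
    return to_chunk_index(global_id, num_chunks, num_nodes)[0]

# 10 nodes over 3 workers: chunks of size 4, 4, 2
assert to_chunk_index(5, 3, 10) == (1, 1)
assert owner(9, 3, 10) == 2
```

Since edges are stored with their source nodes, a worker can iterate over all outgoing edges of its own chunk locally; only calls on nodes owned by other chunks cross the network.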

A High-Level Framework for Distributed Processing of Large-Scale Graphs interface MyNode extends Node { public void visit(); } class MyLocalNode implements MyNode extends LocalNode<MyNode> { boolean visited = false; public void visit() { if (!visited) { visited = true; for (MyNode n : neighbors()) n.visit(); } } }

Fig. 1. Reachability search in HipG


[Figure 2 illustration: node s receives visit() and sends it to its neighbors; the neighbors forward the message to their own neighbors (nodes p, q, t, r).]

Fig. 2. Illustration of the reachability search
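The visit() propagation illustrated in figure 2 can be simulated sequentially with a worklist (Python, illustrative only; HipG itself is Java and delivery is truly asynchronous): each pending visit() call is a message, delivery order is unspecified, and termination corresponds to an empty worklist.

```python
import random

def reachability(graph, start, seed=0):
    """graph: node -> list of neighbors. Returns the set of visited nodes."""
    rng = random.Random(seed)
    visited = set()
    worklist = [start]                     # pending visit() messages
    while worklist:                        # termination: no messages left
        i = rng.randrange(len(worklist))
        node = worklist.pop(i)             # delivery order is unspecified
        if node not in visited:
            visited.add(node)
            worklist.extend(graph[node])   # visit() sent to each neighbor
    return visited

g = {"s": ["p", "q"], "p": ["t"], "q": ["t"], "t": ["r"], "r": [], "x": ["s"]}
assert reachability(g, "s") == {"s", "p", "q", "t", "r"}  # "x" is unreachable
```

The `visited` flag plays the same role as in figure 1: it makes the handler idempotent, so the result is independent of the (random) delivery order.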

Graphs are commonly processed starting at a certain graph node and following the structure of the graph, i.e. the node-edge relationship, until all reached nodes are processed. HipG supports this style of processing by offering a seamless interface for executing methods on local and remote nodes. When necessary, these method calls are automatically translated by HipG into messages. In Section 2.1 we show how such methods can be used to create a distributed graph computation in HipG. More complex algorithms require managing more than one such distributed computation. In particular, the objective of a divide-and-conquer graph algorithm is to divide a computation on a graph into several sub-computations on sub-graphs. HipG enables the creation of sub-algorithms by introducing synchronizers: logical objects that manage distributed computations. The concept and API of a synchronizer are explained further in this section: in Section 2.2 we show how to use a single synchronizer, and in Section 2.3 an entire hierarchy of synchronizers is created to solve a divide-and-conquer graph problem.

2.1 Distributed Computation

HipG makes it possible to implement graph computations with only regular methods executed on graph nodes. Typically, the user initiates the first method, which in turn executes methods on its neighbor nodes. In general, a node can execute methods on any node whose unique identifier is known. To implement a graph computation, the user extends the provided LocalNode class with custom fields and methods. In a local node, neighbor nodes can be accessed with neighbors(), or inNeighbors() for incoming edges. Under the hood, methods executed on remote nodes are automatically translated by HipG into asynchronous messages. On reception of such a message, the appropriate method is executed, which thus acts as a message handler. The order of received messages cannot be predicted. Method parameters are automatically serialized, and we strive to make the serialization efficient. A distributed computation terminates when there are no more messages present in the system, which is detected automatically. Since


E. Krepska et al.

interface MyNode extends Node {
  public void found(SSSP sp, int d);
}

class MyLocalNode extends LocalNode<MyNode> implements MyNode {
  int dist = -1;
  public void found(SSSP sp, int d) {
    if (dist < 0) {
      dist = d;
      sp.Q.add(this);
    }
  }
  public void found0(SSSP sp, int d) {
    for (MyNode n : neighbors())
      n.found(sp, d);
  }
}

class SSSP extends Synchronizer {
  Queue<MyLocalNode> Q = new Queue();
  int localQsize;
  public SSSP(MyLocalNode pivot) {
    if (pivot != null) Q.add(pivot);
    localQsize = Q.size();
  }
  @Reduce public int GlobalQSize(int s) {
    return s + Q.size();
  }
  public void run() {
    int depth = 0;
    do {
      for (int i = 0; i < localQsize; i++)
        Q.pop().found0(this, depth);
      barrier();
      depth++;
      localQsize = Q.size();
    } while (GlobalQSize(0) > 0);
  }
}

Fig. 3. Single-source shortest paths (breadth-first search) implemented in HipG
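Ignoring distribution, Fig. 3 computes distances layer by layer, with barrier() marking the end of a layer. A sequential sketch of the same logic (the helper names are ours):

```java
import java.util.ArrayDeque;
import java.util.Arrays;

public class BfsLayers {
    // Sequential mirror of Fig. 3: the queue holds the current BFS layer;
    // draining exactly layer.size() nodes corresponds to one barrier().
    static int[] bfs(int[][] g, int source) {
        int[] dist = new int[g.length];
        Arrays.fill(dist, -1);
        ArrayDeque<Integer> layer = new ArrayDeque<>();
        layer.add(source);
        dist[source] = 0;
        int depth = 0;
        while (!layer.isEmpty()) {
            int size = layer.size(); // nodes of the current layer only
            depth++;
            for (int i = 0; i < size; i++) {
                int u = layer.poll();
                for (int v : g[u])
                    if (dist[v] < 0) { dist[v] = depth; layer.add(v); }
            }
        }
        return dist;
    }
    public static void main(String[] args) {
        int[][] g = {{1, 2}, {3}, {3}, {4}, {}, {}};
        System.out.println(Arrays.toString(bfs(g, 0))); // [0, 1, 1, 2, 3, -1]
    }
}
```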

messages are asynchronous, returning a value from a method can be realized by sending a message back to the source. Typically, however, a dedicated mechanism, discussed later in this section, is used to compute the result of a distributed computation.

Example: Reachability search. In a directed graph, a node s is reachable from a node t if a path from t to s exists. Reachability search computes the set of nodes reachable from a given pivot. A reachability search implemented in HipG (Fig. 1) consists of an interface MyNode that represents any node and a local node implementation MyLocalNode. The visit() method visits a node and its neighbors (Fig. 2). The algorithm is initiated by pivot.visit(). We note that, were it not for the unpredictable order of method executions, the code for visit() could be understood sequentially. In particular, no locks or synchronization code were needed.

2.2 Coordination of Distributed Computations

A dedicated layer of a HipG algorithm coordinates the distributed computations. Its main building block is a synchronizer, a logical object that manages distributed computations. A synchronizer can initiate a distributed computation and wait for its termination. After a distributed computation has terminated, the synchronizer typically computes global results of the computation by invoking a global reduction operation. For example, the synchronizer may compute the global number of nodes reached by the computation, or a globally elected pivot. Synchronizers can execute distributed computations in parallel or one after another.
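A global reduction such as the queue-size sum must combine per-worker contributions in an order-independent way, since the order in which workers contribute cannot be predicted. A minimal sketch of why a commutative, associative combiner suffices:

```java
public class ReduceDemo {
    // The global queue size is the sum of per-worker queue sizes; because
    // addition is commutative and associative, the result is independent
    // of the order in which the workers' contributions arrive.
    static int reduce(int[] perWorkerSizes) {
        int s = 0;
        for (int size : perWorkerSizes) s += size;
        return s;
    }
    public static void main(String[] args) {
        int[] arrival1 = {3, 0, 7, 2};
        int[] arrival2 = {2, 7, 0, 3}; // same contributions, another order
        System.out.println(reduce(arrival1) + " " + reduce(arrival2)); // 12 12
    }
}
```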

FB(V):
  p = pick a pivot from V
  F = FWD(p)
  B = BWD(p)
  Report (F ∩ B) as SCC
  In parallel:
    FB(F \ B)
    FB(B \ F)
    FB(V \ (F ∪ B))



Fig. 4. FB: a divide-and-conquer algorithm to search for SCCs

To implement a synchronizer, the user subclasses Synchronizer and defines a run() method that, conceptually, will execute sequentially on all processors. Termination detection is provided by barrier(). The reduce methods, annotated @Reduce, must be commutative, as the order in which they are executed cannot be predicted.

Example: Single-source shortest paths. Fig. 3 shows an implementation of a parallel single-source shortest-paths algorithm. For simplicity, each edge has equal weight, so that the algorithm is in fact a breadth-first search [7]. We define an SSSP synchronizer, which owns a queue Q that represents the current layer of graph nodes. The run() method loops over all nodes in the current layer to create the next layer. The barrier blocks until the current layer is entirely processed. GlobalQSize computes the global size of Q by summing the sizes of the queues Q on all processors. The algorithm terminates when all layers have been processed.

2.3 Hierarchical Coordination of Distributed Computations

The key idea of the HipG coordination layer is that synchronizers can spawn any number of sub-synchronizers to solve graph sub-problems. The coordination layer is therefore, in fact, a tree of executing synchronizers, and thus a hierarchy of distributed algorithms. All synchronizers execute independently and in parallel. The order in which synchronizers progress cannot be predicted, unless they are causally related or explicitly synchronized. The user starts a graph algorithm by spawning root synchronizers. The system terminates when all synchronizers terminate. The HipG parallel program is composed automatically from the two components provided by the user, namely node methods (message handlers) and the synchronizer code (coordination layer). Parallel composition is done in a way that does not allow race conditions. No explicit communication or thread synchronization is needed.

Example: Strongly connected components. A strongly connected component (SCC) of a directed graph is a maximal set of nodes S such that there exists a path in S between any pair of nodes in S. In Fig. 4 we describe FB [8], a divide-and-conquer graph algorithm for computing SCCs. FB partitions the problem of finding the SCCs of a set of nodes V into three sub-problems on three disjoint subsets of V. First, an arbitrary pivot node is selected from V. Two sets F and B are computed as the sets of nodes that are, respectively, forward and backward reachable from the pivot. The set F ∩ B is an SCC.



interface MyNode extends Node {
  public void fwd(FB fb, int f, int b);
  public void bwd(FB fb, int f, int b);
}

class MyLocalNode extends LocalNode<MyNode> implements MyNode {
  int labelF = -1, labelB = -1;
  public void fwd(FB fb, int f, int b) {
    if (labelF == fb.ff && (labelB == b || labelB == fb.bb)) {
      labelF = f;
      fb.F.add(this);
      for (MyNode n : neighbors())
        n.fwd(fb, f, b);
    }
  }
  public void bwd(FB fb, int f, int b) {
    if (labelB == fb.bb && (labelF == f || labelF == fb.ff)) {
      labelB = b;
      fb.B.add(this);
      for (MyNode n : inNeighbors())
        n.bwd(fb, f, b);
    }
  }
}

class FB extends Synchronizer {
  Queue<MyLocalNode> V, F, B;
  int ff, bb;
  FB(int f, int b, Queue<MyLocalNode> V0) {
    V = V0;
    F = new Queue();
    B = new Queue();
    ff = f; bb = b;
  }
  @Reduce MyNode SelectPivot(MyNode p) {
    return (p == null && !V.isEmpty()) ? V.pop() : p;
  }
  public void run() {
    MyNode pivot = SelectPivot(null);
    if (pivot == null) return;
    int f = 2 * getId(), b = f + 1;
    if (pivot.isLocal()) {
      pivot.fwd(this, f, b);
      pivot.bwd(this, f, b);
    }
    barrier();
    spawn(new FB(f, bb, F.filterB(b)));
    spawn(new FB(ff, b, B.filterF(f)));
    spawn(new FB(f, b, V.filterFuB(f, b)));
  }
}

Fig. 5. Implementation of the FB algorithm in HipG

All SCCs remaining in V must be entirely contained within F \ B, within B \ F, or within the complement set V \ (F ∪ B). The HipG implementation of the FB algorithm is displayed in Fig. 5. The FB synchronizer creates subsets F and B of V by executing forward and backward reachability searches from a global pivot. Each set is labeled with a unique pair of integers (f, b). FB spawns three sub-synchronizers to solve the sub-problems on F \ B, B \ F and V \ (F ∪ B). We note that the algorithm in Fig. 5 reflects the original elegant algorithm in Fig. 4. The entire HipG program is 113 lines of code, while a corresponding C/MPI application (see Section 4) has over 1700 lines, and the PBGL implementation has 341 lines.
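For reference, the divide-and-conquer structure of Fig. 4 can be sketched sequentially with explicit sets. This is an illustration only, not the distributed HipG code; all names are ours:

```java
import java.util.*;

public class FBSketch {
    // Sequential sketch of the FB algorithm of Fig. 4: pick a pivot,
    // compute the forward/backward reachable sets, report F ∩ B as an
    // SCC, and recurse on the three remaining disjoint subsets.
    static Map<Integer, List<Integer>> fwd = new HashMap<>(), bwd = new HashMap<>();
    static List<Set<Integer>> sccs = new ArrayList<>();

    static Set<Integer> reach(int p, Set<Integer> v, Map<Integer, List<Integer>> adj) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> stack = new ArrayDeque<>(List.of(p));
        while (!stack.isEmpty()) {
            int u = stack.pop();
            if (seen.add(u))
                for (int w : adj.getOrDefault(u, List.of()))
                    if (v.contains(w)) stack.push(w);
        }
        return seen;
    }
    static void fb(Set<Integer> v) {
        if (v.isEmpty()) return;
        int p = v.iterator().next();               // pick a pivot
        Set<Integer> f = reach(p, v, fwd);         // FWD(p)
        Set<Integer> b = reach(p, v, bwd);         // BWD(p)
        Set<Integer> scc = new HashSet<>(f); scc.retainAll(b);
        sccs.add(scc);                             // F ∩ B is an SCC
        Set<Integer> fOnly = new HashSet<>(f); fOnly.removeAll(b);
        Set<Integer> bOnly = new HashSet<>(b); bOnly.removeAll(f);
        Set<Integer> rest = new HashSet<>(v); rest.removeAll(f); rest.removeAll(b);
        fb(fOnly); fb(bOnly); fb(rest);            // three sub-problems
    }
    static void edge(int a, int b) {
        fwd.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        bwd.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }
    public static void main(String[] args) {
        // Two SCCs: the cycle {0,1,2} and the singleton {3}
        edge(0, 1); edge(1, 2); edge(2, 0); edge(2, 3);
        fb(new HashSet<>(List.of(0, 1, 2, 3)));
        System.out.println(sccs.size()); // 2
    }
}
```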

3 Implementation

HipG is designed to execute in a distributed-memory environment. We chose to implement it in Java because of the portability, performance (due to just-in-time compilation) and excellent software support of the language, although Java required us to carefully ensure that memory is utilized efficiently. We used the Ibis [9] message-passing communication library and the Java 6 virtual machine implemented by Sun [10]. Partitioning an input graph into equal-size chunks means that each chunk contains a similar number of nodes and edges (currently, minimization of the number of edges spanning different chunks is not taken into account). Each worker stores one chunk in the form of an array of nodes. Outgoing edges are not stored within the node object, as this would be impractical due to memory overhead (in 64-bit HotSpot this overhead is 16 B per object). As a compromise, nodes are objects but edges are not; rather, they are all stored in a single large integer array. We note that, although this structure is not elegant, it is transparent to the user, unless explicitly requested, e.g. when the program needs to be highly optimized. In addition, as most of a worker's memory is used to store the graph, we tuned the garbage collector to use a relatively small young-generation size (5–10% of the heap size).
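The flat edge array described above can be pictured as a CSR-like structure. The layout below is an assumption for illustration, not HipG's exact internal format:

```java
public class EdgeArray {
    // Assumed CSR-like layout: all outgoing edges live in one flat int
    // array, avoiding the ~16 B per-object overhead that one object per
    // edge would cost in 64-bit HotSpot.
    int[] offsets; // offsets[i]..offsets[i+1] delimit node i's edges
    int[] targets; // flat array of edge targets
    EdgeArray(int[] offsets, int[] targets) {
        this.offsets = offsets;
        this.targets = targets;
    }
    int degree(int node) { return offsets[node + 1] - offsets[node]; }
    int neighbor(int node, int j) { return targets[offsets[node] + j]; }
    public static void main(String[] args) {
        // Node 0 -> {1, 2}, node 1 -> {2}, node 2 -> {}
        EdgeArray g = new EdgeArray(new int[]{0, 2, 3, 3}, new int[]{1, 2, 2});
        System.out.println(g.degree(0) + " " + g.neighbor(1, 0)); // 2 2
    }
}
```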



After reading the graph, a HipG program typically initiates root synchronizers, waits for completion, and handles the computed results. The part of the system that executes synchronizers is referred to as a worker. A worker consists of one main thread that emulates the abstraction of independently executing synchronizers by looping over an array of active synchronizers and making progress with each of them in turn. When all synchronizers have terminated, the worker returns control to the user's main program. We describe the implementation from the synchronizer's point of view. A synchronizer is given a unique identifier, determined on spawn. Each synchronizer can take one of three actions: it communicates while waiting for a distributed routine to finish; it proceeds when the distributed routine is finished; or it terminates. The bulk of a synchronizer's communication consists of messages that correspond to methods executed on graph nodes. Such messages contain the identifiers of the synchronizer, the graph, the node and the executed method, followed by the serialized method parameters. The messages are combined in non-blocking buffers and flushed repeatedly. Besides communicating, synchronizers perform distributed routines. Barriers are implemented with the distributed termination detection algorithm by Safra [11]. When a barrier returns, no messages that belong to the synchronizer are present in the system. The reduce operation is also implemented by token traversal [12], with the result announced to all workers. Before a HipG program can be executed, its Java bytecode has to be instrumented. Besides optimizing object serialization by Ibis [9], the graph program is modified: methods are translated into messages, neighbor access is optimized, and synchronizers are rewritten so that no separate thread is needed for each synchronizer instance. The latter is done by translating the blocking routines into a checkpoint followed by a return. This way a worker can execute a synchronizer's run() method step by step. The instrumentation is part of the provided HipG library, and needs to be invoked before execution; no special Java compiler is necessary.

Release. More implementation details and a GPL release of HipG can be found at http://www.cs.vu.nl/~ekr/hipg.
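The message combining mentioned above can be pictured with a simplified per-destination buffer that is flushed in batches. This is an assumed, simplified sketch, not HipG's actual buffering code:

```java
import java.util.ArrayList;
import java.util.List;

public class MessageBuffer {
    // Simplified sketch: many small node-method messages are accumulated
    // and shipped in bulk, trading a little latency for far fewer, larger
    // network transfers.
    final int capacity;
    final List<int[]> buffer = new ArrayList<>();        // pending messages
    final List<List<int[]>> flushed = new ArrayList<>(); // "sent" batches
    MessageBuffer(int capacity) { this.capacity = capacity; }
    void send(int node, int method) {
        buffer.add(new int[]{node, method});
        if (buffer.size() >= capacity) flush();
    }
    void flush() {
        if (!buffer.isEmpty()) {
            flushed.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
    public static void main(String[] args) {
        MessageBuffer b = new MessageBuffer(3);
        for (int i = 0; i < 7; i++) b.send(i, 0);
        b.flush(); // ship the remainder
        System.out.println(b.flushed.size()); // 3
    }
}
```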

4 Memory Utilization and Performance Evaluation

In this section we report on the results of experiments conducted with HipG. The evaluation was carried out on the VU cluster of the DAS-3 system [13]. The cluster consists of 85 dual-core, dual-CPU 2.4 GHz Opteron compute nodes, each equipped with 4 GB of memory. The processors are interconnected with Myri-10G (MX) and 1G Ethernet links. The time to initialize workers and input graphs was not included in the measurements. All graphs were partitioned randomly, meaning that if a graph is partitioned into p chunks, a graph node is assigned to a chunk with probability 1/p. The portion of remote edges is thus (p−1)/p, which is very high (87–99% in the graphs used) and realistic when modeling an unfavorable partitioning (many edges spanning different chunks). We start with the evaluation of the performance of applications that almost solely communicate (only one synchronizer spawned). Visitor, the reachability search (see Section 2.1), was started at the root node of a large binary tree directed towards the leaves. SSSP, the single-source shortest paths (breadth-first search, see Section 2.2), was started at the root node of the binary tree, and at a random node of a synthetic social



Table 1. Performance of Visitor and SSSP

  Appl.     Workers  Input    Time[s]  Mem[GB]
  Visitor    8       Bin-27    19.1     2.8
  Visitor   16       Bin-28    21.4     2.9
  Visitor   32       Bin-29    24.5     3.1
  Visitor   64       Bin-29    16.9     2.1
  SSSP       8       Bin-27    31.5     2.8
  SSSP      16       Bin-28    38.0     3.0
  SSSP      32       Bin-29    42.5     3.2
  SSSP      64       Bin-29    29.8     2.4
  SSSP       8       LN-80     30.8     1.3
  SSSP      16       LN-160    33.7     1.5
  SSSP      32       LN-320    34.6     1.7
  SSSP      64       LN-640    38.5     2.0

Fig. 6. Speedup of Visitor and SSSP (speedup vs. number of processors, 8–64, with curves for perfect speedup, Visitor, SSSP and SSSP-LN)
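The 87–99% remote-edge figure quoted earlier follows directly from random partitioning: an edge's endpoints land in the same chunk with probability 1/p, so the expected remote fraction is (p−1)/p. A quick check for the worker counts used:

```java
public class RemoteEdges {
    // Expected fraction of remote edges under random partitioning into
    // p chunks: (p - 1)/p of the edges span different chunks.
    static double remoteFraction(int p) {
        return (p - 1.0) / p;
    }
    public static void main(String[] args) {
        for (int p : new int[]{8, 16, 32, 64})
            System.out.println(p + ": " + 100.0 * remoteFraction(p) + "%");
        // p = 8 gives 87.5%; p = 64 gives ~98.4%, matching the 87-99% range
    }
}
```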

network. The results are presented in Tab. 1 and Fig. 6. We tested both applications on 8–64 processors on Myrinet. To obtain fairer results, rather than keeping the problem size constant and dividing the input into more chunks, we doubled the problem size when doubling the number of processors (Tab. 1; the exception is Bin-30, which should have been run on 64 processors but did not fit in memory). Thanks to this we avoid spurious improvement due to better cache behavior and keep the heap filled, but also avoid the many small messages that occur when the stored portion of a graph is small. We normalized the results for the speedup computation (Fig. 6). We used binary trees, Bin-n, of height n = 27..29, which have 0.27–1.0·10^9 nodes and edges. The LN-n graphs are random directed graphs with node degrees sampled from the log-normal distribution ln N(4, 1.3), aimed to resemble real-world social networks [14, 15]. An LN-n graph has n·10^5 nodes and 1.27n·10^6 edges. We used LN-n graphs of size n = 80..640, and thus up to 64·10^6 nodes and 8·10^8 edges. In each experiment, all edges of the input graphs were visited. Both applications achieved about 60% efficiency on a binary-tree graph on 64 processors, which is satisfactory for an application with little computation, O(n), compared to O(n) communication. The efficiency achieved by SSSP on LN-n graphs reaches almost 80%, as the input is more randomized and has a small diameter compared to a binary tree, which reduces the number of barriers performed.

To evaluate the performance of hierarchical graph algorithms written in HipG, we ran the OBFR-MP algorithm, which decomposes a graph into SCCs [16]. OBFR-MP is a divide-and-conquer algorithm like FB [8] (see Section 2.3), but processes the graph in layers. We compared the performance of OBFR-MP implemented in HipG against a highly optimized C/MPI version of this program used for performance evaluation in [16] and kindly provided to us by the authors. The HipG version was implemented to maximally resemble the C/MPI version: the data structures used and the messages sent are the same. Here, we are not interested in the speedup of the MPI implementation of OBFR-MP, over which we have no influence. Rather, we want to see the difference in performance between an optimized C/MPI version and the HipG version of the same application. In general, we found that the HipG version was substantially faster when compared with MPI implementations that used sockets. The detailed results are presented in Tab. 2. We used two different implementations of MPI over Myrinet: the MPICH-MX implementation provided by Myricom that directly accesses



Table 2. Performance comparison of the OBFR-MP SCC-decomposition algorithm tested on three LmLmTn graphs. OM (OpenMPI) and P4 are socket-based MPI implementations, while the MX MPI implementation directly uses the Myrinet interface. Time is given in seconds.

  L487L487T5:
    p     MX      OM      HipG(Myri)   P4       HipG(Eth)
    4     36.6    141.4    41.1         94.8     45.7
    8     26.6     81.6    22.1         82.5     30.0
    16    96.5     60.5    48.4        179.0     37.0
    32    40.0     57.3    39.1        163.4     41.0
    64    24.1     46.7    24.4        234.6     41.8

  L10L10T16:
    p     MX      OM      HipG(Myri)   P4       HipG(Eth)
    4     69      255     148          302      225
    8     73      280     226          462      330
    16    89      376     315          804      506
    32    136     661     485         1794      851
    64    128     646     277         1659      461

  L60L60T11:
    p     MX      OM      HipG(Myri)   P4       HipG(Eth)
    4     45.1    152.9    47.3        110.8     98.8
    8     34.5     99.8    46.8        111.5    116.0
    16    37.1    128.6    60.4        216.2    125.9
    32    30.1     82.0    57.4        214.7    171.8
    64    32.0    108.8    66.1        311.4    141.2

the interface, and OpenMPI, which goes through TCP sockets. On Ethernet we used the standard MPI implementation (P4). We tested OBFR-MP on synthetic graphs called LmLmTn, which are in essence trees of height n of SCCs, such that each SCC is a lattice of (m+1) × (m+1) nodes. An LmLmTn graph thus has (2^(n+1) − 1) SCCs, each of size (m+1)^2. The performance of the OBFR-MP algorithm strongly depends on the SCC structure of the input graph. We used three graphs: one with a small number of large SCCs, L487L487T5; one with a large number of small SCCs, L10L10T16; and one that balances the number of SCCs and their size, L60L60T11. Each graph contains a little over 15·10^6 nodes and 45·10^6 edges. The performance of the C/MPI application running over MX is the fastest, as it has the smallest software stack. The OpenMPI and P4 MPI implementations offer a more realistic comparison, as they use a deeper software stack (sockets) like HipG: HipG ran on average 2.2 times faster than the C/MPI version in this case. Most importantly, the speedup or slowdown of HipG follows the speedup or slowdown of the C/MPI application run over MX, which suggests that the overhead of HipG will not explode with further scaling of the application. The communication pattern of many graph algorithms is intensive all-to-all communication. Generally, message sizes decrease as the number of processors increases. Good performance results from balancing the size of flushed messages and the frequency of flushing: too many flushes decrease performance, while too few flushes cause other processors to stall. Throughput on 32 processors over MX for the Visitor application on Bin-29 is constant (not shown): the application sends 16 GB in 24 s. A worker's memory is divided between the graph, the communication buffers and the memory allocated by the user's code in synchronizers. On a 64-bit machine, a graph node uses 80 B in Visitor and on average 1 KB in SSSP, including the edges and all overhead. Tab. 1 presents the maximum heap size used by a Visitor/SSSP worker. As expected, it remains almost constant. SSSP uses more memory than Visitor, because it stores a queue of nodes (see Section 2.2). The results in this section do not aim to prove that we obtained the most efficient implementations of the Visitor, SSSP or OBFR-MP algorithms. When processing large-scale graphs, the speedup is of secondary importance; it is of primary importance to be able to store the graph in memory and process it in acceptable time. We aimed to show that large-scale graphs can be handled by HipG and that satisfactory performance can be obtained with little coding effort, even for complex hierarchical graph algorithms.
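As a sanity check on the LmLmTn graph sizes: (2^(n+1) − 1) SCCs of (m+1)^2 nodes each indeed give a little over 15·10^6 nodes for all three graphs used:

```java
public class LmTnSize {
    // Node count of an LmLmTn graph: (2^(n+1) - 1) SCCs, each a lattice
    // of (m+1) x (m+1) nodes.
    static long nodes(int m, int n) {
        return ((1L << (n + 1)) - 1) * (m + 1) * (m + 1);
    }
    public static void main(String[] args) {
        System.out.println(nodes(487, 5));  // 15003072
        System.out.println(nodes(10, 16));  // 15859591
        System.out.println(nodes(60, 11));  // 15237495
    }
}
```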



5 Related Work

HipG is a distributed framework aimed at providing users with a way to code, with little effort, parallel algorithms that operate on partitioned graphs. An analysis of other platforms suitable for the execution of graph algorithms is provided in an inspiring paper by Lumsdaine et al. [3], which in fact advocates using massively multithreaded shared-memory machines for this purpose. However, such machines are very expensive and software support is lacking [3]. The library in [17] realizes this concept on a Cray machine. Another interesting alternative would be to use partitioned global address space languages like UPC [18], X10 [19] or ZPL [20], but we are not aware of support for graph algorithms in these languages, except for a shared-memory solution [21] based on X10 and Cilk. The Bulk Synchronous Parallel (BSP) model of computation [22] alternates work and communication phases. We know of two BSP-based libraries that support the development of distributed graph algorithms: CGMgraph and Pregel. CGMgraph [23] uses the unified communication API and parallel routines offered by CGMlib, which is conceptually close to MPI [24]. In Google's Pregel [15] the graph program is a series of supersteps. In each superstep the Compute(messages) method, implemented by the user, is executed in parallel on all vertices. The system supports fault tolerance based on heartbeats and checkpointing. Impressively, Pregel is reported to be able to handle billions of nodes and use hundreds of workers. Unfortunately, it is not available for download. Pregel is similar to HipG in two aspects: the vertex-centered programming and the automatic composition of the parallel program from user-provided simple sequential-like components. However, the repeated global synchronization phase in the Bulk Synchronous Parallel model, although suitable for many applications, is not always desirable. HipG is fundamentally different from BSP in this respect, as it uses asynchronous messages with computation synchronized on the user's request. Notably, HipG can simulate the BSP model, as we did in the SSSP application (Section 2.2).

The prominent sequential Boost Graph Library (BGL) [4] gave rise to a parallelization that adopts a different approach to graph algorithms. Parallel BGL [5,6] is a generic C++ library that implements distributed graph data structures and graph algorithms. The main focus is to reuse existing sequential algorithms, applying them to distributed data structures to obtain parallel algorithms. PBGL supports a rich set of parallel graph implementations and property maps. The system keeps information about ghost (remote) vertices, although that works well only if the number of edges spanning different processors is small. Parallel BGL offers a very general model, while both Pregel and HipG trade expressiveness (for example, neither offers any form of remote read) for more predictable performance. ParGraph [25] is another parallelization of BGL, similar to PBGL but less developed; it does not seem to be maintained. We are not aware of any work directly supporting the development of divide-and-conquer graph algorithms.

To store graphs we used the SVC-II distributed graph format advocated in [26]. Graph formats are standardized only within selected communities. In the case of large graphs, binary formats are typically preferable to text-based formats, as compression is not needed. See [26] for a comparison of a number of formats used in the formal-methods community. A popular text format is XML, which is used for example to store Wikipedia [27]. RDF [28] is used to represent semantic graphs in the form of



triples (source, edge, target). By contrast, in bioinformatics, graphs are stored in many databases, and integrating them is ongoing research [29].

6 Conclusions and Future Work

In this paper we described HipG, a model and a distributed framework that allows users to code, with little effort, hierarchical parallel graph algorithms. The parallel program is automatically composed of sequential-like components provided by the user: node methods and synchronizers, which coordinate distributed computations. We realized the model in Java and obtained short and elegant implementations of several published graph algorithms, good memory utilization and performance, as well as out-of-the-box portability.

Fault tolerance has not been implemented in the current version of HipG, as the programs that we have executed so far ran on a cluster and were not mission-critical. A solution using checkpointing could be implemented in which, when a machine fails, a new machine is requested and the entire computation is restarted from the last checkpoint. Such a solution is standard and similar to the one used in [15]. Creating a checkpoint takes somewhat more effort because of the lack of global synchronization phases in HipG. Creating a consistent image of the state space could be done either by freezing the entire computation or by running a distributed snapshot algorithm in the background, such as the one by Lai and Yang [12]. A distributed snapshot imposes overhead on messages, which however can be minimized when message combining is used, as is the case in HipG.

HipG is work in progress. We would like to improve speedup by using better graph-partitioning methods, e.g. [1]. If needed, we could implement graph modification at runtime, although in all cases that we looked at, this could be solved by creating new graphs during execution, which is possible in HipG. We are currently working on providing tailored support for multicore processors and on extending the framework to execute on a grid. Currently the size of the graph that can be handled is limited by the amount of memory available. Therefore, we are interested in whether a portion of a graph could be temporarily stored on disk without completely sacrificing efficiency [30].

Acknowledgments. We thank Jaco van de Pol, who initiated this work and provided C code, and Ceriel Jacobs for helping with the implementation.

References

1. Karypis, G., Kumar, V.: A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. J. of Par. and Distr. Computing 48(1), 71–95 (1998)
2. Feige, U., Krauthgamer, R.: A polylog approximation of the minimum bisection. SIAM Review 48(1), 99–130 (2006)
3. Lumsdaine, A., Gregor, D., Hendrickson, B., Berry, J.: Challenges in parallel graph processing. PPL 17(1), 5–20 (2007)
4. Siek, J., Lee, L.-Q., Lumsdaine, A.: The Boost Graph Library. Addison-Wesley, Reading (2002)
5. Gregor, D., Lumsdaine, A.: The parallel BGL: A generic library for distributed graph computations. In: Parallel Object-Oriented Scientific Computing (2005)



6. Gregor, D., Lumsdaine, A.: Lifting sequential graph algorithms for distributed-memory parallel computation. OOPSLA 40(10), 423–437 (2005)
7. Cormen, T., Leiserson, C., Rivest, R.: Introduction to Algorithms. MIT Press, Cambridge (1990)
8. Fleischer, L., Hendrickson, B., Pinar, A.: On identifying strongly connected components in parallel. In: Rolim, J.D.P. (ed.) IPDPS-WS 2000. LNCS, vol. 1800, pp. 505–511. Springer, Heidelberg (2000)
9. Bal, H.E., Maassen, J., van Nieuwpoort, R., Drost, N., Kemp, R., van Kessel, T., Palmer, N., Wrzesińska, G., Kielmann, T., van Reeuwijk, K., Seinstra, F., Jacobs, C., Verstoep, K.: Real-world distributed computing with Ibis. IEEE Computer 43(8), 54–62 (2010)
10. The Java SE HotSpot virtual machine. java.sun.com/products/hotspot
11. Dijkstra, E.: Shmuel Safra's version of termination detection. Circulated privately (January 1987)
12. Tel, G.: Introduction to Distributed Algorithms. Cambridge University Press, Cambridge (2000)
13. Distributed ASCI Supercomputer DAS-3, www.cs.vu.nl/das3
14. Pennock, D.M., Flake, G.W., Lawrence, S., Glover, E.J., Giles, C.L.: Winners don't take all: Characterizing the competition for links on the web. PNAS 99(8), 5207–5211 (2002)
15. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: SIGMOD, pp. 135–146 (2010)
16. Barnat, J., Chaloupka, J., van de Pol, J.: Improved distributed algorithms for SCC decomposition. In: PDMC 2007. ENTCS, vol. 198(1), pp. 63–77 (2008)
17. Berry, J., Hendrickson, B., Kahan, S., Konecny, P.: Graph software development and performance on the MTA-2 and Eldorado. At the 48th Cray Users Group meeting (2006)
18. Coarfa, C., et al.: An evaluation of global address space languages: Co-array Fortran and Unified Parallel C. In: PPoPP 2005, pp. 36–47. ACM, New York (2005)
19. Charles, P., et al.: X10: An object-oriented approach to non-uniform cluster computing. In: OOPSLA, pp. 519–538. ACM, New York (2005)
20. Chamberlain, B.L., Choi, S.-E., Lewis, E.C., Snyder, L., Weathersby, W.D., Lin, C.: The case for high-level parallel programming in ZPL. IEEE Comput. Sci. Eng. 5(3), 76–86 (1998)
21. Cong, G., Kodali, S., Krishnamoorthy, S., Lea, D., Saraswat, V., Wen, T.: Solving large, irregular graph problems using adaptive work-stealing. In: ICPP, pp. 536–545. IEEE, Los Alamitos (2008)
22. Valiant, L.: A bridging model for parallel computation. Comm. ACM 33(8), 103–111 (1990)
23. Chan, A., Dehne, F., Taylor, R.: CGMgraph/CGMlib: Implementing and testing CGM graph algorithms on PC clusters and shared memory machines. J. of HPC App. 19(1), 81–97 (2005)
24. MPI Forum: MPI: A message passing interface. J. of Supercomputer Appl. 8(3/4), 169–416 (1994)
25. Hielscher, F., Gottschling, P.: ParGraph library. pargraph.sourceforge.net (2004)
26. Blom, S., van Langevelde, I., Lisser, B.: Compressed and distributed file formats for labeled transition systems. In: PDMC 2003. ENTCS, vol. 89, pp. 68–83 (2003)
27. Denoyer, L., Gallinari, P.: The Wikipedia XML corpus. SIGIR Forum 40(1), 64–69 (2006)
28. Resource description framework, http://www.w3.org/RDF
29. Joyce, A.R., Palsson, B.O.: The model organism as a system: Integrating 'omics' data sets. Nat. Rev. Mol. Cell. Biol. 7(3), 198–210 (2006)
30. Hammer, M., Weber, M.: To store or not to store reloaded: Reclaiming memory on demand. In: Brim, L., Haverkort, B.R., Leucker, M., van de Pol, J. (eds.) FMICS 2006 and PDMC 2006. LNCS, vol. 4346, pp. 51–66. Springer, Heidelberg (2007)

Affinity Driven Distributed Scheduling Algorithm for Parallel Computations

Ankur Narang¹, Abhinav Srivastava¹, Naga Praveen Kumar¹, and Rudrapatna K. Shyamasundar²

¹ IBM Research - India, New Delhi
² Tata Institute of Fundamental Research, Mumbai

Abstract. With the advent of many-core architectures, efficient scheduling of parallel computations for higher productivity and performance has become very important. Distributed scheduling of parallel computations on multiple places¹ needs to follow affinity and deliver efficient space, time and message complexity. Simultaneous consideration of these factors makes affinity-driven distributed scheduling particularly challenging. In this paper, we address this challenge by using a low time- and message-complexity mechanism for ensuring affinity and a randomized work-stealing mechanism within places for load balancing. This paper presents an online algorithm for affinity-driven distributed scheduling of multi-place² parallel computations. Theoretical analysis of the expected and probabilistic lower and upper bounds on the time and message complexity of this algorithm is provided. On well-known benchmarks, our algorithm demonstrates 16% to 30% performance gain as compared to Cilk [6] on the multi-core Intel Xeon 5570 architecture. Further, detailed experimental analysis shows the scalability of our algorithm along with efficient space utilization. To the best of our knowledge, this is the first time an affinity-driven distributed scheduling algorithm has been designed and theoretically analyzed in a multi-place setup for many-core architectures.

1 Introduction

The exascale computing roadmap has highlighted efficient locality oriented scheduling in runtime systems as one of the most important challenges (the "Concurrency and Locality" Challenge [10]). Massively parallel many-core architectures have NUMA characteristics in memory behavior, with a large gap between local and remote memory latency. Unless efficiently exploited, this is detrimental to scalable performance. Languages such as X10 [9], Chapel [8] and Fortress [4] are based on the partitioned global address space (PGAS [11]) paradigm. They have been designed and implemented as part of the DARPA HPCS program (www.highproductivity.org) for higher productivity and performance on many-core massively parallel platforms. These languages have in-built support for initial placement of threads (also referred to as activities) and data structures in the parallel program. Throughout, a place is a group of processors with shared memory, and multi-place refers to a group of places (for example, with each place an SMP, multi-place refers to a cluster of SMPs).

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 167–178, 2011. © Springer-Verlag Berlin Heidelberg 2011


Therefore, locality comes implicitly with the program. The run-time systems of these languages need to provide efficient algorithmic scheduling of parallel computations with medium to fine grained parallelism. For handling large parallel computations, the scheduling algorithm (in the run-time system) should be designed to work in a distributed fashion. This is also imperative to get scalable performance on many core architectures. Further, the execution of the parallel computation happens in the form of a dynamically unfolding execution graph. It is difficult for the compiler to always correctly predict the structure of this graph and hence perform correct scheduling and optimizations, especially for data-dependent computations. Therefore, in order to schedule generic parallel computations and also to exploit runtime execution and data access patterns, the scheduling should happen in an online fashion. Moreover, in order to mitigate the communication overheads in scheduling and the parallel computation, it is essential to follow affinity inherent in the computation. Simultaneous consideration of these factors along with low time and message complexity, makes distributed scheduling a very challenging problem. In this paper, we address the following affinity driven distributed scheduling problem. Given: (a) An input computation DAG (Fig. 1) that represents a parallel multithreaded computation with fine to medium grained parallelism. Each node in the DAG is a basic operation such as and/or/add etc. and is annotated with a place identifier which denotes where that node should be executed. Each edge in the DAG represents one of the following: (i) spawn of a new thread or, (ii) sequential flow of execution or, (iii) synchronization dependency between two nodes. The DAG is a strict parallel computation DAG (synchronization dependency edge represents an activity waiting for the completion of a descendant activity, details in section 3); (b) A cluster of n SMPs (refer Fig. 
2) as the target architecture on which to schedule the computation DAG. Each SMP (Symmetric MultiProcessor: a group of processors with shared memory), also referred to as a place, has a fixed number (m) of processors and memory. The cluster of SMPs is referred to as the multi-place setup. Determine: An online schedule for the nodes of the computation DAG in a distributed fashion that ensures the following: (a) exact mapping of nodes onto places as specified in the input DAG; (b) low space, time and message complexity of execution.
In this paper, we present the design of a novel affinity driven, online, distributed scheduling algorithm with low time and message complexity. The algorithm assumes initial placement annotations on the given parallel computation with consideration of load balance across the places, and it controls the online expansion of the computation DAG. It employs an efficient remote spawn mechanism across places for ensuring affinity; randomized work stealing within a place helps in load balancing. Our main contributions are:
– We present a novel affinity driven, online, distributed scheduling algorithm, designed for strict multi-place parallel computations.
– Using theoretical analysis, we prove that the lower bound of the expected execution time is O(max_k T_1^k/m + T_{∞,n}) and the upper bound is O(Σ_k (T_1^k/m + T_∞^k)), where k is a variable that denotes places from 1 to n, m denotes the number of processors per place, T_1^k denotes the execution time on a single processor for place k, and T_{∞,n} denotes the execution time of the computation on n places with infinite processors on each place. Expected and probabilistic lower and upper bounds for the message complexity are also provided.
– On well-known parallel benchmarks (Heat, Molecular Dynamics and Conjugate Gradient), we demonstrate performance gains of around 16% to 30% over Cilk on multi-core architectures. Detailed analysis shows the scalability of our algorithm as well as efficient space utilization.

2 Related Work

Scheduling of dynamically created tasks for shared-memory multi-processors has been a well-studied problem. The work on Cilk [6] promoted the strategy of randomized work stealing. Here, a processor that has no work (thief) randomly steals work from another processor (victim) in the system. [6] proved efficient bounds on space (O(P · S_1)) and time (O(T_1/P + T_∞)) for scheduling of fully-strict computations (synchronization dependency edges go from a thread to only its immediate parent thread, section 3) on an SMP platform; here P is the number of processors, T_1 and S_1 are the time and space for sequential execution respectively, and T_∞ is the execution time on infinite processors. We consider locality oriented scheduling in distributed environments and hence are more general than Cilk. The importance of data locality for scheduling threads motivated work stealing with data locality [1], wherein the data locality was discovered on the fly and maintained as the computation progressed. That work also explored initial placement for scheduling and provided experimental results to show the usefulness of the approach; however, affinity was not always followed, the scope of the algorithm was limited to SMP environments, and its time complexity was not analyzed. [5] analyzed the time complexity (O(T_1/P + T_∞)) for scheduling general parallel computations on SMP platforms but did not consider locality oriented scheduling. We consider the distributed scheduling problem across multiple places (a cluster of SMPs) while ensuring affinity, and also provide time and message complexity bounds. [7] considers work-stealing algorithms in a distributed-memory environment, with adaptive parallelism and fault-tolerance. There, task migration was entirely pull-based (via a randomized work stealing algorithm), hence it ignored affinity and did not provide any formal proof of the resource utilization properties.
The work in [2] described a multi-place (distributed) deployment for parallel computations for which an initial-placement-based scheduling strategy is appropriate. A multi-place deployment has multiple places connected by an interconnection network, where each place has multiple processors connected as in an SMP platform. It showed that online greedy scheduling of multi-threaded computations may lead to physical deadlock in the presence of bounded space and communication resources per place. However, the computation did not always respect affinity, and no time or communication bounds were provided. Moreover, the aspect of load balancing was not addressed even within a place. We ensure affinity along with intra-place load balancing in a multi-place setup, and we show empirically that our algorithm has efficient space utilization as well.


3 System and Computation Model

The system on which the computation DAG is scheduled is assumed to be a cluster of SMPs connected by an Active Message Network (Fig. 2). Each SMP is a group of processors with shared memory, and is also referred to as a place in the paper. Active Messages (AM, as defined by AM-2: http://now.cs.berkeley.edu/AM/active_messages.html) is a low-level, lightweight RPC (remote procedure call) mechanism that supports unordered, reliable delivery of matched request/reply messages. We assume that there are n places and each place has m processors (also referred to as workers). The parallel computation to be dynamically scheduled on the system is assumed to be specified by the programmer in languages such as X10 and Chapel. To describe our distributed scheduling algorithm, we assume that the parallel computation has a DAG (directed acyclic graph) structure and consists of nodes that represent basic operations like and, or, not, add and so forth. The edges between the nodes in the computation DAG (Fig. 1) represent creation of new activities (spawn edge), sequential execution flow between nodes within a thread/activity (continue edge), and synchronization dependencies (dependence edge) between nodes. In the paper we refer to the parallel computation to be scheduled as the computation DAG. At a higher level, the parallel computation can also be viewed as a computation tree of activities. Each activity is a thread (as in multi-threaded programs) of execution and consists of a set of nodes (basic operations). Each activity is assigned to a specific place (the affinity specified by the programmer). Hence, such a computation is called a multi-place computation and its DAG is referred to as a place-annotated computation DAG (Fig. 1: v1..v20 denote nodes, T1..T6 denote activities and P1..P3 denote places).
Based on the structure of the dependencies between the nodes in the computation DAG, there can be the following types of parallel computations: (a) Fully-strict computation: dependencies are only between the nodes of a thread and the nodes of its immediate parent thread; (b) Strict computation: dependencies are only between the nodes of a thread and the nodes of any of its ancestor threads; (c) Terminally strict computation (Fig. 1): dependencies arise due to an activity waiting for the completion of its descendants. Every dependence edge therefore goes from the last instruction of an activity to one of its ancestor activities, with the following restriction: in a subtree rooted at an activity Γr, if there exists a dependence edge from any activity in the subtree to the root activity Γr, then there cannot exist any dependence edge from the activities in the subtree to the ancestors of Γr.
The following notations are used in the paper. P = {P_1, · · · , P_n} denotes the set of places. {W_i^1, W_i^2, .., W_i^m} denotes the set of workers at place P_i. S_1 denotes the space required by a single-processor execution schedule. S_max denotes the size in bytes of the largest activation frame in the computation. D_max denotes the maximum depth of the computation tree in terms of number of activities. T_{∞,n} denotes the execution time of the computation DAG over n places with infinite processors at each place. T_∞^k denotes the execution time for the activities assigned to place P_k using infinite processors; note that T_{∞,n} ≤ Σ_{1≤k≤n} T_∞^k. T_1^k denotes the time taken by a single processor for the activities assigned to place k.
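The activity-tree view and the strictness conditions above can be made concrete in a small sketch (the Activity class and its field names are ours, for illustration only; the example tree mirrors Fig. 1):

```python
# Minimal sketch of a place-annotated activity tree (illustrative names,
# not the paper's implementation).

class Activity:
    def __init__(self, name, place, parent=None):
        self.name = name
        self.place = place          # affinity: the place this activity must run on
        self.parent = parent
        self.children = []
        if parent:
            parent.children.append(self)

    def ancestors(self):
        a = self.parent
        while a:
            yield a
            a = a.parent

def is_strict(dep_edges):
    """Strictness: every dependence edge goes from an activity to an ancestor."""
    return all(dst in src.ancestors() for src, dst in dep_edges)

# Example shaped like Fig. 1: T1 spawns T2 @ P2 and T6 @ P1;
# T2 spawns T3 @ P3, T4 @ P3 and T5 @ P2.
t1 = Activity("T1", 1)
t2 = Activity("T2", 2, t1); t6 = Activity("T6", 1, t1)
t3 = Activity("T3", 3, t2); t4 = Activity("T4", 3, t2); t5 = Activity("T5", 2, t2)

deps = [(t3, t2), (t5, t1)]   # descendants signalling ancestors
print(is_strict(deps))        # True: both edges end at an ancestor
```

An edge to a sibling branch, e.g. from T3 to T6, would make `is_strict` return False, since T6 is not an ancestor of T3.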

[Figure] Fig. 1. Place-annotated computation DAG: nodes v1..v20 grouped into activities T1 @ P1, T2 @ P2, T3 @ P3, T4 @ P3, T5 @ P2 and T6 @ P1, connected by spawn, continue and dependence edges.

[Figure] Fig. 2. Multiple places: a cluster of SMPs connected by an interconnect (Active Message Network); each SMP node contains processors (PE), an L2 cache, a system bus and memory.

4 Distributed Scheduling Algorithm

Consider a strict place-annotated computation DAG. The distributed scheduling algorithm described below schedules activities with affinity at only their respective places. Within a place, work stealing is enabled to allow load-balanced execution of the computation sub-graph associated with that place. The computation DAG unfolds in an online fashion in a breadth-first manner across places as the affinity driven activities are pushed onto their respective remote places. For space efficiency, before a place-annotated activity is pushed onto a place, the remote place buffer (FAB, see below) is checked for space utilization. If the space utilization of the remote FAB is high, then the push gets delayed for a limited amount of time. This helps in an appropriate space-time tradeoff for the execution of the parallel computation. Within a place, the online unfolding of the computation DAG happens in a depth-first manner to enable space- and time-efficient execution. Sufficient space is assumed to exist at each place, so physical deadlocks due to lack of space cannot happen in this algorithm. Each place maintains a Fresh Activity Buffer (FAB), which is managed by a dedicated processor (distinct from the workers) at that place. An activity that has affinity for a remote place is pushed into the FAB at that place. Each worker at a place has a Ready Deque and a Stall Buffer (refer to Fig. 3). The Ready Deque of a processor contains the activities of the parallel computation that are ready to execute. The Stall Buffer contains the activities that have been stalled due to dependencies on other activities in the parallel computation. The FAB at each place as well as the Ready Deque at each worker use a concurrent deque implementation. An idle worker at a place attempts to randomly steal work from other workers at the same place (randomized work stealing).
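These per-place structures and the stealing discipline can be sketched as follows. This is a minimal single-threaded model with illustrative names: a plain deque-based FAB and per-worker Ready Deques, not the concurrent-deque and active-message machinery an actual runtime would use:

```python
import random
from collections import deque

class Place:
    """One SMP: a FAB for remotely spawned activities plus per-worker Ready Deques."""
    def __init__(self, pid, workers, fab_threshold=1024):
        self.pid = pid
        self.fab = deque()                      # Fresh Activity Buffer
        self.fab_threshold = fab_threshold      # throttles remote pushes
        self.ready = [deque() for _ in range(workers)]  # per-worker Ready Deques

    def try_remote_spawn(self, activity):
        """Accept AM(activity) only if FAB utilization is below the threshold;
        on failure the sender waits delta_t and retries."""
        if len(self.fab) >= self.fab_threshold:
            return False
        self.fab.append(activity)
        return True

    def next_work(self, w):
        """Worker w: own deque bottom first (depth-first locally); otherwise
        steal from a random victim's top, trying the FAB after each failed steal."""
        if self.ready[w]:
            return self.ready[w].pop()          # bottom of own deque
        victims = [v for v in range(len(self.ready)) if v != w]
        random.shuffle(victims)
        for v in victims:
            if self.ready[v]:
                return self.ready[v].popleft()  # top of victim's deque
            if self.fab:                        # failed steal -> try the FAB
                return self.fab.popleft()
        return self.fab.popleft() if self.fab else None
```

For example, with two workers, `p = Place(0, workers=2)`, a remotely spawned activity accepted via `try_remote_spawn` is eventually picked up by an idle worker through `next_work` after its intra-place steal attempt fails.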
Note that an activity which is pushed onto a place can move between workers at that place (due to work stealing) but cannot move to another place, and thus obeys affinity at all times. The distributed scheduling algorithm is given below. At any step, an activity A at the rth worker (at place i), W_i^r, may perform the following actions: 1. Spawn: (a) A spawns activity B at place P_j, i ≠ j: A sends AM(B) (the active message for B) to the remote place. If the space utilization of FAB(j) is below a given threshold, then AM(B) is successfully inserted in FAB(j) (at P_j) and A continues

[Figure] Fig. 3. Affinity Driven Distributed Scheduling Algorithm: each place has a FAB managed by a dedicated processor, and a Ready Deque and Stall Buffer per worker; a remote spawn sends a request (AM(B)) from place i to place j and receives a spawn accept.
execution. Else, this worker waits for a limited time, δ_t, before retrying the spawn of activity B on place P_j (Fig. 3). (b) A spawns B locally: B is successfully created and starts execution, whereas A is pushed onto the bottom of the Ready Deque.
2. Terminates (A terminates): The worker at place P_i, W_i^r, where A terminated, picks an activity from the bottom of its Ready Deque for execution. If none is available in its Ready Deque, it steals from the top of other workers' Ready Deques. Each failed attempt to steal from another worker's Ready Deque is followed by an attempt to get the topmost activity from the FAB at that place. If there is no activity in the FAB, then another victim worker is chosen from the same place.
3. Stalls (A stalls): An activity may stall due to dependencies, in which case it is put in the Stall Buffer in a stalled state. The worker then proceeds as in Terminates (case 2) above.
4. Enables (A enables B): An activity A (after termination or otherwise) may enable a stalled activity B, in which case the state of B changes to enabled and B is pushed onto the top of the Ready Deque.

4.1 Time Complexity Analysis

The time complexity of this affinity driven distributed scheduling algorithm, in terms of the number of throws during execution, is presented below. Each throw represents an attempt by a worker (thief) to steal an activity from either another worker (victim) or the FAB at the same place.

Lemma 1. Consider a strict place-annotated computation DAG with work per place, T_1^k, being executed by the distributed scheduling algorithm presented in section 4. Then, the execution (finish) time for place k is O(T_1^k/m + Q_r^k/m + Q_e^k/m), where Q_r^k denotes the number of throws when there is at least one ready node at place k and Q_e^k denotes the number of throws when there are no ready nodes at place k. The lower bound on the execution time of the full computation is O(max_k (T_1^k/m + Q_r^k/m)) and the upper bound is O(Σ_k (T_1^k/m + Q_r^k/m)).
Proof Sketch (token-based counting argument): Consider three buckets at each place in which tokens are placed: a work bucket, where a token is placed when a worker at that place executes a node of the computation DAG; a ready-node-throw bucket, where a token is placed when a worker attempts to steal and there is at least one ready node at that place; and a null-node-throw bucket, where a token is placed when a worker attempts to steal and there are no ready nodes at that place (this models the wait time when there is no work at that place). The total finish time of a place can be computed by counting the tokens in these three buckets and by considering load-balanced execution within a place (using randomized work stealing). The upper and lower bounds on the execution time arise from the structure of the computation DAG and the structure of the online schedule generated. The detailed proof is presented in [3].
Next, we compute the bound on the number of tokens in the ready-node-throw bucket using a potential function based analysis [5]. Our unique contribution is in proving the lower and upper bounds of time and message complexity for the multi-place affinity driven distributed scheduling algorithm presented in section 4, which involves both intra-place work stealing and remote-place affinity driven work pushing. Let there be a non-negative potential associated with each ready node in the computation DAG. If the execution of node u enables node v, then edge (u,v) is called the enabling edge and u is called the designated parent of v. The subgraph of the computation DAG consisting of enabling edges forms a tree, called the enabling tree. During the execution of the affinity driven distributed scheduling algorithm (section 4), the weight of a node u in the enabling tree is defined as w(u) = T_{∞,n} − depth(u). For a ready node u, we define φ_i(u), the potential of u at timestep i, as:

φ_i(u) = 3^{2w(u)−1}, if u is assigned;   (4.1a)
φ_i(u) = 3^{2w(u)},   otherwise.          (4.1b)
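A worked check of (4.1), under the definitions above, shows why executing an assigned node strictly decreases the potential: an assigned node u carries potential 3^{2w(u)−1}, while two children it enables one level deeper carry 2 · 3^{2w(u)−2} in total, a 2/3 fraction (the value T = 5 below is an arbitrary illustrative choice):

```python
def w(depth, T_inf_n):
    # Weight of a node in the enabling tree: w(u) = T_inf_n - depth(u).
    return T_inf_n - depth

def phi(depth, T_inf_n, assigned):
    # Potential per (4.1a)/(4.1b): assigned nodes lose one in the exponent.
    e = 2 * w(depth, T_inf_n)
    return 3 ** (e - 1) if assigned else 3 ** e

T = 5
before = phi(2, T, assigned=True)      # assigned node u about to execute: 3^5
after = 2 * phi(3, T, assigned=False)  # two enabled children, one level deeper: 2 * 3^4
print(after, before)                   # 162 243 -> the potential shrinks by 2/3
```

The "assigned" discount of one in the exponent is exactly what makes the drop strict; without it the two children would match the parent's potential.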

All non-ready nodes have 0 potential. The potential at step i, φ_i, is the sum of the potentials of all the ready nodes at step i. When execution begins, the only ready node is the root node, with potential φ_0 = 3^{2T_{∞,n}−1}. At the end the potential is 0, since there are no ready nodes. Let E_i denote the set of workers whose Ready Deque is empty at the beginning of step i, and let D_i denote the set of all other workers, with non-empty Ready Deques. Let F_i denote the set of all ready nodes present across the FABs at all places. The total potential can be partitioned into three parts as follows:

φ_i = φ_i(E_i) + φ_i(D_i) + φ_i(F_i)   (4.2)

Actions such as assignment of a node from a Ready Deque to a worker for execution, stealing nodes from the top of a victim's Ready Deque, and execution of a node all lead to a decrease of potential. The idle workers at a place alternate between stealing from Ready Deques and stealing from the FAB. Thus, 2m throws in a round consist of m throws to other workers' Ready Deques and m throws to the FAB. For randomized work stealing one can use the balls-and-bins game [3] to compute the expected and probabilistic bounds on the number of throws. Using this, one can show that whenever m or more throws occur for getting nodes from the top of the Ready Deques of other workers at the same place, the potential decreases by a constant fraction of φ_i(D_i) with constant probability. The component of potential associated with the FAB at place P_k, φ_i^k(F_i), can be shown to decrease deterministically for m throws in a round. Furthermore, at each place the potential also drops by a constant factor of φ_i^k(E_i). The detailed analysis of the decrease of potential for each component is given in [3]. Analyzing the rate of decrease of potential and using Lemma 1 leads to the following theorem.

Theorem 1. Consider a strict place-annotated computation DAG with work per place k, denoted by T_1^k, being executed by the affinity driven multi-place distributed scheduling algorithm (section 4). Let the critical-path length for the computation be T_{∞,n}. The lower bound on the expected execution time is O(max_k T_1^k/m + T_{∞,n}) and the upper bound is O(Σ_k (T_1^k/m + T_∞^k)). Moreover, for any ε > 0, the lower bound for the execution time is O(max_k T_1^k/m + T_{∞,n} + log(1/ε)) with probability at least 1 − ε. A similar probabilistic upper bound exists.

Proof Sketch: For the lower bound, we analyze the number of throws (to the ready-node-throw bucket) by breaking the execution into phases of Θ(P = mn) throws (O(m) throws per place). It can be shown that, with constant probability, a phase causes the potential to drop by a constant factor. More precisely, between phases i and i + 1, Pr{(φ_i − φ_{i+1}) ≥ (1/4)φ_i} > 1/4 (details in [3]). Since the potential starts at φ_0 = 3^{2T_{∞,n}−1}, ends at zero, and takes integral values, the number of successful phases is at most (2T_{∞,n} − 1) log_{4/3} 3 < 8T_{∞,n}. Thus, the expected number of throws per place is bounded by O(T_{∞,n} · m), and the number of throws is O(T_{∞,n} · m + log(1/ε)) with probability at least 1 − ε (using the Chernoff inequality). Using Lemma 1 we get the lower bound on the expected execution time as O(max_k T_1^k/m + T_{∞,n}). The detailed proof and probabilistic bounds are presented in [3]. For the upper bound, consider the execution of the subgraph of the computation at each place. The number of throws in the ready-node-throw bucket per place can similarly be bounded by O(T_∞^k · m). Further, the place that finishes the execution last can end up with a number of tokens in its null-node-throw bucket equal to the tokens in the work buckets and the ready-node-throw buckets of all other places. Hence, the finish time for this place, which is also the execution time of the full computation DAG, is O(Σ_k (T_1^k/m + T_∞^k)). The probabilistic upper bound can similarly be established using the Chernoff inequality. The following theorem bounds the message complexity of the affinity driven work-stealing algorithm (section 4).
Theorem 2. Consider the execution of a strict place-annotated computation DAG with critical-path length T_{∞,n} by the affinity driven distributed scheduling algorithm (section 4). Then, the total number of bytes communicated across places is O(I · (S_max + n_d)), and the lower bound on the number of bytes communicated within a place has expectation O(m · T_{∞,n} · S_max · n_d), where n_d is the maximum number of dependence edges from the descendants to a parent and I is the number of remote spawns from one place to a remote place. Moreover, for any ε > 0, the probability is at least 1 − ε that the lower bound on the communication overhead per place is O(m · (T_{∞,n} + log(1/ε)) · n_d · S_max). Similar message upper bounds exist.

Proof. First consider inter-place messages. The number of affinity driven pushes to remote places is O(I), each of at most O(S_max) bytes. Further, there can be at most n_d dependencies from remote descendants to a parent, each of which involves communication of a constant, O(1), number of bytes. So, the total inter-place communication is O(I · (S_max + n_d)). Since the randomized work stealing is within a place, the lower bound on the expected number of steal attempts per place is O(m · T_{∞,n}), with each steal attempt requiring S_max bytes of communication within a place. Further, there can be communication when a child thread enables its parent and puts the parent into the child processor's Ready Deque. Since this can happen n_d times for each time the parent is stolen, the communication involved is at most n_d · S_max. So, the expected total intra-place communication across all places is O(n · m · T_{∞,n} · S_max · n_d). The probabilistic bound can be derived using the Chernoff inequality and is omitted for brevity. Similarly, expected and probabilistic upper bounds can be established for the communication complexity within the places.
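As a numerical sanity check of the phase-counting step in the proof of Theorem 1: since log_{4/3} 3 ≈ 3.82 < 4, the bound (2T_{∞,n} − 1) · log_{4/3} 3 < 8 · T_{∞,n} indeed holds for every T_{∞,n} ≥ 1:

```python
import math

# log base 4/3 of 3, the per-phase factor in the potential argument
log43_3 = math.log(3) / math.log(4 / 3)     # ~= 3.8188

# (2T - 1) * log_{4/3} 3 < 8T for all positive integer T:
# the gap is 0.36*T + 3.82, always positive.
for T in range(1, 1000):
    assert (2 * T - 1) * log43_3 < 8 * T

print(round(log43_3, 3))                    # 3.819
```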

5 Results and Analysis

We implemented our distributed scheduling algorithm (ADS) and a pure Cilk-style work-stealing scheduler (CWS) using the pthreads (NPTL) API. The code was compiled using gcc 4.1.2 with options -O2 and -m64. Using well-known benchmarks, the performance of ADS was compared with CWS and also with the original Cilk scheduler (http://supertech.csail.mit.edu/cilk/; referred to as CORG in this section). The benchmarks are the following. Heat: Jacobi over-relaxation that simulates heat propagation on a two-dimensional grid for a number of steps [1]. For our scheduling algorithm (ADS), the 2D grid is partitioned uniformly across the available cores. (The Dmax for this benchmark is log(numCols/leafmaxcol), where numCols is the number of columns in the input two-dimensional grid and leafmaxcol is the number of columns processed by a single thread.) Molecular Dynamics (MD): a classical molecular dynamics simulation using the Velocity Verlet time integration scheme; the simulation was carried out on 16K particles for 100 iterations. Conjugate Gradient (CG, from the NPB suite, http://www.nas.nasa.gov/NPB/Software): CG approximates the largest eigenvalue of a sparse, symmetric, positive definite matrix using inverse iteration. The matrix is generated by summing outer products of sparse vectors, with a fixed number of nonzero elements in each generating vector. The benchmark computes a given number of eigenvalue estimates, referred to as outer iterations, using 25 iterations of the CG method to solve the linear system in each outer iteration.
The performance comparison between ADS and CORG was done on an Intel multi-core platform. This platform has 16 cores (2.93 GHz, Intel Xeon 5570, Nehalem architecture) with 8MB L3 cache per chip and around 64GB memory. The Intel Xeon 5570 has NUMA characteristics even though it exposes SMP-style programming. Fig. 4 compares the performance for the Heat benchmark (matrix: 32K ∗ 4K, number of iterations = 100, leafmaxcol = 32). Both ADS and CORG demonstrate strong scalability. Initially, ADS is around 1.9× better than CORG, but later this gap stabilizes at around 1.20×.

5.1 Detailed Performance Analysis

In this section, we analyze the performance gains obtained by our ADS algorithm vs. the Cilk-style scheduling (CWS) algorithm and also investigate the behavior of our algorithm on the Power6 multi-core architecture. Fig. 5 demonstrates the gain in performance of ADS vs. CWS with 16 cores. For CG, a Class B matrix is chosen with parameters: NA = 75K, Non-Zero = 13M, Outer iterations = 75, SHIFT = 60. For Heat, the parameter values chosen are: matrix size = 32K ∗ 4K, number of iterations = 100 and leafmaxcol = 32.
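The Dmax formula noted for Heat can be sketched as a recursive binary column split (an assumed decomposition scheme; the function name is ours):

```python
def d_max(num_cols, leafmaxcol):
    """Depth of the activity tree for a binary column split (assumed scheme):
    halve the column range until it fits in one leaf thread."""
    depth = 0
    while num_cols > leafmaxcol:
        num_cols //= 2
        depth += 1
    return depth  # = log2(num_cols / leafmaxcol) when both are powers of two

print(d_max(4096, 32))   # 7 levels for a 4K-column grid with leafmaxcol = 32
```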

[Figure] Fig. 4. Strong scalability comparison, ADS vs. CORG on Heat: total time (s) on 2, 4, 8 and 16 cores is 1623, 812, 415 and 244 for CORG, and 859, 683, 348 and 195 for ADS.

[Figure] Fig. 5. Performance comparison, ADS vs. CWS on 16 cores (total time in s): CG 45.7 (CWS) vs. 31.9 (ADS); Heat 12.2 vs. 10.6; MD 9.8 vs. 8.9.

[Figure] Fig. 6. Work-stealing and FAB overheads, ADS vs. CWS: CWS_WS_time, ADS_WS_time and ADS_Fab_Overhead (s) for CG, Heat and MD.

While CG has a maximum gain of 30%, MD shows a gain of 16%. Fig. 6 demonstrates the overheads due to work stealing and FAB stealing in ADS and CWS. ADS has lower work-stealing overhead because the work stealing happens only within a place. For CG, the work steal time for ADS (5s) is 3.74× better than for CWS (18.7s). For Heat and MD, the ADS work steal time is 4.1× and 2.8× better, respectively, compared to CWS. ADS has FAB overheads, but this time is very small, around 13% to 22% of the corresponding work steal time. CWS has higher work-stealing overhead because the work stealing happens from any place to any other place; hence, the NUMA delays add up to give a larger work steal time. This demonstrates the superior execution efficiency of our algorithm over CWS.
We measured the detailed characteristics of our scheduling algorithm on the multi-core Power6 platform. This has 16 Power6 cores and a total of 128GB memory. Each core has a 64KB instruction L1 cache and a 64KB L1 data cache, along with a 4MB semi-private unified L2 cache. Two cores on a Power6 chip share an external 32MB L3 cache. Fig. 7 plots the variation of the work stealing time, the FAB stealing time and the total time with changing configurations of the multi-place setup, for the MD benchmark. With a constant total number of cores = 16, the configurations, in the format (number of places * number of processors per place), chosen are: (a) (16 ∗ 1), (b) (8 ∗ 2), (c) (4 ∗ 4), and (d) (2 ∗ 8). As the number of places increases from 2 to 8, the work steal time increases from 3.5s to 80s, as the average number of work steal attempts increases from 140K to 4M. For 16 places, the work steal time falls to 0, as there is only a single processor per place, so work stealing does not happen. The FAB steal time, however, increases monotonically from 0.3s for 2 places to 110s for 16 places.
In the (16 ∗ 1) configuration, the processor at a place gets activities to execute only through remote pushes onto its place. Hence, the FAB steal time at the place becomes high, as the number of FAB attempts (300M average) is very large, while the successful FAB attempts are very few (1400 average). With an increasing number of places from 2 to 16, the total time increases from 189s to 425s, due to the increase in work stealing and/or FAB steal overheads. Fig. 8 plots the work stealing time and FAB stealing time variation with changing multi-place configurations for the CG benchmark (using a Class C matrix with parameter values: NA = 150K, Non-Zero = 13M, Outer Iterations = 75 and SHIFT = 60). In this case, the work steal time increases from 12.1s (for (2 ∗ 8)) to 13.1s (for (8 ∗ 2)) and then falls to 0 for the (16 ∗ 1) configuration. The FAB time initially increases slowly from 3.6s to 4.1s but then jumps to 81s for the (16 ∗ 1) configuration. This behavior can be explained as in the case of the MD benchmark (above).

Affinity Driven Distributed Scheduling Algorithm for Parallel Computations

177

Fig. 9 plots the work stealing time and FAB stealing time variation with changing multi-place configurations for the Heat benchmark (using parameter values: matrix size = 64K ∗ 8K, iterations = 100 and leafmaxcol = 32). The variation of the work stealing time, FAB stealing time and total time follows the same pattern as for MD.

[Figure] Fig. 7. WS and FAB overheads variation, MD: ADS_WS_time, ADS_FAB_time and ADS_Total_Time (s) for configurations (2 ∗ 8), (4 ∗ 4), (8 ∗ 2) and (16 ∗ 1).

[Figure] Fig. 8. WS and FAB overheads variation, CG: same quantities and configurations.

[Figure] Fig. 9. WS and FAB overheads variation, Heat: same quantities and configurations.

Fig. 10 gives the variation of the average and maximum Ready Deque space consumption across all processors, and the average and maximum FAB space consumption across places, with changing configurations of the multi-place setup. As the number of places increases from 2 to 16, the FAB average space first increases from 4 to 7 stack frames and then decreases to 6.4 stack frames. The maximum FAB space usage increases from 7 to 9 stack frames but then returns to 7 stack frames. The maximum Ready Deque space consumption increases from 11 to 12 stack frames but returns to 9 stack frames for 16 places, while the average Ready Deque space monotonically decreases from 9.69 to 8 stack frames. The Dmax for this benchmark setup is 11 stack frames, which leads to 81% maximum FAB utilization and roughly 109% Ready Deque utilization. Fig. 12 gives the variation of FAB space and Ready Deque space with changing configurations for the CG benchmark (Dmax = 13). Here, the FAB utilization is very low and remains so with varying configurations. The Ready Deque utilization stays close to 100% with varying configurations. Fig. 11 gives the variation of FAB space and Ready Deque space with changing configurations for the Heat benchmark (Dmax = 12). Here, the FAB utilization is high (close to 100%) and remains so with varying configurations. The Ready Deque utilization also stays close to 100% with varying configurations. This empirically demonstrates that our distributed scheduling algorithm has efficient space utilization as well.

[Figures 10–12 (bar charts): Ready_Deque_Avg, Ready_Deque_Max, FAB_Avg and FAB_Max in number of stack frames, plotted against (Num Places ∗ Num Procs Per Place) configurations (2 ∗ 8), (4 ∗ 4), (8 ∗ 2), (16 ∗ 1), for the MD, Heat and CG benchmarks respectively.]

Fig. 10. Space Util - MD. Fig. 11. Space Util - HEAT. Fig. 12. Space Util - CG

178

A. Narang et al.

6 Conclusions and Future Work

We have addressed the challenging problem of affinity-driven online distributed scheduling of parallel computations. We have provided theoretical analysis of the time and message complexity bounds of our algorithm. On well-known benchmarks, our algorithm demonstrates around 16% to 30% performance gain over typical Cilk-style scheduling. Detailed experimental analysis shows the scalability of our algorithm along with efficient space utilization. This is the first such work for affinity-driven distributed scheduling of parallel computations in a multi-place setup. In the future, we plan to look into space-time tradeoffs and Markov-chain based modelling of the distributed scheduling algorithm.


Temporal Specifications for Services with Unboundedly Many Passive Clients

Shamimuddin Sheerazuddin

The Institute of Mathematical Sciences, C.I.T. Campus, Chennai 600 113, India
[email protected]

Abstract. We consider a client-server system in which an unbounded (finite but unknown) number of clients request service from the server. The system is passive in that there is no further interaction between send-request and receive-response. We give an automata-based model for such systems and a temporal logic in which to frame specifications. We show that the satisfiability and model checking problems for the logic are decidable.

Keywords: temporal logic, web services, client-server systems, decidability, model checking.

1 Introduction

In [DSVZ06], the authors consider a Loan Approval Service [TC03], which consists of Web Services, called peers, that interact with each other via asynchronous message exchanges. One of the peers is designated as Loan Officer, the loan disbursal authority. It receives a loan request, say for 5,000, from a customer, checks her credit rating with a third party and approves or rejects the request according to some lending policy. The loan approval problem becomes doubly interesting when the disbursal officer is confronted with a number of customers asking for loans of different sizes, say ranging from 5,000 to 500,000. In such a scenario, with a bounded pool of money to loan out, it may be possible that a high loan request is accepted when there is no other pending request, or rejected when accompanied by pending requests of lower loan sizes. This is an example of a system composed of unboundedly many agents: how many processes are active at any system state is not known at design time but is determined only at run time. Thus, though at any point in time only finitely many agents may be participating, there is no uniform bound on the number of agents. Design and verification of such systems are becoming increasingly important in distributed computing, especially in the context of Web Services. Since services handling unknown clients need to make decisions based upon request patterns that are not pre-decided, they need to conform to specific service policies that are articulated at design time. Due to concurrency and unbounded state information, the design and implementation of such services becomes complex and hence subject to logical flaws. Thus, there is a need for formal methods in specifying service policies and verifying that systems implement them correctly.
M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 179–190, 2011. © Springer-Verlag Berlin Heidelberg 2011


Formal methods come in many flavours. One special technique is that of model checking [CGP00]. The system to be checked is modelled as a finite-state system, and the properties to be verified are expressed as constraints on the possible computations of the model. This allows algorithmic tools to be employed in verifying that the model is indeed correct with respect to those properties. When we find violations, we re-examine the finite-state abstraction, leading to a finer model, perhaps also refine the specifications, and repeat the process. Finding good finite-state abstractions forms an important part of the approach. Modelling systems with unboundedly many clients is fraught with difficulties. Since we have no bound on the number of active processes, the state space is infinite. A finite-state abstraction may kill the very feature we wish to check, since the service policies we are interested in involve unbounded numbers of clients. On the other hand, finite presentation of such systems with infinite state spaces and/or their computations comes with a considerable amount of difficulty. Propositional temporal logics have been extensively used for specifying safety and liveness requirements of reactive systems. Backed by a set of tools with theorem-proving and model-checking capabilities, temporal logic is a natural candidate for specifying service policies. In the context of Web Services, temporal logics have been extended with mechanisms for specifying message exchange between agents. There are several candidate temporal logics for message-passing systems, but these work with an a priori fixed number of agents, and for any message, the identities of the sender and the receiver are fixed at design time. We need to extend such logics with means for referring to agents in some more abstract manner (than by name). On the other hand, the client-server interaction needs a far simpler communication facility than what is typically considered in any peer-to-peer communication model.
A natural and direct approach to referring to unknown clients is to use logical variables: rather than work with atomic propositions p, we use monadic predicates p(x) to refer to property p being true of client x. We can quantify over such x existentially and universally to specify policies relating to clients. We are thus naturally led to the realm of Monadic First-Order Temporal Logic (MFOTL) [GHR94]. In fact, it is easily seen that MFOTL is expressive enough to frame almost every requirement specification of client-server systems of the kind discussed above. Unfortunately, MFOTL is undecidable [HWZ00], and we need to limit the expressiveness so that we have a decidable verification problem. We propose a fragment of MFOTL for which satisfiability and model checking are decidable. Admittedly, this language is weak in expressive power, but our claim is that reasoning in such a logic already suffices to express a broad range of service policies in systems with unboundedly many clients. Thus, we need to come up with a formal model that finitely presents infinite state spaces, and a specification language that involves quantification and temporal modalities, while ensuring that model checking can be done. This is the issue we address in this paper, by simplifying the setting a little. We consider the case where there is only one server dealing with unboundedly many clients that do not


communicate with each other, and propose a class of automaton models for passive clientele. A client is passive in that it simply sends a request to the server and waits for the service (or an answer that the service cannot be provided). The client has no further interaction with the server. We suggest that it suffices to specify such clients by boolean formulas over unary predicates. State formulas of the server are then monadic first-order sentences over such predicates, and server properties are specified by temporal formulas built from such sentences. In the literature [CHK+01], these services with passive clientele are called discrete services. We call them Services for Passive Clients (SPS). Before we proceed to technicalities, we wish to emphasize that what is proposed in this paper is in the spirit of a framework rather than a definitive temporal logic for services with unbounded clientele. The combination of modalities as well as the structure of models should finally be decided only on the basis of applications. Even though essentially abstract, our paradigm continues the research on Web Service composition [BHL+02], [NM02], work on Web Service programming languages [FGK02] and the AZTEC prototype [CHK+01]. There have been many automaton-based models for Web Services but, as far as we know, none of them incorporates unboundedly many clients. [BFHS03] models Web Services as Mealy machines; [FBS04] models Web Services as Büchi automata and focuses on message passing between them. The Roman model [BCG+03] focuses on an abstract notion of activities, and in essence models Web Services as finite-state machines with transitions labelled by these activities. The Colombo model [BCG+03] combines the elements of [FBS04] and [BCG+03] along with the OWL-S model [NM02] of Web Services, and accounts for data in messages too. Decidable fragments of MFOTL are few and far between.
As far as we know, the monodic fragment [HWZ00], [HWZ01] is the only decidable one found in the literature. The decidability crucially hinges on the fact that there is at most one free variable in the scope of temporal modalities. Later, it was found that the packed monodic fragment with equality is decidable too [Hod02]. In the realm of branching-time logics with first-order extensions, it is shown in [HWZ02] that, by restricting applications of first-order quantifiers to state (path-independent) formulas, and applications of temporal operators and path quantifiers to formulas with at most one free variable, we can obtain decidable fragments.

2 The Client Server Model

Fix CN , a countable set of client names. In general, this set would be recursively generated using a naming scheme, for instance using sequence numbers and timestamps generated by processes. We choose to ignore this structure for the sake of technical simplicity. We will use a, b etc. with or without subscripts to denote elements of CN .


2.1 Passive Clients

Fix Γ0, a finite service alphabet. We use u, v etc. to denote elements of Γ0, and they are thought of as types of services provided by a server. The extended alphabet is the set Γ = {req_u, ans_u | u ∈ Γ0} ∪ {τ}. These refer to requests for such services and answers to such requests, as well as the "silent" internal action τ. Elements of Γ0 represent logical types of services that the server is willing to provide. This means that when two clients ask for a service of the same type, given by an element of Γ0, the server can tell them apart only by their names. We could in fact then insist that the server's behaviour be identical towards both, but we do not make such an assumption, to allow for generality. We define below systems of services that handle passive clientele. Servers are modelled as state transition systems which identify clients only by the type of service they are associated with. Thus, transitions are associated with client types rather than client names.

Definition 2.1. A Service for Passive Clients (SPS) is a tuple A = (S, δ, I, F) where S is a finite set of states, δ ⊆ (S × Γ × S) is a server transition relation, I ⊆ S is the set of initial states and F the set of final states of A.

Without loss of generality we assume that in every SPS the transition relation δ is such that for every s ∈ S, there exist r ∈ Γ and s′ ∈ S with (s, r, s′) ∈ δ. The use of the silent action τ makes this an easy assumption. Note that an SPS is a finite-state description. A transition of the form (s, req_u, s′) refers implicitly to a new client of type u rather than to any specific client name. The meaning of this is provided in the run-generation mechanism described below. A configuration of an SPS A is a triple (s, C, χ) where s ∈ S, C is a finite subset of CN and χ : C → Γ0. Thus a configuration specifies the control state of the server, as well as the finite set of active clients at that configuration and their types.
We use the convention that when C = ∅, the graph of χ is the empty set as well. Let Ω_A denote the set of all configurations of A; note that it is this infinite configuration space that is navigated by behaviours of A. We can extend the transition relation δ to configurations, =⇒ ⊆ (Ω_A × Γ × Ω_A), as follows. (s, C, χ) =r⇒ (s′, C′, χ′) iff (s, r, s′) ∈ δ and the following conditions hold:
– when r = τ, C′ = C and χ′ = χ;
– when r = req_u, C′ = C ∪ {a}, χ′(a) = u and χ′↾C = χ, where a is the least element of CN − C;
– when r = ans_u, X = {a ∈ C | χ(a) = u} ≠ ∅, C′ = C − {a} where a is the least element in the enumeration of X, and χ′ = χ↾C′.
A configuration (s, C, χ) is said to be initial if s ∈ I and C = ∅. A run of an SPS A is an infinite sequence of configurations ρ = c0 r1 c1 · · · rn cn · · ·, where c0 is initial, and for all j > 0, c_{j−1} =r_j⇒ c_j. Let R_A denote the set of runs of A. Note that runs have considerable structure. For instance, A may have an infinite path generated by a self-loop of the form (s, req_u, s) in δ, which corresponds to an infinite sequence of service requests of a particular type. Thus, these systems


have interesting reachability properties. But, as we shall see, our main use of these systems is as models of a temporal logic, and since the logic is rather weak, the information present in the runs will be under-utilized. Language of an SPS: Given a run ρ = c0 r1 c1 r2 · · · rn cn · · ·, we define inf(ρ) as the set of states which occur infinitely often on the run with an empty client set. That is, inf(ρ) = {q ∈ S | ∃∞ i ∈ ω, ci = (q, ∅, ∅)}. A run ρ is good if inf(ρ) ∩ F ≠ ∅. The language of A, Lang(A) ⊆ Γ^ω, is then defined as follows: Lang(A) = {r1 r2 · · · rn · · · | there is a good run ρ = c0 r1 c1 r2 · · · rn cn · · ·}. Once we have fixed goodness properties for the runs R_A of a given system A, it is trivially seen that SPS are closed under union and intersection. Also, it can be observed that once an external bound on CN is assumed, the configuration set Ω_A becomes finite and the standard decision problems for A become decidable.
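The configuration-transition rules for req_u and ans_u can be sketched in code. This is a minimal illustrative simulator, not part of the paper: the function name `step` and the use of integers as stand-ins for the countable name set CN are assumptions, and the server transition relation δ is not consulted — only the (C, χ) bookkeeping of the run-generation mechanism is shown.

```python
def step(config, action):
    """One configuration transition of an SPS. A configuration is
    (state, chi), where chi maps active client names (integers standing in
    for CN) to their service type. The server-state component is threaded
    through unchanged; a full simulator would also check delta."""
    state, chi = config
    kind, u = action           # e.g. ("req", "l"), ("ans", "h"), ("tau", None)
    chi = dict(chi)            # copy: configurations are values, not shared state
    if kind == "req":
        a = 0                  # a fresh client gets the least unused name
        while a in chi:
            a += 1
        chi[a] = u
    elif kind == "ans":
        # answer the least-named pending client of the requested type
        pending = sorted(n for n, t in chi.items() if t == u)
        if not pending:
            raise ValueError("no pending client of type %r" % u)
        del chi[pending[0]]
    # kind == "tau": client set and typing are unchanged
    return (state, chi)
```

Starting from the initial configuration with C = ∅, requests allocate the least free name and answers remove the least pending client of that type, exactly as in the transition rules above.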

[State diagram: states q0–q6 with transitions labelled req_h, req_l, ans_h, ans_l.]
Fig. 1. An SPS for Loan Approval Web Service System: A1

3 Loan Approval Service

We give an example SPS for an automated Loan Approval Web Service, which is a kind of discrete service. In this composite system, there is a designated Web Service acting as Loan Officer, which admits loan requests of various sizes, say h depicting high (large) amounts and l depicting low (small) amounts. Depending on the number of loan requests (high and low), and according to an a priori fixed loan disbursal policy, the loan officer accepts or rejects the pending requests. The behaviour of the Loan Officer is captured as an SPS as follows. Let Γ0 = {h, l}, where h denotes a high-end loan and l a low-end loan, and let the corresponding alphabet be Γ = {req_h, req_l, ans_h, ans_l}. The Loan Approval System can then be modelled as the SPS A1 = (S1, δ1, I1, F1) shown in Figure 1. Here, we briefly describe the working of the automaton. A1, starting from q0, keeps


track of at most two low-amount requests. q1 is the state with one pending request, whereas q4 is the state with two pending requests. Whenever the system gets a high-amount request, it seeks to dispose of it at the earliest, and tries to avoid taking up a low request as long as a high one is pending with it. But it may not succeed all the time, i.e., when the automaton reaches q6, it is possible that it loops back to the initial state q0 with one or more high requests pending, and then takes up low requests. It is not difficult to see that there are runs of A1 that satisfy the following property, ψ1, and runs that do not. ψ1 is expressed in English as "whenever there is a request of type low there is an answer of type low in the next instant". Now, suppose there is another property ψ2 described as "there is no request of type low taken up as long as there is a high request pending". If we want to avoid violating ψ2 in the Loan Approval System, then we need to modify A1 and define A2 = (S2, δ2, I2, F2) as in Figure 2.

[State diagram: states q0–q7 with transitions labelled req_h, req_l, ans_h, ans_l.]

Fig. 2. A modiﬁed SPS for Loan Approval Web Service System: A2

Furthermore, if we want to make sure that there are no two low requests pending at any time, i.e., that our model satisfies ψ1, we modify A2 and describe A3 in Figure 3 as follows. We shall see later that these properties can be described easily in a decidable logic which we call LSPS. Notice that, in an SPS, the customer (user or client) simply sends a request of some particular type and waits for an answer. Things become interesting when the customer seeks to interact with the loan officer (server) between the send-request and the receive-response. In this case, the complex patterns of interaction between the client and the server have to be captured by a stronger automaton model. We shall tackle this issue in a separate paper.


[State diagram: states q0, q1, q2, q3, q6, q7 with transitions labelled req_h, req_l, ans_h, ans_l.]

Fig. 3. Another modiﬁed SPS for Loan Approval Web Service System: A3

4 LSPS

In this section we describe a logical language to specify and verify SPS-like systems. Such a language has two complementary dimensions. One, captured by an MFO fragment, talks about the plurality of clients asking for a variety of services. The other, captured by an LTL fragment, talks about the temporal variation of the services being rendered. Furthermore, the MFO fragment has to be multi-sorted to cover the multiplicity of service types. Keeping these issues in mind, we frame a logical language which we call LSPS. LSPS is a cross between LTL and multi-sorted MFO. In the case of LTL, atomic formulas are propositional constants which have no further structure. In LSPS, there are two kinds of atomic formulas: basic server properties from Ps, and MFO sentences over client properties Pc. Consequently, these formulas are interpreted over sequences of MFO structures juxtaposed with LTL models.

4.1 Syntax and Semantics

At the outset, we fix Γ0, a finite set of client types. The set of client formulas is defined over a countable set of atomic client predicates Pc, which is composed of disjoint sets Pc^u of predicates of type u, for each u ∈ Γ0. Also, let Var be a countable supply of variable symbols and CN be a countable set of client names. CN is divided into disjoint sets of types from Γ0 via λ : CN → Γ0. Similarly, Var is divided using Π : Var → Γ0. We use x, y to denote elements of Var and a, b for elements of CN. Formally, the set of client formulas Φ is: α, β ∈ Φ ::= p(x : u), p ∈ Pc^u | x = y, x, y ∈ Var_u | ¬α | α ∨ β | (∃x : u)α.


Let SΦ be the set of all sentences in Φ. The server formulas are then defined as follows: ψ ∈ Ψ ::= q ∈ Ps | ϕ ∈ SΦ | ¬ψ | ψ1 ∨ ψ2 | ◯ψ | ψ1 U ψ2. This logic is interpreted over sequences of MFO models composed with LTL models. Formally, a model is a triple M = (ν, D, I) where
– ν = ν0ν1 · · ·, where ∀i ∈ ω, νi ⊆fin Ps, gives the local properties of the server at instant i,
– D = D0D1D2 · · ·, where ∀i ∈ ω, Di = (Di^u)_{u∈Γ0} with Di^u ⊆fin CN_u, gives the identities of the clients of each type being served at instant i, and
– I = I0I1I2 · · ·, where ∀i ∈ ω, Ii = (Ii^u)_{u∈Γ0} and Ii^u : Di^u → 2^{Pc^u} gives the properties satisfied by each live agent at the i-th instant, in other words, the corresponding states of live agents. Alternatively, Ii^u can be given in the equivalent form Ii^u : Di^u × Pc^u → {⊤, ⊥}.

Satisfiability Relations |=, |=Φ. Let M = (ν, D, I) be a model and π : Var → CN a partial map consistent with λ and Π. Then the relations |= and |=Φ can be defined, by induction over the structure of ψ and α, respectively, as follows.
– M, i |= q iff q ∈ νi.
– M, i |= ϕ iff M, ∅, i |=Φ ϕ.
– M, i |= ¬ψ iff M, i ̸|= ψ.
– M, i |= ψ ∨ ψ′ iff M, i |= ψ or M, i |= ψ′.
– M, i |= ◯ψ iff M, i + 1 |= ψ.
– M, i |= ψ U ψ′ iff ∃j ≥ i, M, j |= ψ′ and ∀i ≤ i′ < j, M, i′ |= ψ.

– M, π, i |=Φ p(x : u) iff π(x) ∈ Di^u and Ii(π(x), p) = ⊤.
– M, π, i |=Φ x = y iff π(x) = π(y).
– M, π, i |=Φ ¬α iff M, π, i ̸|=Φ α.
– M, π, i |=Φ α ∨ β iff M, π, i |=Φ α or M, π, i |=Φ β.
– M, π, i |=Φ (∃x : u)α iff ∃a ∈ Di^u such that M, π[x → a], i |=Φ α.

5 Specification Examples Using LSPS

As claimed in the previous section, we would like to show that our logic LSPS adequately captures many of the facets of SPS-like systems. We consider the Loan Approval Web Service, which has already been explained with a number of examples, and frame a number of specifications to demonstrate the use of LSPS. For the Loan Approval System, the client types are Γ0 = {h, l} and the client properties are Pc = {req_h, req_l, ans_h, ans_l}, where h means a loan request of type high and l means a loan request of type low. For this system, we can write a few simple specifications, viz.: initially there are no pending requests; whenever there is a request of type low, there is an approval of type low in the next instant; no request of type low is taken up as long as a high request is pending; and there is at least one request of each type pending at all times. In LSPS these can be framed as follows:

– ψ0 = ¬((∃x : h)req_h(x) ∨ (∃x : l)req_l(x)),
– ψ1 = □((∃x : l)req_l(x) ⊃ ◯(∃y : l)ans_l(y)),
– ψ2 = □((∃x : h)req_h(x) ⊃ ¬(∃y : l)req_l(y)),
– ψ3 = □((∃x : l)req_l(x) ∨ (∃y : h)req_h(y)).

Note that none of these formulas makes use of the equality (=) predicate. Using =, we can make stronger statements like at all times there is exactly one pending request of type high and at all times there is at most one pending request of type high. These can be expressed in LSPS as follows:
– ψ4 = □((∃x : h)(req_h(x) ∧ (∀y : h)(req_h(y) ⊃ x = y))),
– ψ5 = □(¬(∃x : h)req_h(x) ∨ (∃x : h)(req_h(x) ∧ (∀y : h)(req_h(y) ⊃ x = y))).
In the same vein, using =, we can count the requests of each type and say more interesting things. For example, if ϕ_2h asserted at a point means that there are at most 2 requests of type h pending, then we can frame the following formula: ψ6 = □(ϕ_2h ⊃ (◯ϕ_2h ⊃ ◯□ϕ_2h)), which means that if there are at most two pending requests of type high at successive instants, then thereafter the number stabilizes. Unfortunately, owing to the lack of provision for free variables in the scope of temporal modalities, we cannot write specifications which seek to match requests and approvals. Here is a sample: □(∀x)(req_u(x) ⊃ ◇ans_u(x)), which means that if there is a request of type u at some point of time, then the same is approved some time in the future. If we allow indiscriminate application of quantification over temporal modalities, it leads to undecidable logics. As we are aware, even two free variables in the scope of temporal modalities allow us to encode undecidable tiling problems. The challenge is to come up with appropriate constraints on specifications which allow us to express interesting properties while remaining decidable to verify.
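For intuition, the pointwise semantics of sentences such as ψ1 can be evaluated over a finite prefix of a model. The sketch below is a simplification under stated assumptions: the predicate names ("req_l", "ans_l") and helper names (`exists`, `always`, `nxt`) are hypothetical, the infinite model is truncated to a finite prefix, and the next modality is treated as vacuously true at the last instant of the prefix.

```python
# An instant maps a predicate name such as "req_l" to the set of client
# names satisfying it; a model is a finite list of instants.

def exists(trace, i, pred):
    """(exists x : u) p(x) at instant i: some live client satisfies p."""
    return bool(trace[i].get(pred, set()))

def always(trace, prop):
    """The box modality, restricted to a finite prefix of the model."""
    return all(prop(trace, i) for i in range(len(trace)))

def nxt(trace, i, prop):
    """The next modality; vacuously true at the last instant of the prefix."""
    return i + 1 >= len(trace) or prop(trace, i + 1)

# psi_1: whenever there is a low request, a low answer follows next instant
def psi1(trace):
    return always(trace, lambda t, i:
                  not exists(t, i, "req_l")
                  or nxt(t, i, lambda t2, j: exists(t2, j, "ans_l")))
```

A prefix in which every low request is answered at the next instant satisfies `psi1`; one in which a low request is followed by no low answer does not.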

6 Satisfiability and Model Checking for LSPS

We settle the satisfiability issue for LSPS using automata-theoretic techniques, first proposed by Vardi and Wolper [VW86]. Let ψ0 be an LSPS formula. We compute a formula automaton Aψ0 such that the following holds.

Lemma 6.1. ψ0 is satisfiable iff Lang(Aψ0) is non-empty.

From the given LSPS formula ψ0, we can obtain Pψ0, the set of all MFO predicates occurring in ψ0, and Varψ0, the set of all variable symbols occurring in ψ0. Using these two sets we can generate all possible MFO models at the atom level. Now, these complex atoms, which incorporate MFO valuations as well as LTL valuations, are used to construct a Büchi automaton in the standard manner to generate all possible models of ψ0. Then, the following is immediate.

Lemma 6.2. Given an LSPS formula ψ0 with |ψ0| = n, the satisfiability of ψ0 can be checked in time 2^(O(n·r·2^k)), where r is the number of variable symbols occurring in ψ0 and k is the number of predicate symbols occurring in ψ0.


In order to specify SPS, in which clients do nothing but send a request of type u and wait for an answer, the most we can say about a client x is whether a request from x is pending or not. So the set of client properties is Pc = {req_u, ans_u | u ∈ Γ0}. When req_u(x) holds at some instant i, it means there is a pending request of type u from x at i. When ans_u(x) holds at i, it means either there was no request from x or the request from x has already been answered. That is, by definition, ans_u(x) = ¬req_u(x). For this sublogic, that is, LSPS with Pc = {req_u | u ∈ Γ0}, we assert the following theorem, which can be inferred directly from Lemma 6.2.

Theorem 6.3. Let ψ0 be an LSPS formula with |Vψ0| = r, |Γ0| = k and |ψ0| = n. Then satisfiability of ψ0 can be checked in time O(2^(n·r·2^k)).

6.1 Model Checking Problem for LSPS

The goal of this section is to formulate the model checking problem for LSPS and show that it is decidable. We again solve this problem using the so-called automata-theoretic approach. In this setting, the client-server system is modelled as an SPS A, and the specification is given by a formula ψ0 in LSPS. The model checking problem is to check whether the system A satisfies the specification ψ0, denoted by A |= ψ0. In order to do this, we bound the SPS using ψ0 and define an interpreted version.

Bounded Interpreted SPS: Let A = (S, δ, I, F) be an SPS and ψ0 a specification in LSPS. From ψ0 we get Vu(ψ0), for each u ∈ Γ0. Now, let n = (Σu ru) · k, where |Γ0| = k and |Vu(ψ0)| = ru; n is the bound for the SPS. Now, for each u ∈ Γ0, let CN_u = {(i, u) | 1 ≤ i ≤ ru} and CN = ∪_u CN_u. For each u, define the family 𝒞𝒩_u = {{(j, u) | 1 ≤ j ≤ i} | 1 ≤ i ≤ ru} ∪ {∅}. Thereafter, define 𝒞𝒩 = Π_{u∈Γ0} 𝒞𝒩_u; then CN = ∪_{C∈𝒞𝒩} C. Now we are in a position to define an interpreted form of bounded SPS. The interpreted SPS is a tuple A = (Ω, ⇒, I, F, Val), where Ω = S × 𝒞𝒩, I = {(s, C) | s ∈ I, C = ∅}, F = {(s, C) | s ∈ F, C = ∅}, Val : Ω → (2^Ps × 𝒞𝒩), and ⇒ ⊆ Ω × Γ × Ω is given as follows: (s, C) =r⇒ (s′, C′) iff (s, r, s′) ∈ δ and the following conditions hold:
– when r = τ, C′ = C;
– when r = req_u, CN_u − C ≠ ∅, and if a ∈ CN_u − C is the least in the enumeration, then C′ = C ∪ {a};
– when r = ans_u, X = C ∩ CN_u ≠ ∅, and C′ = C − {a}, where a ∈ X is the least in the enumeration.
Note that |𝒞𝒩| = Π_{u∈Γ0}(ru) < r^k. Now, if |S| = l, then |Ω| = O(l · r^k). We can define the language of the interpreted SPS A as Lang(A) = {Val(c0)Val(c1) · · · | c0 r1 c1 r2 c2 · · · is a good run in A}. We say that A satisfies ψ0 if Lang(A) ⊆ Lang(Aψ0), where Aψ0 is the formula automaton of ψ0. This holds when Lang(A) ∩ Lang(A¬ψ0) = ∅. Therefore, the


complexity of checking emptiness of the product automaton is linear in the product of the sizes of A and Aψ0.

Theorem 6.4. A |= ψ0 can be checked in time O(l · r^k · 2^(n·r·2^k)).
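The emptiness test behind this bound reduces to a graph question on the product automaton: is some accepting state reachable from an initial state and on a cycle? The sketch below is an illustrative reconstruction (the function name and the graph encoding via a successor function are assumptions); this naive variant runs in O(|F|·|E|), whereas the standard nested depth-first search attains the linear bound used in the text.

```python
def buchi_nonempty(init, accepting, succ):
    """Buchi non-emptiness: the product automaton accepts some word iff an
    accepting state is reachable from an initial state and lies on a cycle.
    `succ` maps a state to an iterable of its successors."""
    reach, stack = set(init), list(init)
    while stack:                        # states reachable from the initial set
        s = stack.pop()
        for t in succ(s):
            if t not in reach:
                reach.add(t)
                stack.append(t)
    for f in accepting & reach:         # is a reachable accepting state on a cycle?
        seen, stack = set(), list(succ(f))
        while stack:
            s = stack.pop()
            if s == f:
                return True             # found a cycle through f: a good run exists
            if s not in seen:
                seen.add(s)
                stack.extend(succ(s))
    return False                        # intersection language is empty: A |= psi0
```

Model checking then amounts to calling this on the product of the interpreted SPS with the automaton for ¬ψ0 and reporting A |= ψ0 exactly when it returns False.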

7 Discussion

To conclude, we gave an automaton model for server-client systems with unboundedly many agents offering discrete services [CHK+01], gave a temporal logic to specify such services, and presented an automata-based decidability argument for satisfiability and model checking of the logic. We shall extend the SPS to model session-oriented client-server systems in a subsequent paper. We shall also take up the task of framing appropriate temporal logics to specify such services. This forces us into the realm of MFOTL with free variables in the scope of temporal modalities [Hod02]. We know that too many of those are fatal [HWZ00]. The challenge is to define suitable fragments of MFOTL which are sufficiently expressive as well as decidable. As this paper lacks an automata theory of SPS, we need to explore whether infinite-state reachability techniques such as those of [BEM97] could be used. An extension of the work in this paper would be to define models and logics for systems with multiple servers, say n of them, together serving unboundedly many clients. An orthogonal exercise could be the development of tools to efficiently implement the model checking problem for SPS systems against LSPS specifications, à la MONA [HJJ+95], [KMS00] or SPIN [Hol97], [RH04].

References

[BCG+03] Berardi, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Mecella, M.: Automatic composition of E-services that export their behavior. In: Orlowska, M.E., Weerawarana, S., Papazoglou, M.P., Yang, J. (eds.) ICSOC 2003. LNCS, vol. 2910, pp. 43–58. Springer, Heidelberg (2003)
[BEM97] Bouajjani, A., Esparza, J., Maler, O.: Reachability analysis of pushdown automata: Application to model-checking. In: Mazurkiewicz, A., Winkowski, J. (eds.) CONCUR 1997. LNCS, vol. 1243, pp. 135–150. Springer, Heidelberg (1997)
[BFHS03] Bultan, T., Fu, X., Hull, R., Su, J.: Conversation specification: a new approach to design and analysis of e-service composition. In: WWW, pp. 403–410 (2003)
[BHL+02] Burstein, M.H., Hobbs, J.R., Lassila, O., Martin, D.L., McDermott, D.V., McIlraith, S.A., Narayanan, S., Paolucci, M., Payne, T.R., Sycara, K.P.: DAML-S: Web service description for the semantic web. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS, vol. 2342, pp. 348–363. Springer, Heidelberg (2002)
[CGP00] Clarke, E.M., Grumberg, O., Peled, D.: Model Checking. MIT Press, Cambridge (2000)

190

S. Sheerazuddin

[CHK+01] Christophides, V., Hull, R., Karvounarakis, G., Kumar, A., Tong, G., Xiong, M.: Beyond discrete E-services: Composing session-oriented services in telecommunications. In: Casati, F., Georgakopoulos, D., Shan, M.-C. (eds.) TES 2001. LNCS, vol. 2193, pp. 58–73. Springer, Heidelberg (2001)
[DSVZ06] Deutsch, A., Sui, L., Vianu, V., Zhou, D.: Verification of communicating data-driven web services. In: PODS, pp. 90–99 (2006)
[FBS04] Fu, X., Bultan, T., Su, J.: Conversation protocols: a formalism for specification and verification of reactive electronic services. Theor. Comput. Sci. 328(1-2), 19–37 (2004)
[FGK02] Florescu, D., Grünhagen, A., Kossmann, D.: XL: an XML programming language for web service specification and composition. In: WWW, pp. 65–76 (2002)
[GHR94] Gabbay, D.M., Hodkinson, I.M., Reynolds, M.A.: Temporal Logic. Part 1. Clarendon Press (1994)
[HJJ+95] Henriksen, J.G., Jensen, J.L., Jørgensen, M.E., Klarlund, N., Paige, R., Rauhe, T., Sandholm, A.: Mona: Monadic second-order logic in practice. In: Brinksma, E., Steffen, B., Cleaveland, W.R., Larsen, K.G., Margaria, T. (eds.) TACAS 1995. LNCS, vol. 1019, pp. 89–110. Springer, Heidelberg (1995)
[Hod02] Hodkinson, I.M.: Monodic packed fragment with equality is decidable. Studia Logica 72(2), 185–197 (2002)
[Hol97] Holzmann, G.J.: The model checker SPIN. IEEE Trans. Software Eng. 23(5), 279–295 (1997)
[HWZ00] Hodkinson, I.M., Wolter, F., Zakharyaschev, M.: Decidable fragments of first-order temporal logics. Ann. Pure Appl. Logic 106(1-3), 85–134 (2000)
[HWZ01] Hodkinson, I., Wolter, F., Zakharyaschev, M.: Monodic fragments of first-order temporal logics: 2000–2001 A.D. In: Nieuwenhuis, R., Voronkov, A. (eds.) LPAR 2001. LNCS (LNAI), vol. 2250, pp. 1–23. Springer, Heidelberg (2001)
[HWZ02] Hodkinson, I.M., Wolter, F., Zakharyaschev, M.: Decidable and undecidable fragments of first-order branching temporal logics. In: LICS, pp. 393–402 (2002)
[KMS00] Klarlund, N., Møller, A., Schwartzbach, M.I.: Mona implementation secrets. In: Yu, S., Păun, A. (eds.) CIAA 2000. LNCS, vol. 2088, pp. 182–194. Springer, Heidelberg (2001)
[NM02] Narayanan, S., McIlraith, S.A.: Simulation, verification and automated composition of web services. In: WWW, pp. 77–88 (2002)
[RH04] Ruys, T.C., Holzmann, G.J.: Advanced SPIN tutorial. In: Graf, S., Mounier, L. (eds.) SPIN 2004. LNCS, vol. 2989, pp. 304–305. Springer, Heidelberg (2004)
[TC03] IBM Web Services Business Process Execution Language (WSBPEL) TC: Web services business process execution language version 1.1. Technical report (2003), http://www.ibm.com/developerworks/library/ws-bpel
[VW86] Vardi, M.Y., Wolper, P.: An automata-theoretic approach to automatic program verification (preliminary report). In: LICS, pp. 332–344 (1986)

Relating L-Resilience and Wait-Freedom via Hitting Sets

Eli Gafni¹ and Petr Kuznetsov²

¹ Computer Science Department, UCLA
² Deutsche Telekom Laboratories/TU Berlin

Abstract. The condition of t-resilience stipulates that an n-process program is only obliged to make progress when at least n − t processes are correct. Put another way, the live sets, the collection of process sets such that progress is required if all the processes in one of these sets are correct, are all sets with at least n − t processes. We show that the ability of an arbitrary collection of live sets L to solve distributed tasks is tightly related to the minimum hitting set of L, a minimum-cardinality subset of processes that has a non-empty intersection with every live set. Thus, finding the computing power of L is NP-complete. For the special case of colorless tasks that allow participating processes to adopt input or output values of each other, we use a simple simulation to show that a task can be solved L-resiliently if and only if it can be solved (h − 1)-resiliently, where h is the size of the minimum hitting set of L. For general tasks, we characterize L-resilient solvability of tasks with respect to a limited notion of weak solvability: in every execution where all processes in some set in L are correct, outputs must be produced for every process in some (possibly different) participating set in L. Given a task T, we construct another task TL such that T is solvable weakly L-resiliently if and only if TL is solvable weakly wait-free.

1 Introduction

One of the most intriguing questions in distributed computing is how to distinguish the solvable from the unsolvable. Consider, for instance, the question of wait-free solvability of distributed tasks. Wait-freedom does not impose any restrictions on the scope of considered executions, i.e., a wait-free solution to a task requires every correct process to output in every execution. However, most interesting distributed tasks cannot be solved in a wait-free manner [6,19]. Therefore, much research is devoted to understanding how the power of solving a task increases as the scope of considered executions decreases. For example, t-resilience considers only executions where at least n − t processes are correct (take infinitely many steps), where n is the number of processes in the system. This provides for solving a larger set of tasks than wait-freedom, since in executions in which less than n − t processes are correct, no correct process is required to output.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 191–202, 2011.
© Springer-Verlag Berlin Heidelberg 2011


What tasks are solvable t-resiliently? It is known that this question is undecidable even with respect to wait-free solvability, let alone t-resilient [9,14]. But is the question about t-resilient solvability in any sense diﬀerent than the question about wait-free solvability? If we agree that we “understand” wait-freedom [16], do we understand t-resilience to a lesser degree? The answer should be a resounding no if, in the sense of solving tasks, the models can be reduced to each other. That is, if for every task T we can ﬁnd a task Tt which is solvable wait-free if and only if T is solvable t-resiliently. Indeed, [2,4,8] established that t-resilience can be reduced to wait-freedom. Consequently, the two models are uniﬁed with respect to task solvability. In this paper, we consider a generalization of t-resilience, called L-resilience. Here L stands for a collection of subsets of processes. A set in L is referred to as a live set. In the model of L-resilience, a correct process is only obliged to produce outputs if all the processes in some live set are correct. Therefore, the notion of L-resilience represents a restricted class of adversaries introduced by Delporte et al. [5], described as collections of exact correct sets. L-resilience describes adversaries that are closed under the superset operation: if a correct set is in an adversary, then every superset of it is also in the adversary. We show that the key to understanding L-resilience is the notion of a minimum hitting set of L (called simply hitting set in the rest of the paper). Given a set system (Π, L) where Π is a set of processes and L is a set of subsets of Π, H is a hitting set of (Π, L) if it is a minimum cardinality subset of Π that meets every set in L. Intuitively, in every L-resilient execution, i.e., in every execution in which at least one set in L is correct, not all processes in a hitting set of L can fail. 
Thus, under L-resilience, we can solve the k-set agreement task among the processes in Π, where k is the hitting set size of (Π, L). In k-set agreement, the processes start with private inputs and the set of outputs is a subset of the inputs of size at most k. Indeed, fix a hitting set H of (Π, L) of size k. Every process in H simply posts its input value in the shared memory, and every other process returns the first value it witnesses to be posted by a process in H. Moreover, using a simple simulation based on [2,4], we derive that L does not allow solving (k − 1)-set agreement or any other colorless task that cannot be solved (k − 1)-resiliently. Thus, we can decompose superset-closed adversaries into equivalence classes, one for each hitting set size, where each class agrees on the set of colorless tasks it allows for solving. Informally, colorless tasks allow a process to adopt an input or output value of any other participating process. This restriction gives rise to simulation techniques in which dedicated simulators independently "install" inputs for other, possibly non-participating processes, and then take steps on their behalf so that the resulting outputs are still correct and can be adopted by any participant [2,4]. The ability to do this is a strong simplifying assumption when solvability is analyzed. For the case of general tasks, where inputs cannot be installed independently, the situation is less trivial. We address general tasks by considering a restricted notion of weak solvability, which requires every execution where all the processes in


some set in L are correct to produce outputs for every process in some (possibly different) participating set in L. Note that for colorless tasks, weak solvability is equivalent to regular solvability, which requires every correct process to output. We relate wait-free solvability and L-resilient solvability. Given a task T and a collection of live sets L, we define a task TL such that T is weakly solvable L-resiliently if and only if TL is weakly solvable wait-free. Therefore, we characterize L-resilient weak solvability, as wait-free solvability has already been characterized in [16]. Not surprisingly, the notion of a hitting set is crucial in determining TL. The simulations that relate T and TL are interesting in their own right. We describe an agreement protocol, called the Resolver Agreement Protocol (or RAP), by which agreement is immediately achieved if all processes propose the same value, and otherwise it is achieved if eventually a single correct process considers itself a dedicated resolver. This agreement protocol allows for a novel execution model of wait-free read-write protocols. The model guarantees that an arbitrary number of simulators starting with j distinct initial views appear as j independent simulators, and thus a (j − 1)-resilient execution can be simulated.

The rest of the paper is organized as follows. Section 2 briefly describes our system model. Section 3 presents a simple categorization of colorless tasks. Section 4 formally defines the wait-free counterpart TL of every task T. Section 5 describes RAP, the technical core of our main result. Sections 6 and 7 present the two directions of our equivalence result: from wait-freedom to L-resilience and back. Section 8 overviews related work, and Section 9 concludes the paper by discussing implications of our results and open questions. Most proofs are delegated to the technical report [10].

2 Model

We adopt the conventional shared memory model [12], and only describe the necessary details.

Processes and objects. We consider a distributed system composed of a set Π of n processes {p1, . . . , pn} (n ≥ 2). Processes communicate by applying atomic operations on a collection of shared objects. In this paper, we assume that the shared objects are registers that export only atomic read-write operations. The shared memory can be accessed using atomic snapshot operations [1]. An execution is a pair (I, σ) where I is an initial state and σ is a sequence of process ids. A process that takes at least one step in an execution is called participating. A process that takes infinitely many steps in an execution is said to be correct; otherwise, the process is faulty.

Distributed tasks. A task is defined through a set I of input n-vectors (one input value for each process, where the value is ⊥ for a non-participating process), a set O of output n-vectors (one output value for each process, ⊥ for non-terminated processes) and a total relation Δ that associates each input vector with a set of possible output vectors. A protocol wait-free solves a task T


if in every execution, every correct process eventually outputs, and all outputs respect the specification of T.

Live sets. The correct set of an execution e, denoted correct(e), is the set of processes that appear infinitely often in e. For a given collection of live sets L, we say that an execution e is L-resilient if for some L ∈ L, L ⊆ correct(e). We consider protocols which allow each process to produce output values for every other participating process in the system by posting the values in the shared memory. We say that a process terminates when its output value is posted (possibly by a different process).

Hitting sets. Given a set system (Π, L) where L is a set of subsets of Π, a set H ⊆ Π is a hitting set of (Π, L) if it is a minimum-cardinality subset of Π that meets every set in L. We denote the set of hitting sets of (Π, L) by HS(Π, L), and the size of a hitting set of (Π, L) by h(Π, L). By (Π′, L), Π′ ⊆ Π, we denote the set system that consists of the elements S ∈ L such that S ⊆ Π′.

The BG-simulation technique. In a colorless task (also called a convergence task [4]) processes are free to use each others' input and output values, so the task can be defined in terms of input and output sets instead of vectors. BG-simulation is a technique by which k + 1 processes q1, . . ., qk+1, called simulators, can wait-free simulate a k-resilient execution of any asynchronous n-process protocol [2,4] solving a colorless task. The simulation guarantees that each simulated step of every process pj is either eventually agreed on by all simulators, or the step is blocked forever and one less simulator participates further in the simulation. Thus, as long as there is a live simulator, at least n − k simulated processes accept infinitely many simulated steps. The technique has been later extended to tasks beyond colorless ones [8].

Weak L-resilience. An execution is L-resilient if some set in L contains only correct processes.
We say that a protocol solves a task T weakly L-resiliently if in every L-resilient execution, every process in some participating set L ∈ L eventually terminates, and all posted outputs respect the specification of T. In the wait-free case, when L consists of all n singletons, weak L-resilient solvability stipulates that at least one participating process must be given an output value in every execution. Weak solvability is sufficient to (strongly) solve every colorless task. For general tasks, however, weak solvability does not automatically imply strong solvability, since it only allows processes to adopt the output value of any terminated process, and does not impose any conditions on the inputs.
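Since everything that follows is parameterized by h(Π, L), a brute-force computation of the minimum hitting set size may help fix intuition (worst-case exponential in |Π|, consistent with the NP-completeness mentioned in the abstract). The code below is an illustration only, not part of the paper:

```python
from itertools import combinations

def hitting_set_size(Pi, L):
    """Return h(Π, L): the size of a minimum-cardinality subset of Π
    that intersects every live set in L (brute force over subsets)."""
    for size in range(len(Pi) + 1):
        for H in combinations(sorted(Pi), size):
            if all(set(H) & live for live in L):
                return size
    return None  # unreachable when every live set is non-empty

# t-resilience with n = 4, t = 1: the live sets are all 3-subsets of Π.
Pi = {1, 2, 3, 4}
L = [set(c) for c in combinations(Pi, 3)]
print(hitting_set_size(Pi, L))  # → 2, i.e. t + 1
```

Any single process misses the live set consisting of the other three, while any pair of processes intersects every 3-subset, so h = t + 1 = 2 here.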

3 Colorless Tasks

Theorem 1. A colorless task T is L-resiliently solvable if and only if T is (h(Π, L) − 1)-resiliently solvable.

Theorem 1 implies that L-resilient adversaries can be categorized into n equivalence classes, class h corresponding to hitting sets of size h. Note that two


adversaries that belong to the same class h agree on the set of colorless tasks they are able to solve, and the set includes h-set agreement.
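The h-set agreement protocol sketched in the Introduction — processes in a fixed minimum hitting set H post their inputs, and everyone else adopts a value posted by a member of H — can be illustrated with a toy sequential schedule. The dict-based shared memory and the step order are illustrative only:

```python
def h_set_agreement(inputs, H, schedule):
    """Toy run of the hitting-set protocol: members of H post their
    inputs to a shared board; each process, at its step, decides on
    some value already posted by a member of H (deterministic pick
    for the demo).  Real executions interleave steps arbitrarily."""
    board = {}     # shared memory: id of an H-member -> its input
    decided = {}
    for p in schedule:               # one step of process p
        if p in H and p not in board:
            board[p] = inputs[p]     # post own input
        if board and p not in decided:
            decided[p] = board[min(board)]
    return decided

out = h_set_agreement({1: "a", 2: "b", 3: "c", 4: "d"}, {1, 2}, [2, 3, 1, 4])
print(out)  # at most |H| = 2 distinct decisions, all inputs of H-members
```

Validity holds because every decision is an input posted by a member of H, and at most |H| = h distinct values can ever be posted.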

4 Relating L-Resilience and Wait-Freedom: Definitions

Consider a set system (Π, L) and a task T = (I, O, Δ), where I is a set of input vectors, O is a set of output vectors, and Δ is a total binary relation between them. In this section, we define the "wait-free" task TL = (I′, O′, Δ′) that characterizes L-resilient solvability of T. The task TL is also defined for n processes. We call the processes solving TL simulators and denote them by s1, . . . , sn.

Let X and X′ be two n-vectors, and Z1, . . . , Zn be subsets of Π. We say that X′ is an image of X with respect to Z1, . . . , Zn if for all i such that X′[i] ≠ ⊥, we have X′[i] = {(j, X[j])}j∈Zi. Now TL = (I′, O′, Δ′) guarantees that for all (I′, O′) ∈ Δ′, there exists (I, O) ∈ Δ such that:

(1) ∃S1, . . . , Sn ⊆ Π, each containing a set in L:
(1a) I′ is an image of I with respect to S1, . . . , Sn.
(1b) |{I′[i]}i − {⊥}| = m ⇒ h(∪i: I′[i]≠⊥ Si, L) ≥ m.
In other words, every process participating in TL obtains, as an input, a set of inputs of T for some live set, and all these inputs are consistent with some input vector I of T. Also, if the number of distinct non-⊥ inputs to TL is m, then the hitting set size of the set of processes that are given inputs of T is at least m.

(2) ∃U1, . . . , Un, each containing a set in L: O′ is an image of O with respect to U1, . . . , Un.
In other words, the outputs of TL produced for input vector I′ should be consistent with O ∈ O such that (I, O) ∈ Δ.

Intuitively, every group of simulators that share the same input value will act as a single process. According to the assumptions on the inputs to TL, the existence of m distinct inputs implies a hitting set of size at least m. The asynchrony among the m groups will be manifested as at most m − 1 failures. The failures of at most m − 1 processes cannot prevent all live sets from terminating, as otherwise the hitting set in (1b) is of size at most m − 1.
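The "image" condition used throughout this section can be written as a predicate. In this illustrative encoding (not from the paper), vectors are Python lists and ⊥ is None:

```python
def is_image(X_prime, X, Z):
    """X' is an image of X with respect to Z_1,...,Z_n if for every i
    with X'[i] ≠ ⊥ (here: not None), X'[i] = {(j, X[j]) for j in Z_i}.
    The list/set encoding of vectors is illustrative only."""
    for i, xi in enumerate(X_prime):
        if xi is None:
            continue  # ⊥ entries impose no constraint
        if xi != {(j, X[j]) for j in Z[i]}:
            return False
    return True

# n = 3: simulator 0 sees inputs of {0, 1}; simulator 2 sees {1, 2}.
X = ["a", "b", "c"]
Z = [{0, 1}, set(), {1, 2}]
X_prime = [{(0, "a"), (1, "b")}, None, {(1, "b"), (2, "c")}]
print(is_image(X_prime, X, Z))  # → True
```

Each non-⊥ entry of X′ is thus a consistent fragment of the single underlying vector X, which is exactly what conditions (1a) and (2) require.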

5 Resolver Agreement Protocol

We describe the principal building block of our constructions: the resolver agreement protocol (RAP). RAP is similar to consensus, though it is neither always safe nor always live. To improve liveness, some process may at some point become a resolver, i.e., take the responsibility of making sure that every correct process outputs. Moreover, if there is at most one resolver, then all outputs are the same.


Shared variables: D, initially ⊥
Local variables: resolver, initially false

propose(v)
1  (flag, est) := CA.propose(v)
2  if flag = commit then
3      D := est; return(est)
4  repeat
5      if resolver then D := est
6  until D ≠ ⊥
7  return(D)

resolve()
8  resolver := true

Fig. 1. Resolver agreement protocol: code for each process

Formally, the protocol accepts values in some set V as inputs and exports operations propose(v), v ∈ V, and resolve() that, once called by a process, indicates that the process becomes a resolver for RAP. The propose operation returns some value in V, and the following guarantees are provided: (i) every returned value is a proposed value; (ii) if all processes start with the same input value or some process returns, then every correct process returns; (iii) if a correct process becomes a resolver, then every correct process returns; (iv) if at most one process becomes a resolver, then at most one value is returned.

A protocol that solves RAP is presented in Figure 1. The protocol uses the commit-adopt abstraction (CA) [7] exporting one operation propose(v) that returns (commit, v′) or (adopt, v′), for v, v′ ∈ V, and guarantees that (a) every returned value is a proposed value, (b) if only one value is proposed then this value must be committed, (c) if a process commits a value v, then every process that returns adopts or commits v, and (d) every correct process returns. The commit-adopt abstraction can be implemented wait-free [7]. In the protocol, a process that is not a resolver takes a finite number of steps and then either returns with a value, or waits on one posted in register D by another process or by a resolver. A process that waits for an output (lines 4-6) considers the agreement protocol stuck. An agreement protocol for which a value was posted in D is called resolved.

Lemma 1. The algorithm in Figure 1 implements RAP.
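RAP builds on the commit-adopt abstraction. The paper does not give commit-adopt code; the following is a sketch of the standard two-phase read-write implementation in the spirit of [7], run here under one fully sequential schedule (the abstraction's guarantees hold under arbitrary interleaving of the register steps):

```python
class CommitAdopt:
    """Two-phase commit-adopt over shared arrays A and B.  This is a
    standard sketch, not code from the paper; the sequential demo
    below is only one possible schedule."""
    def __init__(self, n):
        self.A = [None] * n   # phase-1 shared registers
        self.B = [None] * n   # phase-2 shared registers

    def propose(self, i, v):
        self.A[i] = v
        seen = [x for x in self.A if x is not None]
        proposal = (all(x == v for x in seen), v)  # True = "saw no conflict"
        self.B[i] = proposal
        seen2 = [p for p in self.B if p is not None]
        if all(ok and val == v for (ok, val) in seen2):
            return ("commit", v)
        for ok, val in seen2:
            if ok:
                return ("adopt", val)  # someone may have committed val
        return ("adopt", v)

ca = CommitAdopt(3)
print(ca.propose(0, "x"))   # unopposed first proposer commits "x"
print(ca.propose(1, "y"))   # later conflicting proposers adopt "x"
print(ca.propose(2, "y"))
```

Property (c) is visible in the run: once "x" is committed, every later proposer that sees the conflict still adopts "x" rather than its own value.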

6 From Wait-Freedom to L-Resilience

Suppose that TL is weakly wait-free solvable and let AL be the corresponding wait-free protocol. We show that weak wait-free solvability of TL implies weak L-resilient solvability of T by presenting an algorithm A that uses AL to solve T in every L-resilient execution.


Shared variables: Rj, j = 1, . . . , n, initially ⊥
Local variables: Sj, j = 1, . . . , h(Π, L), initially ∅
                 ℓj, j = 1, . . . , h(Π, L), initially 0

9   Ri := input value of T
10  wait until snapshot(R1, . . . , Rn) contains inputs for some set in L
11  while true do
12      S := {ps ∈ Π, Rs ≠ ⊥}   {the current participating set}
13      if pi ∈ H_S then   {H_S is deterministically chosen in HS(S, L)}
14          m := the index of pi in H_S
15          RAP_m^ℓm.resolve()
16      for each j = 1, . . . , |H_S| do
17          if ℓj = 0 then
18              Sj := S
19          take one more step of RAP_j^ℓj.propose(Sj)
20          if RAP_j^ℓj.propose(Sj) returns v then
21              (flag, Sj) := CA_j^ℓj.propose(v)
22              if (flag = commit) then
23                  return({(s, Rs)}ps∈Sj)   {return the set of inputs of processes in Sj}
24              ℓj := ℓj + 1

Fig. 2. The doorway protocol: the code for each process pi

First we describe the doorway protocol (DW), the only L-dependent part of our transformation. The responsibility of DW is to collect at each process a subset of the inputs of T so that all the collected subsets constitute a legitimate input vector for task TL (property (1) in Section 4). The doorway protocol does not require the knowledge of T or TL and depends only on L. In contrast, the second part of the transformation, described in Section 6.2, does not depend on L and is implemented by simply invoking the wait-free task TL with the inputs provided by DW.

6.1 The Doorway Protocol

Formally, a DW protocol ensures that in every L-resilient execution with an input vector I ∈ I, every correct participant eventually obtains a set of inputs of T so that the resulting input vector I′ of TL complies with property (1) in Section 4 with respect to I. The algorithm implementing DW is presented in Figure 2. Initially, each process pi waits until it collects inputs for a set of participating processes that includes at least one live set. Note that different processes may observe different participating sets. Every participating set S is associated with H_S ∈ HS(S, L), some deterministically chosen hitting set of (S, L). We say that H_S is a resolver set: if S is the participating set, then we initiate |H_S| parallel sequences of agreement protocols with resolvers. Each sequence of agreement protocols can


return at most one value, and we guarantee that, eventually, every sequence is associated with a distinct resolver in H_S. In every such sequence j, each process pi sequentially goes through an alternation of RAPs and CAs (see Section 5): RAP_j^1, CA_j^1, RAP_j^2, CA_j^2, . . .. The first RAP is invoked with the initially observed set of participants, and each next CA (resp., RAP) takes the output of the previous RAP (resp., CA) as an input. If some CA_j^ℓ returns (commit, v), then pi returns v as an output of the doorway protocol.

Lemma 2. In every L-resilient execution of the algorithm in Figure 2 starting with an input vector I, every correct process pi terminates with an output value I′[i], and the resulting vector I′ complies with property (1) in Section 4 with respect to I.

6.2 Solving T through the Doorway

Given the DW protocol described above, it is straightforward to solve T by simply invoking AL with the inputs provided by DW. Thus:

Theorem 2. Task T is weakly L-resiliently solvable if TL is weakly wait-free solvable.

7 From L-Resilience to Wait-Freedom

Suppose T is weakly L-resiliently solvable, and let A be the corresponding protocol. We describe a protocol AL that solves TL by wait-free simulating an L-resilient execution of A. For pedagogical reasons, we first present a simple abstract simulation (AS) technique. AS captures the intuition that a group of simulators sharing the initial view of the set of participating simulated codes should appear as a single simulator. Therefore, an arbitrary number of simulators starting with j distinct initial views should be able to simulate a (j − 1)-resilient execution. Then we describe our specific simulation and show that it is an instance of AS, and thus it indeed generates a (j − 1)-resilient execution of A, where j is the number of distinct inputs of TL. By the properties of TL, we immediately obtain a desired L-resilient execution of A.

7.1 Abstract Simulation

Suppose that we want to simulate a given n-process protocol, with the set of codes {code1, . . . , coden}. Every instruction of the simulated codes (read or write) is associated with a unique position in N. E.g., we can enumerate the instructions as follows: the first instructions of each simulated code, then the second instructions of each simulated code, etc.¹

¹ In fact, only read instructions of a read-write protocol need to be simulated, since these are the only steps that may trigger more than one state transition of the invoking process [2,4].
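The breadth-first enumeration of positions described above can be written down as a simple numbering function (illustrative, not from the paper):

```python
def position(cmd_index, code_id, n):
    """Breadth-first numbering: round r of the simulation contains the
    r-th command of each of the n simulated codes, so command number
    cmd_index of code code_id receives a unique position in N."""
    return cmd_index * n + code_id

# With n = 3 codes: round 0 occupies positions 0..2, round 1 positions 3..5.
print([position(c, i, 3) for c in range(2) for i in range(3)])  # → [0, 1, 2, 3, 4, 5]
```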


A state of the simulation is a map of the set of positions to colors: every position can have one of three colors, U (unvisited), IP (in progress), or V (visited). Initially, every position is unvisited. The simulators share a function next that maps every state to the next unvisited position to simulate. Accessing an unvisited position by a simulator results in changing its color to IP or V.

Fig. 3. State transitions of a position in AS: U → IP when different states are concurrently proposed; U → V under the adversary, or when identical states are concurrently proposed; IP → V under the adversary

The state transitions of a position are summarized in Figure 3, and the rules the simulation follows are described below:

(AS1) Each process takes an atomic snapshot of the current state s and goes to position next(s), proposing state s. For each state s, the color of next(s) in state s is U.
– If an unvisited position is concurrently accessed by two processes proposing different states, then it is assigned color IP.
– If an unvisited position is accessed by every process proposing the same state, it may only change its color to V.
– If the accessed position is already V (a faster process accessed it before), then the process leaves the position unchanged, takes a new snapshot, and proceeds to the next position.

(AS2) At any point in the simulation, the adversary may take an in-progress (IP) position and atomically turn it into V, or take a set of unvisited (U) positions and atomically turn them into V.

(AS3) Initially, every position is assigned color U. The simulation starts when the adversary changes the colors of some positions to V.
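The transitions allowed by Figure 3 and rules (AS1)-(AS2) amount to a tiny state machine; in the encoding below, the event labels ("diverging", "identical", "adversary") are our own shorthand for the causes listed in the rules:

```python
# Permitted color transitions of a position (cf. Figure 3):
#   U  -> IP : concurrent accesses proposing different states (AS1)
#   U  -> V  : accesses with identical proposals, or the adversary (AS1/AS2)
#   IP -> V  : the adversary resolves an in-progress position (AS2)
def step(color, event):
    """Apply one event to a position's color."""
    if color == "U":
        return "IP" if event == "diverging" else "V"
    if color == "IP" and event == "adversary":
        return "V"
    return color  # V is absorbing; IP changes only via the adversary
```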

We measure the progress of the simulation by the number of positions turning from U to V . Note that by changing U or IP positions to V , the adversary can potentially hamper the simulation, by causing some U positions to be accessed with diﬀerent states and thus changing their colors to IP . However, the following invariant is preserved: Lemma 3. If the adversary is allowed at any state to change the colors of arbitrarily many IP positions to V , and throughout the simulation has j chances to atomically change any set of U positions to V , then at any time there are at most j − 1 IP positions.


7.2 Solving TL through AS

Now we show how to solve TL by simulating a protocol A that weakly L-resiliently solves T. First, we describe our simulation and show that it instantiates AS, which allows us to apply Lemma 3. Every simulator si ∈ {s1, . . . , sn} posts its input in the shared memory and then continuously simulates participating codes in {code1, . . . , coden} of algorithm A in the breadth-first manner: the first command of every participating code, the second command of every participating code, etc. (A code is considered participating if its input value has been posted by at least one simulator.) The procedure is similar to BG-simulation, except that the result of every read command in the code is agreed upon through a distinct RAP instance. Simulator si is statically assigned to be the only resolver of every read command in codei. The simulated read commands (and associated RAPs) are treated as positions of AS. Initially, all positions are U (unvisited). The outcome of accessing a RAP instance of a position determines its color. If the RAP is resolved (a value was posted in D in line 3 or 5), then it is given color V (visited). If the RAP is found stuck (waiting for an output in lines 4-6) by some process, then it is given color IP (in progress). Note that no RAP accessed with identical proposals can get stuck (property (ii) in Section 5). After accessing a position, the simulator chooses the first not-yet-executed command of the next participating code in the round-robin manner (function next). For the next simulated command, the simulator proposes its current view of the simulated state, i.e., the snapshot of the results of all commands simulated so far (AS1). Further, if a RAP of codei is observed stuck by a simulator (and thus is assigned color IP), but later gets resolved by si, we model it as the adversary spontaneously changing the position's color from IP to V.
Finally, by the properties of RAP, a position can get color IP only if it is concurrently accessed with diverging states (AS2). We also have n positions corresponding to the input values of the codes, initially unvisited. If an input for a simulated process pi is posted by a simulator, the initial position of codei turns into V. This is modeled as the intrusion of the adversary, and if simulators start with j distinct inputs, then the adversary is given j chances to atomically change sets of U positions to V. The simulation starts when the first set of simulators post their inputs and concurrently take identical snapshots (AS3). Therefore, our simulation is an instance of AS, and thus we can apply Lemma 3 to prove the following result:

Lemma 4. If the number of distinct values in the input vector of TL is j, then the simulation above blocks at most j − 1 simulated codes.

The simulated execution terminates when some simulator observes outputs of T for at least one participating live set. Finally, using the properties of the inputs to task TL (Section 4), we derive that eventually some participating live set of simulated processes obtains outputs. Thus, using Theorem 2, we obtain:


Theorem 3. T is weakly L-resiliently solvable if and only if TL is weakly wait-free solvable.

8 Related Work

The equivalence between t-resilient task solvability and wait-free task solvability was initially established for colorless tasks in [2,4], and then extended to all tasks in [8]. In this paper, we consider a wider class of assumptions than simply t-resilience, which can be seen as a strict generalization of [8]. Generalizing t-resilience, Junqueira and Marzullo [18] considered the case of dependent failures and proposed describing the allowed executions through cores and survivor sets, which roughly translate to our hitting sets and live sets. Note that the set of survivor sets (or, equivalently, cores) exhaustively describes only superset-closed adversaries. More general adversaries, introduced by Delporte et al. [5], are defined as a set of exact correct sets. It is shown in [5] that the power of an adversary A to solve colorless tasks is characterized by A's disagreement power, the highest k such that k-set agreement cannot be solved assuming A: a colorless task T is solvable with adversary A of disagreement power k if and only if it is solvable k-resiliently. Herlihy and Rajsbaum [15] (concurrently and independently of this paper) derived this result for a restricted set of superset-closed adversaries with a given core size using elements of modern combinatorial topology. Theorem 1 in this paper derives this result directly, using very simple algorithmic arguments.

Considering only colorless tasks is a strong restriction, since such tasks allow for definitions that only depend on sets of inputs and sets of outputs, regardless of which processes actually participate. (Recall that for colorless tasks, solvability and our weak solvability are equivalent.) The results of this paper hold for all tasks. On the other hand, as in [15], we only consider the class of superset-closed adversaries. This filters out some popular liveness properties, such as obstruction-freedom [13]. Thus, our contributions complement but do not contain the results in [5].
A protocol similar to our RAP was earlier proposed in [17].

9

Side Remarks and Open Questions

Doorways and iterated phases. Our characterization shows an interesting property of weak L-resilient solvability: To solve a task T weakly L-resiliently, we can proceed in two logically synchronous phases. In the ﬁrst phase, processes wait to collect “enough” input values, as prescribed by L, without knowing anything about T . Logically, they all ﬁnish the waiting phase simultaneously. In the second phase, they all proceed wait-free to produce a solution. As a result, no process is waiting on another process that already proceeded to the waitfree phase. Such phases are usually referred to as iterated phases [3]. In [8], some processes are waiting on others to produce an output and consequently the characterization in [8] does not have the iterated structure. L-resilience and general adversaries. The power of a general adversary of [5] is not exhaustively captured by its hitting set. In a companion paper [11], we propose a simple characterization of the set consensus power of a general adversary

202

E. Gafni and P. Kuznetsov

A based on the hitting set sizes of its recursively proper subsets. Extending our equivalence result to general adversaries and getting rid of the weak solvability assumption are two challenging open questions.

References 1. Afek, Y., Attiya, H., Dolev, D., Gafni, E., Merritt, M., Shavit, N.: Atomic snapshots of shared memory. J. ACM 40(4), 873–890 (1993) 2. Borowsky, E., Gafni, E.: Generalized FLP impossibility result for t-resilient asynchronous computations. In: STOC, pp. 91–100. ACM Press, New York (May 1993) 3. Borowsky, E., Gafni, E.: A simple algorithmically reasoned characterization of wait-free computation (extended abstract). In: PODC 1997: Proceedings of the Sixteenth Annual ACM Symposium on Principles of Distributed Computing, pp. 189–198. ACM Press, New York (1997) 4. Borowsky, E., Gafni, E., Lynch, N.A., Rajsbaum, S.: The BG distributed simulation algorithm. Distributed Computing 14(3), 127–146 (2001) 5. Delporte-Gallet, C., Fauconnier, H., Guerraoui, R., Tielmann, A.: The disagreement power of an adversary. In: Keidar, I. (ed.) DISC 2009. LNCS, vol. 5805, pp. 8–21. Springer, Heidelberg (2009) 6. Fischer, M.J., Lynch, N.A., Paterson, M.S.: Impossibility of distributed consensus with one faulty process. J. ACM 32(2), 374–382 (1985) 7. Gafni, E.: Round-by-round fault detectors (extended abstract): Unifying synchrony and asynchrony. In: Proceedings of the 17th Symposium on Principles of Distributed Computing (1998) 8. Gafni, E.: The extended BG-simulation and the characterization of t-resiliency. In: STOC, pp. 85–92 (2009) 9. Gafni, E., Koutsoupias, E.: Three-processor tasks are undecidable. SIAM J. Comput. 28(3), 970–983 (1999) 10. Gafni, E., Kuznetsov, P.: L-resilient adversaries and hitting sets. CoRR, abs/1004.4701 (2010), http://arxiv.org/abs/1004.4701 11. Gafni, E., Kuznetsov, P.: Turning adversaries into friends: Simpliﬁed, made constructive, and extended. In: OPODIS (2011) 12. Herlihy, M.: Wait-free synchronization. ACM Trans. Prog. Lang. Syst. 13(1), 123– 149 (1991) 13. Herlihy, M., Luchangco, V., Moir, M.: Obstruction-free synchronization: Doubleended queues as an example. In: ICDCS, pp. 522–529 (2003) 14. 
Herlihy, M., Rajsbaum, S.: The decidability of distributed decision tasks (extended abstract). In: STOC, pp. 589–598 (1997) 15. Herlihy, M., Rajsbaum, S.: The topology of shared-memory adversaries. In: PODC (2010) 16. Herlihy, M., Shavit, N.: The topological structure of asynchronous computability. J. ACM 46(2), 858–923 (1999) 17. Imbs, D., Raynal, M.: Visiting gafni’s reduction land: From the bg simulation to the extended bg simulation. In: SSS, pp. 369–383 (2009) 18. Junqueira, F., Marzullo, K.: A framework for the design of dependent-failure algorithms. Concurrency and Computation: Practice and Experience 19(17), 2255–2269 (2007) 19. Loui, M., Abu-Amara, H.: Memory requirements for agreement among unreliable asynchronous processes. Advances in Computing Research 4, 163–183 (1987)

Load Balanced Scalable Byzantine Agreement through Quorum Building, with Full Information Valerie King1 , Steven Lonargan1, Jared Saia2 , and Amitabh Trehan1 1

Department of Computer Science, University of Victoria, P.O. Box 3055, Victoria, BC, Canada V8W 3P6 [email protected], {sdlonergan,amitabh.trehaan}@gmail.com 2 Department of Computer Science, University of New Mexico, Albuquerque, NM 87131-1386 [email protected] Abstract. We address the problem of designing distributed algorithms for large scale networks that are robust to Byzantine faults. We consider a message passing, full information model: the adversary is malicious, controls a constant fraction of processors, and can view all messages in a round before sending out its own messages for that round. Furthermore, each bad processor may send an unlimited number of messages. The only constraint on the adversary is that it must choose its corrupt processors at the start, without knowledge of the processors’ private random bits. A good quorum is a set of O(log n) processors, which contains a majority of good processors. In this paper, we give a synchronous algorithm ˜ √n) bits of communication per which uses polylogarithmic time and O( processor to bring all processors to agreement on a collection of n good quorums, solving Byzantine agreement as well. The collection is balanced in that no processor is in more than O(log n) quorums. This yields the first solution to Byzantine agreement which is both scalable and loadbalanced in the full information model. The technique which involves going from situation where slightly more than 1/2 fraction of processors are good and and agree on a short string with a constant fraction of random bits to a situation where all good processors agree on n good quorums can be done in a fully asynchronous model as well, providing an approach for extending the Byzantine agreement result to this model.

1

Introduction

The last ﬁfteen years have seen computer scientists slowly come to terms with the following alarming fact: not all users of the Internet can be trusted. While this fact is hardly surprising, it is alarming. If the size of the Internet is unprecedented in the history of engineered systems, then how can we hope to address the challenging problem of scalability and also the challenging problem of resistance to malicious users?

This research was partially supported by NSF CAREER Award 0644058, NSF CCR0313160, AFOSR MURI grant FA9550-07-1-0532, and NSERC.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 203–214, 2011. c Springer-Verlag Berlin Heidelberg 2011

204

V. King et al.

Recent work attempts to address both of these problems concurrently. In the last few years, almost everywhere Byzantine agreement, i.e., coordination between all but a o(1) fraction of processors, was shown to be possible with no more than polylogarithmic bits of communication per processor and polylogarithmic time [13]. More recently, scalable everywhere agreement was shown to be possible if a small set of processors took on the brunt of each communicating Ω(n3/2 ) bits to the remaining processors [11], or if private channels are assumed [12]. In this paper, we give the ﬁrst load-balanced, scalable method for agreeing on a bit in the synchronous, full information model. In particular, our algorithm ˜ √n) bits. Our technique also yields an requires each processor to send only O( agreement on a collection of n good quorum gateways (referred to as quorums from now on), that is, sets of processors of size O(log n) each of which contains a majority of good processors, and a 1-1 mapping of processors to quorums. The collection is balanced in that no processor is in more than O(log n) quorums. Our usage of the quorum terminology is similar to that in the peer-to-peer literature [17,6,1,3,5,8], where quorums are of O(log n) size each having a majority of good processors, and allow for containment of adversarial behavior via majority ﬁltering. Quorums are useful in an environment with malicious processors as they can act as a gateway to ﬁlter messages from by bad processors. For example, a bad processor x can be limited in the number of messages it sends if other processors only accept messages sent by a majority of processors in x’s quorum, and the majority only agree to forward a limited number of messages from x. The number of bits of communication required per processor is polylogarith˜ √n) per processor for mic to bring all but o(1) processors to agreement and O( everywhere agreement on the composition of the n quorums. 
Our result is with an adversary that controls up to a 1/3 − fraction of processors, for any ﬁxed > 0, and which has full information, i.e., it knows the content of all messages passed between good processors. However, the adversary is non-adaptive, that is, it cannot choose dynamically which processors to corrupt based on its observations of the protocol’s execution. Bad processors are allowed to send an unlimited number of bits and messages, and defense against a denial of service attack is one of the features of our protocol. As an additional result, we present an asynchronous algorithm that can go from a situation where, for any positive constant γ, 1/2+γ fraction of processors are good and agree on a single string of length O(log n) with a constant fraction of random bits to a situation where all good processors agree on n good quorums. This algorithm is load-balanced in that it requires each processor to send only ˜ √n) bits, and the resulting quorums are balanced in that no processor is in O( more than O(log n) quorums. 1.1

Methodology

Our synchronous protocol builds on a previous protocol which brings all but o(1) processors to agreement on a set of s = O(log n) processors of which no more than a 1/3 − fraction are bad, using a sparse overlay network [14]. Being few in number, these processors can run a heavyweight protocol requiring all-

Load Balanced Scalable Byzantine Agreement

205

to-all communication to also agree on a string globalstr which contains a bit (or multiple bits) from each processor, such that a 2/3 + fraction of the bits are randomly set. This string can be communicated scalably to almost all processors using a communication tree formed as a byproduct of the protocol (See [13,14]). When a clear majority of good processors agree on a value, a processor should be able to learn that value, with high probability, by polling O(log n) processors. However the bad processors can thwart this approach by ﬂooding all processors with requests. Even if there are few bad processors, in the full information model, the bad processors can target processors on speciﬁc good processors’ poll lists to isolate these processors. To address this problem, we use the globalstr to build quorums to limit the number of eﬀective requests. We also restrict the design of poll lists, preserving enough randomness that they are reliable, but limit the adversary’s ability to target. Key to our work here is that we show the existence of an averaging sampler type function, H, which is known at the start by all the processors and which with high probability, when given an O(log n) length string with a constant fraction of random bits, and a processor ID, produces a good quorum for every ID. Our protocol then uses the fact that almost all processors agree on a collection of good quorums to bring all processors to agree on the string in a load balanced manner, and hence the collection of quorums. Similarly, to solve Byzantine agreement, a single bit agreed to by the initial small set can be agreed to by all the processors. We also show the existence of a function J which uses individually generated random strings and a processor’s ID to output a O(log n) poll list, so that the distribution of poll lists has desired properties. These techniques can be extended to the asynchronous model assuming a scalable implementation of [10]. 
That work shows that a set of size O(log log n) processors with 2/3 + good processors can be agreed to almost everywhere with probability 1 − o(1). Bringing these processors to agreement on a string with some random bits is trickier in the asynchronous full information model, where the adversary can prevent a fraction of the good processors from being heard based on their random bits. However, [10] shows that it is possible to bring such a set to agreement on a string with some randomness, which we show is enough to provide a good input to H. 1.2

Related Work

Several papers are mentioned above with related results. Most closely related is the algorithm in [11] which similarly starts with almost everywhere agreement on a bit and a small representative set of processors from [13,14] and produces everywhere agreement. However it is not load balanced, and does not create quorums or require the use of specially designed functions H and J. With private channels, load balancing in the presence of an adaptive adversary is achievable ˜ √n) bits of communication per processor [12]. with O( Awerbuch and Scheidler have done important work in the area of maintaining quorums [3,4,5,6]. They show how to scalably support a distributed hash table (DHT) using quorums of size O(log n), where processors are joining and leaving,

206

V. King et al.

a functionality our method does not support. The adversary they consider is nonadaptive in the sense that processors cannot spontaneously be corrupted; the adversary can only decide to cause a good processor to drop out and decide if an entering processor is good or bad. A critical diﬀerence between their results and ours is that while they can maintain a system that starts in a good conﬁguration, they cannot initialize such a system unless the starting processors are all good. This is because an entering processor must start by contacting a good processor in a good quorum. The quorum uses secret sharing to produce a random number to assign or reassign new positions in a sparse overlay network (using the cuckoo rule [15]). These numbers and positions are created using a method for secret sharing involving private channels and cryptographic hardness assumptions. In older work, Upfal, Dwork, Peleg and Pippenger addressed the problem of solving almost-everywhere agreement on a bounded degree network [16,7]. However, the algorithms described in these papers are not scalable. In particular, both algorithms require each processor to send at least a linear number of bits (and sometimes an exponential number). 1.3

Model

We assume a fully connected network of n processors, whose IDs are common knowledge. Each processor has a private coin. Communication channels are authenticated, in the sense that whenever a processor sends a message directly to another, the identity of the sender is known to the recipient, but we otherwise make no cryptographic assumptions. We assume a nonadaptive (sometimes called static) adversary. That is, the adversary chooses the set of tn bad processors at the start of the protocol, where t is a constant fraction, namely, 1/3 − for any positive constant . The adversary is malicious: bad processors can engage in any kind of deviations from the protocol, including false messages and collusion, or crash failures, and bad processors can send any number of messages. Moreover, the adversary chooses the input bits of every processor. The good processors are those that follow the protocol. We consider both synchronous and asynchronous models of communication. In the synchronous model, communication proceeds in rounds; messages are all sent out at the same time at the start of the round, and then received at the same time at the end of the same round; all processors have synchronized clocks. The time complexity is given by the number of rounds. In the asynchronous model, each communication can take an arbitrary and unknown amount of time, and there is no assumption of a joint clock as in the synchronous model. The adversary can determine the delay of each message and the order in which they are received. We follow [2] in deﬁning the running time of an asynchronous protocol as the time of execution, where the maximum delay of a message between the time it is sent and the time it is processed is assumed to be one unit. We assume full information: in the synchronous model, the adversary is rushing, that is, it can view all messages sent by the good processors in a round before the bad processors send their messages in the same round. 
In the case of the asynchronous model, the adversary can view any sent message before its delay is determined.

Load Balanced Scalable Byzantine Agreement

1.4

207

Results

We use the phrase with high probability to mean that an event happens with probability at least 1 − 1/nc , for any constant c and suﬃciently large n. We show: Theorem 1 (Synchronous Byzantine Agreement). Let n be the number of processors in a synchronous full information message passing model with a nonadaptive, rushing adversary that controls less than 1/3 − fraction of processors. For any positive constant , there exists a protocol which √ w.h.p. computes ˜ n) bits of comByzantine agreement, runs in polylogarithmic time, and uses O( munication per processor. This result follows from the application of the load balanced protocol in [14], followed by the synchronous protocol introduced in Section 3 of this paper. Theorem 2 (Almost everywhere to everywhere–asynchronous). Let n be the number of processors in a fully asynchronous full information message passing model with a nonadaptive adversary. Assume that (1/2 + γ)n good processors agree on a string of length O(log n) which has a constant fraction of random bits, and where the remaining bits are ﬁxed by a malicious adversary after seeing the random bits. Then for any positive constant γ, there exists a protocol which w.h.p. brings all good processors to agreement on n good quorums; runs ˜ √n) bits of communication per processor. in polylogarithmic time; and uses O( Furthermore, if we assume that same set of good processors have agreed on an input bit (to the Byzantine agreement problem) then this same protocol can bring all good processors to agreement on that bit. A scalable implementation of the protocol in [10] following the lines of [14] would create the conditions in the assumptions of this theorem with probability 1 − O(1/ log n) in polylogarithmic time and bits per processor with an adversary that controls less than 1/3 − fraction of processors. Then this theorem would yield an algorithm to solve asynchronous Byzantine agreement with probability 1 − O(1/ log n). 
The protocol is introduced in Section 4 of this paper.

2

Combinatorial Lemmas

Before presenting our protocol, we discuss here the properties of some combinatorial objects we shall use in our protocol. Let [r] denote the set of integers {1, . . . , r}, and [s]d the multisets of size d consisting of elements of [s]. Let H : [r] → [s]d be a function assigning multisets of size d to integers. We deﬁne the intersection of a multiset A and a set B to be the number of elements of A which are in B. H is a (θ, δ) sampler if at most a δ fraction of all inputs x have |H(x)∩S| > |S| + θ. d s c+1 c Let r = n . Let i ∈ [n ] and j ∈ [n]. Then we deﬁne H(i, j) to be H(in + j) and H(i, ∗) to be the collection of subsets H(i + 1), H(i + 2), ..., H(i + n).

208

V. King et al.

Lemma 1 ([[9], Lemma 4.7], [[18], Proposition 2.20]). For every s, θ, δ > 0 and r ≥ s/δ, there is a (θ, δ) sampler H : [r] → [s]d with d = O(log(1/δ)/θ2 ). A corollary of the proof of this lemma shows that if one increases the constant in the expression of d by a factor of c, we get the following: Corollary 1. Let H[r] be constructed by randomly selecting with replacement d elements of [s]. For every s, θ, δ, c > 0 and r ≥ s/δ, for d = O(log(1/δ)/θ 2 ), H(r) is a (θ, δ) sampler H : [r] → [s]d with probability 1 − 1/nc . Lemma 2. Let r = nc+1 and s = n. Let H : [r] → [s]d be constructed by randomly selecting with replacement d elements of [s]. Call an element y ∈ [s] overloaded by H if its inverse image under H contains more than a.d elements, for some ﬁxed element a ≥ 6. The probability that any y ∈ [s] is overloaded by any H(i, ∗) is less than 1/2, for d = O(log n) and a = O(1). Proof. Fix i. The probability that the size of the inverse image of y ∈ [s] ∈ H(i, ∗) is a times its expected size of d is less than 2−ad , for a ≥ 6, by a standard Chernoﬀ bound. The probability that for any i that any y ∈ [s] is overloaded is less than n(nc )2−ad < 1/2, by a union bound over all y ∈ [s] and all i for d = O(log n). Let S be any subset of [n]. A quorum or poll list is a subset of [n] of size O(log n) and a good quorum (resp., poll list) with respect to S ∈ [n] is a quorum (resp., poll list) with greater than 1/2 elements in S. Taking the union bound over the probabilities of the events given in the preceding corollary and lemma, and applying the probabilistic method yields the existence of a mapping with the desired properties: Lemma 3. For any constant c, there exists a mapping H : [nc+1 ] → [n]d such that for every i the inverse image of every element under H[i, ∗] is O(log n) and for any choice of any subset S ⊂ n of size at least 1/2 + n, with probability 1 − 1/nc over the choice of random numbers i ∈ [nc ], H[i, ∗] contains all good quorums. 
The following lemma is needed to show results about the polling lists, which are subsets of size O(log n) just like quorums, but are used for a diﬀerent purpose in the protocol. Lemma 4. There exists a mapping J : [1..., nc+1 ] → [n]d such that for any set of 1/2 + fraction of good processors in [1 . . . n]: 1. At least nc+1 − n elements of L are mapped to a good PollList. 2 2. For any L ⊂ [nc+1 ], |L | ≤ n, let R be any subset of [r], |R | ≤ |L |/e and let L be the inverse image of R under J. Then x∈L |J(x)∩R | < d|L |/2. Hence |L |/2 pollLists contain fewer than d elements in R . Proof. Part 1: The probability that a randomly constructed J has this property with probability greater than 1/2 follows from Lemma 3. Part 2: Let J be constructed randomly as in the previous proofs. Fix L , ﬁx R .

Load Balanced Scalable Byzantine Agreement

209

d|L | d|L| P r[ x∈L |J(x)∩R | ≥ d|L |] = d|L ≤ [(e/)(|R |/n)]d|L | ≤ | (|R |/n)

e−d|L | , for |R | ≤ n/e2 The number of ways of choosing a subset of size x and y from [nc ] and [n], resp., is bounded above by (enc /x)x ∗ (en/y)y = ex(c log n−log x+1)+y(log n−log y+1) < e2|L |c log n . The union bound over all sizes of x ≤ n and y is less than 1/2 for d > (2c log n)/ + 1/|L| Hence with probability less than 1/2, x∈L |J(x)∩R | > d|L | for all subsets L of size n or less in [nc ] and all subsets R of size |L |/e2 . Finally, by the union bound, a random J has both properites (1) and (2) with probability greater than 0. By the probabilistic method, there exists a function J with properties (1) and (2). 2.1

Using the Almost-Everywhere Agreement Protocol in [13,14]

We observe that this protocol which uses polylogarithmic bits of communication generates a representative set S of O(log n) processors which is agreed upon by all but O(1/ log n) fraction of good processors, and any message agreed upon by the processors is learned by all but O(1/ log n) fraction of good processors. Hence we start in our current work from the point where there is an b log n bit string globalstr agreed upon by all but O(1/ log n) fraction of good processors such that 2/3+ fraction of good processor in S have each generated c /b random bits (see below), and the remaining bits are generated by bad processors after seeing the bits of good processors. The ordering of the bits is independent of their value and is given by processor ID. globalstr is random enough: Lemma 5. With probability at least 1 − 1/nc , for suﬃciently large constant c and d = O log n), there is an H : [nc +1 ] → [n]d such that H(globalstr , ∗) is a collection of all good quorums. Proof. By Lemma 3 there are nc good choices for globalstr and n bad choices. We choose c to be a multiple of b which is greater than (3/2)c. Fix one bad choice string. The probability of the random bits matching this string is less than 2−(2/3c log n) and by a union bound, the probability of it matching any of the n bad strings is less than 1/nc .

3

Algorithm

In this section, we describe the protocol (Protocol 3.1) that reaches everywhere agreement from almost everywhere agreement. 3.1

Description of the Algorithm

Precondition: Each processor p starts with a hypothesis of the global string, candstr p ; this hypothesis may or may not equal globalstr . However, we make

210

V. King et al.

Given: Functions H (as described in Lemma 3), and J (as described in Lemma 4). Part I: Setting up Candidate Lists. 1: for each processor p: do 2: Select uniformly at random a subset, samplelist p , of processor IDs where √ |samplelist p | = c n log(n). 3: p.send (samplelist p , < candstr p >). 4: Set candlist p ← candstr p . 5: For each processor √ r that sent < candstr r > to p, add candstr r to candlist p with probability 1/ n. Part II: Setting up Requests through quorums. 1: for each processor p: do 2: p generates a random string rstr p . 3: For each candidate string s ∈ candlist p , p.send (H(s, p), < rstr p >). 4: Let polllist p ← J(rstr p , p) 5: if processor z ∈ H(candstr z , p) ) and z.accept(p, < rstr p >) then 6: for each processor y ∈ polllist p do 7: z.send (H(candstr z , y), < p → y >) 8: for Processor t ∈ H(candstr t , y) for any processor y do 9: Requestst (y) = {< p → y > | received from p ’s quorum H(candstr t , p)} Part III: Propagating globalstr to every processor. 1: for log n rounds in parallel do 2: if 0 < |Requestst (y)| < c log(n) then 3: for < p → y >∈ Requestst (y) do 4: t.send (y, < p → y >) 5: set Requestst (y) ← ∅. 6: if y.accept(H(candstr y , y), < p → y >) then 7: y.send (p, < candstr y >) 8: y.send (H(candstr y , p), < candstr y >) 9: when for processor p, count of processors in polllist p sending candidate string s over all rounds reaches a majority: Set candstr p ← s. 10: if when for processor z ∈ H(candstr z , p), count of processors in polllist p sending string s over all rounds reaches a majority then 11: for Processor y ∈ polllist p such that y did not yet respond do 12: z.send (H(candstr y , z), < Abort, p >) 13: if t ∈ H(candstr t , y) and t.accept(H(candstr t , p), < Abort, p > then 14: < p → y > is removed from Requestst (y).

Protocol 3. 1. Load balanced almost everywhere to everywhere

a critical assumption that at least 1/2 + γ fraction of processors are good and knowledgable i.e. their candstr equals globalstr . Actually we can ensure that 2/3 + − O(1/ log n) fraction of processors are good and knowledgeable using the almost-everywhere protocol from [13,14], but we need only have 1/2 + fraction for our protocol to work. Let candlist p be a list of candidate strings that p collects during the algorithm. Further, we call H(candstr q , p) a quorum of p (or p’s quorum) according to q. If

Load Balanced Scalable Byzantine Agreement

211

a processor p is sending to a quorum for x then it is assumed to mean that this is the quorum according to p, unless otherwise stated. Similarly, if t is sending conditional on its being in a particular quorum, then we mean this quorum according to t. Often, we shall denote a message within arrow brackets ( ), in particular < p → y > is the message that p has requested information from y. We call a quorum a legitimate quorum of p if it is generated by the globalstr i.e. H(globalstr , p). We also deﬁne the following primitives: v.send (X, m): Processor v sends message m to all processors in set X. v.accept (X, m): Processor v accepts the message m received from a majority of the processors in the set X (which could be a singleton set), otherwise It rejects it. Rejection of excess: Every processor will reject messages received in excess of the number of those messages dictated by the protocol in that round or stage of execution of the protocol. We assume each processor knows H and J. The key to achieving reliable communication channels through quorums is to use the globalstr √ . To begin, each processor p sends its candidate string candstr p directly to c n log n randomly selected processors (the samplelist p ). It then generates its own list of candidates candlist p for the √ globalstr including candstr p and every received string with probability 1/ n. This ensures that p has at least one globalstr in its list. The key to everywhere agreement is to be able to poll enough processors reliably so as to be able to learn globalstr . Part II sets up these polling requests. Each processor p generates a random string rstr p , which is used to generate p’s poll list polllist p using the function J by both p and its quorums. All the processors in the poll list are then contacted by p for their candidate string. In line 2, p determines its quorum for each of the strings in its candlist p and sends rstr p to the processors in the quorums. 
To prevent the adversary from targeting groups of processors, the quorums do not accept the poll list but rather the random string and then generate the poll list themselves. The important thing to note here is that even if p sent a message to its quorum the processors in the quorum will not accept the messages unless according to their own candidate string, they are in p’s quorum. Hence, it is important to note that w.h.p. at least one of these quorums is a legitimate quorum. Since p sends to at least one legitimate quorum, and the processors in this quorum will accept p’s request, this request will be forwarded. p’s quorum in turn contacts processor y’s quorum for each y that was in p’s poll list. The processors in y’s legitimate quorum gather all the requests meant for y in preparation for the next part of the protocol. Part III proceeds in log n rounds. The processors in y’s quorum only forward the received requests if they number less than c log n for some ﬁxed constant c . This prevents any processor from being overloaded. Once y accepts the requests (in accordance with y.accept), it will send its candidate string directly to p and also to p’ s quorum. When p gets the same string from a majority of processors in

212

V. King et al.

its poll list, it sets its own candidate string to this string. This new string w.h.p. is globalstr . There may be overloaded processors which have not yet answered p’s requests. To release the congestion, p will send abort messages to these quorums, which will now take the request oﬀ p’s request oﬀ their list. In each round, the number of satisﬁed processors falls by at least half, so that no more than log n rounds are needed. In this way, w.h.p. each processor decides the globalstr . 3.2

Proof of Correctness

The conditions for the correctness of the protocol given in Protocol 3.1 are stated as Lemma 10. To prove that, ﬁrst we show the following supporting lemmas. Lemma 6. W.h.p., at least one string in the candlist p of processor p is the globalstr . Proof. The proof of this √ follows from the birthday paradox. If there are n possible birthdays and O( n) children, two will likely share a birthday. Adding an O(log n) factor increases the probability for this to happen n times w.h.p. Lemma 7. For processor p and its random string rstr p , a majority of the processors y in polllist p are good and knowledgable, and they receive the request < p → y >. Proof. The poll list for processor p, polllist p is generated by the sampler J using p’s random string rstr p and p’s ID. By Lemma 4, a majority of polllist p is good and knowledgable. From Lemmas 5 and 6, processor p will send its message for its poll lists to at least one legitimate quorum. Since a majority of these are good and knowledgable, they will forward the message < p → y > for each processor y ∈ polllist p = J(rstr p , p) to at least one legitimate quorum of y. By Lemma 9, y shall accept the message. Observation 1. The messages sent by the bad processors, or good but not knowledgable processors (having candstr = globalstr ) do not aﬀect the outcome of the protocol. Proof. All communication in Parts 2 and 3 is veriﬁed by a processor against its quorums or poll list. Any communication received through the quorum or poll list is inﬂuential if only a majority of processors in them have sent it (either using the accept primitive or by counting total messages received). By Lemmas 6 and 7, majority of these lists are good and knowledgable. ˜ √n) bits. Lemma 8. For the protocol, any processor sends no more than O( Proof. Consider a good and knowledgeable processor p. In Part-I, line 3, p sends √ c n log n messages. 
For Part II of the algorithm, suppose p is in the quorum of a processor z; p forwards O(log² n) messages to the quorums of z's poll list. In Part III, p forwards only O(log n) requests to z. The cost of aborting is no more than the cost of sending. In addition, z answers no more than the number of requests that its quorum forwards. By the rejection-of-excess primitive, no extra messages are sent. Thus, p sends at most Õ(√n) bits over a run of the whole protocol.
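The birthday-paradox step behind Lemma 6 is easy to check empirically. The sketch below is illustrative only (not the authors' code; the function names are ours): it draws about 2√n uniform samples from n bins and measures how often a repeat occurs.

```python
import math
import random

def has_collision(n_bins: int, n_samples: int, rng: random.Random) -> bool:
    """Draw n_samples uniform values from n_bins bins; report any repeat."""
    seen = set()
    for _ in range(n_samples):
        v = rng.randrange(n_bins)
        if v in seen:
            return True
        seen.add(v)
    return False

def collision_rate(n_bins: int, trials: int = 2000, seed: int = 1) -> float:
    """Empirical probability that ~2*sqrt(n) samples contain a repeat."""
    rng = random.Random(seed)
    n_samples = 2 * int(math.isqrt(n_bins))
    hits = sum(has_collision(n_bins, n_samples, rng) for _ in range(trials))
    return hits / trials
```

With 2√n samples, the no-collision probability is roughly e^{-2}, so a repeat occurs in about 86% of trials; repeating the experiment O(log n) times independently boosts the success probability to w.h.p., which is the role of the extra O(log n) factor in the lemma.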

Load Balanced Scalable Byzantine Agreement


Lemma 9. By the end of Part III, for each p, a majority of p's poll list have received p's request to respond from their legitimate quorums.

Proof. Quorums forward requests provided their processors are not overloaded. We show by induction that if in round i there were x processors making requests to overloaded processors, then there are no more than x/2 such processors in round i + 1; thus after log n rounds there are no overloaded processors, and every processor answers its requests. Refer to Lemma 4: let R_i be the set of overloaded processors in round i (those that have more than (4/ε)d requests). Consider the set L_i of processors which made these requests; |L_i| ≥ (8/ε)|R_i|. By part 2 of the lemma, half the processors in L_i have less than an ε fraction of their poll lists in R_i, and their requests will be satisfied in the current round by a majority of good processors. Thus, no more than |L_i|/2 such processors are still making requests to processors in R_i, and hence to overloaded processors, in round i + 1.

Lemma 10. Let n be the number of processors in a synchronous full-information message-passing model with a nonadaptive rushing malicious adversary which controls less than a 1/3 − ε fraction of the processors, where more than a 1/2 + γ fraction of the processors are good and knowledgeable. For any positive constants ε, γ there exists a protocol such that w.h.p.: 1) at the end of the protocol, each good processor is also knowledgeable; 2) the protocol takes no more than O(log n) rounds in parallel, using no more than Õ(√n) messages per processor.

Proof. Part 1 follows from Lemma 7 and Observation 1; processor p hears back from its poll list and becomes knowledgeable. Part 2 follows directly from Lemma 9 (the protocol completes in O(log n) rounds) and Lemma 8.
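The halving argument in Lemma 9 yields the O(log n) round bound directly. A minimal worst-case sketch (the helper name is ours, not part of the protocol):

```python
import math

def rounds_until_quiet(n: int) -> int:
    """Worst case of the halving argument: the number of requesters
    facing overloaded processors at least halves in every round."""
    pending, rounds = n, 0
    while pending > 0:
        pending //= 2   # |L_{i+1}| <= |L_i| / 2
        rounds += 1
    return rounds
```

For every n ≥ 1 this returns ⌊log₂ n⌋ + 1, i.e., O(log n) rounds until no processor is still talking to an overloaded one.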

4

Asynchronous Version

The asynchronous protocol for Byzantine agreement relies on globalstr being generated by a scalable version of [10]. Such a string would have a reduced constant fraction of random bits, but there would still be sufficient randomness to guarantee the properties needed. Note that the reduction in the fraction of random bits in the string can be compensated for by increasing the length of the string in the proof of Lemma 5. The asynchronous protocol that brings all processors to agreement on globalstr can be constructed from the synchronous protocol by using the primitive asynch-accept instead of accept and by changes to Part III. The primitive v.asynch-accept(X, m) is defined as follows: processor v waits until |X|/2 + 1 messages which agree on m are received and then takes their value. In Part III, since there are no rounds, there is instead an end-of-round signal for each "round", determined when enough processors have decided. The quorums are organized in a tree structure which allows them to simulate the synchronous rounds by explicitly counting the number of processors that become knowledgeable. The round number is determined by the count of quorums which have received n/2 + 1 answers to the requests of their


V. King et al.

processor. The quorum of a processor monitors the number of requests received and only forwards the requests to a processor when the current number of requests received in a round is sufficiently small. The asynchronous protocol incurs an additional overhead of a log n factor in the number of messages.
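The asynch-accept primitive described above — wait until |X|/2 + 1 received messages agree on a value, then adopt it — can be sketched as follows. This is an illustrative Python sketch under the simplifying assumption that `messages` yields one value per distinct sender; the real primitive would also verify sender identities against X.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional

def asynch_accept(x_size: int, messages: Iterable[Hashable]) -> Optional[Hashable]:
    """Scan messages as they arrive and return the first value that
    |X|/2 + 1 senders agree on; None means 'not enough agreement yet'."""
    threshold = x_size // 2 + 1
    counts: Counter = Counter()
    for value in messages:
        counts[value] += 1
        if counts[value] >= threshold:
            return value   # a majority of X agrees on this value
    return None            # keep waiting for more messages
```

For example, with |X| = 7 the threshold is 4, so the primitive returns only after four matching messages have arrived, no matter how the adversary interleaves the rest.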

References

1. Aspnes, J., Shah, G.: Skip graphs. In: SODA, pp. 384–393 (2003)
2. Attiya, H., Welch, J.: Distributed Computing: Fundamentals, Simulations and Advanced Topics. John Wiley & Sons, Chichester (2004)
3. Awerbuch, B., Scheideler, C.: Provably secure distributed name service. In: Albers, S., Marchetti-Spaccamela, A., Matias, Y., Nikoletseas, S., Thomas, W. (eds.) ICALP 2009. LNCS, vol. 5556. Springer, Heidelberg (2009)
4. Awerbuch, B., Scheideler, C.: Robust distributed name service. In: Voelker, G.M., Shenker, S. (eds.) IPTPS 2004. LNCS, vol. 3279, pp. 237–249. Springer, Heidelberg (2005)
5. Awerbuch, B., Scheideler, C.: Towards a scalable and robust DHT. In: SPAA, pp. 318–327 (2006)
6. Awerbuch, B., Scheideler, C.: Towards a scalable and robust DHT. Theory Comput. Syst. 45(2), 234–260 (2009)
7. Dwork, C., Peleg, D., Pippenger, N., Upfal, E.: Fault tolerance in networks of bounded degree. In: STOC, pp. 370–379 (1986)
8. Fiat, A., Saia, J., Young, M.: Making Chord robust to Byzantine attacks. In: Brodal, G.S., Leonardi, S. (eds.) ESA 2005. LNCS, vol. 3669, pp. 803–814. Springer, Heidelberg (2005)
9. Gradwohl, R., Vadhan, S.P., Zuckerman, D.: Random selection with an adversarial majority. In: Dwork, C. (ed.) CRYPTO 2006. LNCS, vol. 4117, pp. 409–426. Springer, Heidelberg (2006)
10. Kapron, B.M., Kempe, D., King, V., Saia, J., Sanwalani, V.: Fast asynchronous Byzantine agreement and leader election with full information. In: SODA, pp. 1038–1047 (2008)
11. King, V., Saia, J.: From almost everywhere to everywhere: Byzantine agreement with Õ(n^{3/2}) bits. In: Keidar, I. (ed.) DISC 2009. LNCS, vol. 5805, pp. 464–478. Springer, Heidelberg (2009)
12. King, V., Saia, J.: Breaking the O(n²) bit barrier: Scalable Byzantine agreement with an adaptive adversary. In: PODC, pp. 420–429 (2010)
13. King, V., Saia, J., Sanwalani, V., Vee, E.: Scalable leader election. In: SODA, pp. 990–999 (2006)
14. King, V., Saia, J., Sanwalani, V., Vee, E.: Towards secure and scalable computation in peer-to-peer networks. In: FOCS, pp. 87–98 (2006)
15. Scheideler, C.: How to spread adversarial nodes? Rotate! In: STOC, pp. 704–713 (2005)
16. Upfal, E.: Tolerating a linear number of faults in networks of bounded degree. In: PODC, pp. 83–89 (1992)
17. Young, M., Kate, A., Goldberg, I., Karsten, M.: Practical robust communication in DHTs tolerating a Byzantine adversary. In: ICDCS, pp. 263–272. IEEE, Los Alamitos (2010)
18. Zuckerman, D.: Randomness-optimal oblivious sampling. Random Struct. Algorithms 11(4), 345–367 (1997)

A Necessary and Sufficient Synchrony Condition for Solving Byzantine Consensus in Symmetric Networks

Olivier Baldellon¹, Achour Mostéfaoui², and Michel Raynal²

¹ LAAS-CNRS, 31077 Toulouse, France
² IRISA, Université de Rennes 1, 35042 Rennes, France
[email protected], {achour,raynal}@irisa.fr

Abstract. Solving the consensus problem requires in one way or another that the underlying system satisfies synchrony assumptions. Considering a system of n processes where up to t < n/3 may commit Byzantine failures, this paper investigates the synchrony assumptions that are required to solve consensus. It presents a corresponding necessary and sufficient condition. Such a condition is formulated with the notions of a symmetric synchrony property and property ambiguity. A symmetric synchrony property is a set of graphs, where each graph corresponds to a set of bi-directional eventually synchronous links among correct processes. Intuitively, a property is ambiguous if it contains a graph whose connected components are such that it is impossible to distinguish a connected component that contains correct processes only from a connected component that contains faulty processes only. The paper then connects the notion of a symmetric synchrony property with the notion of an eventual bi-source, and shows that the existence of a virtual ◇[t + 1]bi-source is a necessary and sufficient condition to solve consensus in the presence of up to t Byzantine processes in systems with bi-directional links and message authentication. Finding necessary and sufficient synchrony conditions when links are timely in one direction only, or when processes cannot sign messages, remains an open (and very challenging) problem. Keywords: Asynchronous message system, Byzantine consensus, Eventually synchronous link, Lower bound, Signature, Symmetric synchrony property.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 215–226, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Byzantine consensus. A process has a Byzantine behavior when it behaves arbitrarily [15]. This bad behavior can be intentional (malicious behavior, e.g., due to an intrusion) or simply the result of a transient fault that altered the local state of a process, thereby modifying its behavior in an unpredictable way. We are interested here in the consensus problem in distributed systems prone to Byzantine process failures, whatever their origin. Consensus is an agreement problem in which each process first proposes a value and then decides on a value [15]. In a Byzantine failure context, the consensus problem is defined by the following properties: every non-faulty process decides a value (termination), no two non-faulty processes decide different values (agreement), and if all non-faulty processes propose the same


value, that value is decided (validity). (See [14] for a short introduction to Byzantine consensus.)

Aim of the paper. A synchronous distributed system is characterized by the fact that both processes and communication links are synchronous (or timely) [2,13,16]. This means that there are known bounds on process speed and message transfer delays. Let t denote the maximum number of processes that can be faulty in a system made up of n processes. In a synchronous system, consensus can be solved (a) for any value of t (i.e., t < n) in the crash failure model, (b) for t < n/2 in the general omission failure model, and (c) for t < n/3 in the Byzantine failure model [12,15]. Moreover, these bounds are tight. In contrast, when all links are asynchronous (i.e., when there is no bound on message transfer delays), it is impossible to solve consensus even if we consider the weakest failure model (namely, the process crash failure model) and assume that at most one process may be faulty (i.e., t = 1) [7]. It trivially follows that Byzantine consensus is impossible to solve in an asynchronous distributed system. As Byzantine consensus can be solved in a synchronous system and cannot be solved in an asynchronous one, a natural question comes to mind: "When considering the synchrony-to-asynchrony axis, what is the weakest synchrony assumption that allows Byzantine consensus to be solved?" This is the question addressed in the paper. To that end, the paper considers the synchrony captured by the structure and the number of eventually synchronous links among correct processes.

Related work. Several approaches to solve Byzantine consensus have been proposed. We consider here only deterministic approaches. One consists in enriching the asynchronous system (hence the system is no longer fully asynchronous) with a failure detector, namely, a device that provides processes with hints on failures [4].
Basically, in one way or another, a failure detector encapsulates synchrony assumptions. Failure detectors suited to Byzantine behavior have been proposed and used to solve Byzantine consensus (e.g., [3,8,9]). Another approach proposed to solve Byzantine consensus consists in directly assuming that some links satisfy a synchrony property ("directly" means that the synchrony property is not hidden inside a failure detector abstraction). This approach relies on the notion of a ◇[x + 1]bi-source (read "◇" as "eventual") introduced in [1]. Intuitively, this notion states that there is a correct process that has x bi-directional input/output links with other correct processes and that these links eventually behave synchronously [5,6]. (Our definition of a ◇[x + 1]bi-source is slightly different from the original definition introduced in [1]. The main difference is that it considers only eventually synchronous links connecting correct processes. It is precisely defined in Section 6.¹) Considering asynchronous systems with Byzantine processes without message authentication, it is shown in [1] that Byzantine consensus can be solved if the system has

¹ We consider only eventually synchronous links connecting correct processes because, due to Byzantine behavior, a synchronous link connecting a correct process and a Byzantine process can always appear to the correct process to be asynchronous.


a ◇[n − t]bi-source (all other links possibly being fully asynchronous). Moreover, the ◇[n − t]bi-source can never be explicitly known. This result has been refined in [11], which presents a Byzantine consensus algorithm for an asynchronous system that has a ◇[2t + 1]bi-source. Considering systems with message authentication, a Byzantine consensus algorithm is presented in [10] that requires only a ◇[t + 1]bi-source. As for Byzantine consensus in synchronous systems, all these algorithms assume t < n/3.

Fig. 1. The proof of the necessary and sufficient condition (Theorem 3). The figure depicts the following cycle of implications: ∃ G ∈ S with no virtual ◇[t + 1]bi-source ⇒ ∃ G ∈ S such that ∀ C ∈ G: |C| ≤ t (Lemma 3, Section 6) ⇒ synchrony property S is ambiguous (Theorem 2, Section 5) ⇒ consensus cannot be solved with property S (Theorem 1, Section 4) ⇒ ∃ G ∈ S with no virtual ◇[t + 1]bi-source (contrapositive of Ref. [10]).

Content of the paper. The contribution of the paper is the definition of a symmetric synchrony property that is necessary and sufficient to solve Byzantine consensus in asynchronous systems with message authentication. From a concrete point of view, this property is the existence of what we call a virtual ◇[t + 1]bi-source. A symmetric synchrony property S is a set of communication graphs such that (a) each graph specifies a set of eventually synchronous bi-directional links connecting correct processes and (b) this set of graphs satisfies specific additional properties that give S a particular structure. A synchrony property may or may not be ambiguous. Intuitively, it is ambiguous if it contains a graph whose connected components are such that there are executions in which it is impossible to distinguish a component with correct processes only from a connected component with faulty processes only. (These notions are formally defined in the paper.) A synchrony property S for a system of n processes where at most t processes may be faulty is called an (n, t)-synchrony property. The paper first shows that, assuming a property S, it is impossible to solve consensus if S is ambiguous. It is then shown that, if consensus can be solved when the actual communication graph is any graph of S (we then say "S is satisfied"), then any graph of S has at least one connected component whose size is at least t + 1. The paper then relates the ambiguity of an (n, t)-synchrony property S to the size x of a virtual ◇[x]bi-source. These results are schematically represented in Figure 1, from which follows the fact that a synchrony property S allows Byzantine consensus to be solved despite up to t Byzantine processes in a system with message authentication if and only if S is not ambiguous.

Road map. The paper is made up of 7 sections. Section 2 presents the underlying asynchronous Byzantine computation model.
Section 3 defines the notion of a synchrony property S and the associated notion of ambiguity. As already indicated, a synchrony


property bears on the structure of the eventually synchronous links connecting correct processes. Then, Section 4 shows that an ambiguous synchrony property S does not allow consensus to be solved (Theorem 1). Section 5 relates the size of the connected components of the graphs of an (n, t)-synchrony property S to the ambiguity of S (Theorem 2). Section 6 establishes the main result of the paper, namely a necessary and sufficient condition for solving Byzantine consensus in systems with message authentication. Finally, Section 7 concludes the paper.

2 Computation Model

Processes. The system is made up of a finite set Π = {p1, ..., pn} of n > 1 processes that communicate by exchanging messages through a communication network. Processes are assumed to be synchronous in the sense that local computation times are negligible with respect to message transfer delays; local processing times are considered to be 0.

Failure model. Up to t < n/3 processes can exhibit a Byzantine behavior. A Byzantine process is a process that behaves arbitrarily: it can crash, fail to send or receive messages, send arbitrary messages, start in an arbitrary state, perform arbitrary state transitions, etc. Moreover, Byzantine processes can collude to "pollute" the computation. Yet, it is assumed that they do not control the network: they cannot corrupt the messages sent by non-Byzantine processes, and the schedule of message delivery is uncorrelated with Byzantine behavior. A process that exhibits a Byzantine behavior is called faulty; otherwise, it is correct or non-faulty.

Communication network. Each pair of processes pi and pj is connected by a reliable bi-directional link denoted (pi, pj). This means that, when a process receives a message, it knows which process sent it. A link can be fully asynchronous or eventually synchronous. The bi-directional link connecting a pair of processes pi and pj is eventually synchronous if there is a finite (but unknown) time τ after which there is an upper bound on the time that elapses between the sending and the reception of a message sent on that link (hence an eventually synchronous link is eventually synchronous in both directions). If no such bound exists, the link is fully asynchronous. If τ = 0 and the bound is known, the link is synchronous.

Message authentication. When the system provides the processes with message authentication, a Byzantine process can only fail to relay messages or send bad messages of its own.
When it forwards a message received from another process, it cannot alter its content.

Notation. Given a set of processes that defines which processes are correct, let H ⊆ Π × Π denote the set of eventually synchronous bi-directional links connecting these correct processes. (This means that this communication graph has no edge incident to a faulty process; moreover, it is possible that some pairs of correct processes are not connected by an eventually synchronous bi-directional link.)


Given a set of correct processes and an associated graph H as defined above, the previous system model is denoted AS_{n,t}[H]. More generally, let S = {H1, ..., Hℓ} be a set of sets of eventually synchronous bi-directional links connecting correct processes. AS_{n,t}[S] denotes the system model (set of runs, see Section 3.2) in which the correct processes and the eventually synchronous bi-directional links connecting them are defined by H1, or H2, etc., or Hℓ.

3 Definitions

We consider only undirected graphs in the following. The aim of this section is to state a property that will be used to prove an impossibility result. Intuitively, a vertex represents a process, while an edge represents an eventually synchronous bi-directional link. Hence, the set of vertices of a graph G is Π and its set of edges is included in Π × Π.

3.1 (n, x)-Synchrony Property and Ambiguity

The formal definitions given in this section will be related to the processes and links of a system in the next section.

Definition 1. Let G = (Π, E) be a graph. A permutation π on Π defines a permuted graph, denoted π(G) = (Π, E′), i.e., ∀ a, b ∈ Π: (a, b) ∈ E ⇔ (π(a), π(b)) ∈ E′. All permuted graphs of G have the same structure as G; they differ only in the names of the vertices.

Definition 2. Let G1 = (Π, E1) and G2 = (Π, E2). G1 is included in G2 (denoted G1 ⊆ G2) if E1 ⊆ E2.

Definition 3. An (n, x)-synchrony property S is a set of graphs with n vertices such that ∀ G1 ∈ S we have:
– Permutation stability. If G2 is a permuted graph of G1, then G2 ∈ S.
– Inclusion stability. ∀ G2 such that G1 ⊆ G2, we have G2 ∈ S.
– x-Resilience. ∃ G0 ∈ S such that G0 ⊆ G1 and G0 has at least x isolated vertices (an isolated vertex is a vertex without neighbors).

The aim of an (n, x)-synchrony property is to capture a property on eventually synchronous bi-directional links. It is independent of process identities (permutation stability). Moreover, adding eventually synchronous links to a graph of an (n, x)-synchrony property S does not falsify it (inclusion stability). Finally, the fact that up to x processes are faulty cannot invalidate it (x-resilience). As an example, assuming n − t ≥ 3, "there are 3 eventually synchronous bi-directional links connecting correct processes" is an (n, x)-synchrony property. It includes all graphs G on n vertices that have 3 edges and x isolated vertices plus, for every such G, all graphs obtained by adding any number of edges to G.
Given a graph G = (Π, E) and a set of vertices C ⊂ Π, G \ C denotes the graph obtained from G by removing every edge (pi, pj) with pi ∈ C or pj ∈ C.
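The graph operations used in Definitions 1–4 (permuted graph, inclusion, G \ C, connected components) are straightforward to make concrete. Below is a minimal sketch, illustrative only, with graphs represented as frozensets of two-element frozensets (our own convention, not the paper's):

```python
def permute(graph, pi):
    """Permuted graph pi(G): rename each endpoint via the permutation pi (a dict)."""
    return frozenset(frozenset(pi[v] for v in e) for e in graph)

def included(g1, g2):
    """G1 included in G2 iff every edge of G1 is an edge of G2 (Definition 2)."""
    return g1 <= g2

def remove_component(graph, comp):
    """G \\ C: drop every edge incident to a vertex of C."""
    return frozenset(e for e in graph if not (e & comp))

def components(vertices, graph):
    """Connected components of (vertices, graph) via iterative DFS."""
    adj = {v: set() for v in vertices}
    for e in graph:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for v in vertices:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            seen.add(u)
            stack.extend(adj[u] - comp)
        comps.append(frozenset(comp))
    return comps
```

With these helpers, checking S-ambiguity of a concrete graph amounts to enumerating its components C and testing |C| ≤ x together with a membership test for G \ C in whatever finite description of S one has at hand.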


Definition 4. Let S be an (n, x)-synchrony property. S is ambiguous if it contains a graph G = (Π, E) whose every connected component C is such that (i) |C| ≤ x and (ii) G \ C ∈ S. Such a graph G is said to be S-ambiguous.

Intuitively, an (n, x)-synchrony property S is ambiguous if it contains a graph G that satisfies the property S in all runs where all the processes of any one connected component of G could be faulty (recall that at most x processes are faulty).

3.2 Algorithm and Runs Satisfying an (n, x)-Synchrony Property

Definition 5. An n-process algorithm A is a set of n automata such that a deterministic automaton is associated with each correct process. A transition of an automaton defines a step. A step corresponds to an atomic action. During a step, a correct process may send/receive a message and change its state according to its previous steps and the current state of its automaton. The steps of a faulty process can be arbitrary.

Definition 6. A run of an algorithm A in AS_{n,t}[G] (a system of n processes with at most t faulty processes and for which G defines the graph of eventually timely channels among correct processes) is a triple ⟨I, R, T⟩_{t,G} where I defines the initial state of each correct process, R is a (possibly infinite) sequence of steps of A (where at most t processes are faulty), and T is the increasing sequence of time values indicating the time instants at which the steps of R occurred. The sequence R is such that, for any message m, the reception of m occurs after its sending; the steps issued by every process occur in R in their issuing order; and, for any correct process pi, the steps of pi are as defined by its automaton.

Definition 7. E_{t,G}(A) denotes the set of runs of algorithm A in AS_{n,t}[G].

Definition 8. Given an (n, x)-synchrony property S, let E_S(A) = ∪_{t ≤ x, G ∈ S} E_{t,G}(A) (i.e., E_S(A) is the set of runs r = ⟨I, R, T⟩_{t,G} of A such that t ≤ x and G ∈ S). Recall that a synchrony property S is a set of graphs on Π.

Definition 9.
An (n, x)-synchrony property S allows an algorithm A to solve the consensus problem in AS_{n,x}[S] if every run in E_S(A) satisfies the validity, agreement, and termination properties that define the Byzantine consensus problem.

4 An Impossibility Result

Given an (n, t)-synchrony property S, this section shows that there is no algorithm A that solves the consensus problem in AS_{n,t}[H] if H is an S-ambiguous graph of S. This means that the synchrony assumptions captured by S are not powerful enough to allow consensus to be solved despite up to t faulty processes: no algorithm A can solve consensus for every underlying synchrony graph of an ambiguous synchrony property S.


4.1 A Set of Specific Runs

This section defines the set of runs in which the connected components (as defined by the eventually synchronous communication graph H) are asynchronous with respect to one another, and the set of faulty processes (if any) corresponds to a single connected component. The corresponding set of runs, denoted F(A, H), will then be used to prove the impossibility result.

Definition 10. Let A be an n-process algorithm and H a graph whose n vertices are processes and whose every connected component contains at most t processes. Let F(A, H) be the set of runs of A that satisfy the following properties:
– If pi and pj belong to the same connected component of H, then the bi-directional link (pi, pj) is eventually synchronous.
– If pi and pj belong to the same connected component of H, then both are correct or both are faulty.
– If pi and pj belong to distinct connected components of H, then, if pi is faulty, pj is correct.

4.2 An Impossibility

Let S be an ambiguous (n, t)-synchrony property, A an algorithm, and H the graph defining the eventually synchronous links among processes. The following lemma states that, if H is S-ambiguous, all runs r in F(A, H) belong to E_S(A).

Lemma 1. Let S be an (n, t)-synchrony property and H ∈ S. If H is S-ambiguous, then F(A, H) ⊆ E_S(A).

Proof. As it is S-ambiguous, H contains only connected components with at most t processes. It follows that the set F(A, H) is well defined. Let r ∈ F(A, H). Let C1, ..., Cm be the connected components of H. We can then define H0 = H (when no processes are faulty) and, for any i with 1 ≤ i ≤ m, Hi = H \ Ci (when the set of faulty processes corresponds to Ci). If all processes are correct in run r, we have r ∈ E_{t,H}(A). Moreover, if there is a faulty process in run r, by definition of F(A, H), the set of faulty processes corresponds to a connected component; let Ci be this connected component. We then have r ∈ E_{t,Hi}(A).
We have just shown that F(A, H) ⊆ ∪_{0 ≤ i ≤ m} E_{t,Hi}(A). As H is S-ambiguous, for any 1 ≤ i ≤ m we have Hi ∈ S. Moreover, due to the definition of H0 and the lemma assumption, we also have H0 = H ∈ S. Finally, as E_S(A) = ∪_{X ∈ S} E_{t,X}(A), we have F(A, H) ⊆ E_S(A), which proves the lemma.

Lemma 2. Let S be an ambiguous (n, t)-synchrony property and H an S-ambiguous graph. Whatever the algorithm A, there is a run r ∈ F(A, H) that does not solve consensus.

Proof. The proof is a reduction to the FLP impossibility result [7] (the impossibility of solving consensus despite even a single faulty process in a system in which all links are asynchronous). To that end, let us assume by contradiction that there is an algorithm


A that solves consensus among n processes p1, ..., pn despite the fact that up to t of them may be faulty, when the underlying eventually synchronous communication graph belongs to S (for example, an S-ambiguous graph H). This means that, by assumption, all runs r ∈ E_S(A) satisfy the validity, agreement, and termination properties that define the Byzantine consensus problem.

Fig. 2. A reduction to the FLP impossibility result. Each simulator qj (1 ≤ j ≤ m) simulates the processes of the connected component Cj of the processes p1, ..., pn.

Let C1, ..., Cm be the connected components of H, and let q1, ..., qm be a set of m processes (called simulators in the following). The proof consists in constructing a simulation in which the simulators q1, ..., qm solve consensus despite the fact that they are connected by asynchronous links and one of them may be faulty, thereby contradicting the FLP result (Figure 2). To that end, each simulator qj, 1 ≤ j ≤ m, simulates the processes of the connected component Cj it is associated with. Moreover, without loss of generality, let us assume that, for every component Cj made up of correct processes, these processes propose the same value vj. Such a simulation² of the processes p1, ..., pn (executing the Byzantine consensus algorithm A) by the simulators q1, ..., qm results in a run r ∈ F(A, H) (from the point of view of the processes p1, ..., pn). As (by assumption) the algorithm A is correct, the correct processes decide in run r. As (a) H ∈ S, (b) S is ambiguous, and (c) r ∈ F(A, H), it follows from Lemma 1 that r ∈ E_S(A), which means that r is a run in which the correct processes pi decide the same value v (and, if they all proposed the very same value w, we have v = w). It follows that, by simulating the processes p1, ..., pn that execute the consensus algorithm A, the m asynchronous processes q1, ..., qm (qj proposing value vj) solve consensus despite the fact that one of them may be faulty (the one associated with the faulty component Cj, if any). Hence, the simulators q1, ..., qm solve consensus despite the fact that one of them may be faulty, contradicting the FLP impossibility result, which concludes the proof of the lemma.

² The simulation, which is only sketched, is a very classical one. A similar simulation is presented in [15], in the context of synchronous systems, that extends the impossibility of solving Byzantine consensus from a set of n = 3 synchronous processes, one of which (t = 1) is Byzantine, to a set of n ≤ 3t processes. A similar simulation is also described in [16].


The following theorem is an immediate consequence of Lemmas 1 and 2.

Theorem 1. No ambiguous (n, t)-synchrony property allows Byzantine consensus to be solved in a system of n processes where up to t processes can be faulty.

Remark 1. Let us observe that the proof of the previous theorem does not depend on whether messages are signed. Hence, the theorem is valid for systems with or without message authentication.

Remark 2. The impossibility of solving consensus despite one faulty process in an asynchronous system [7] corresponds to the case where S is the (n, 1)-synchrony property that contains the edge-less graph.

5 Relating the Size of Connected Components and Ambiguity

Assuming a system with message authentication, let S be an (n, t)-synchrony property that allows consensus to be solved despite up to t Byzantine processes. This means that consensus can be solved for any eventually synchronous communication graph in S. It follows from Theorem 1 that S is not ambiguous. This section shows that if such a synchrony property S allows consensus to be solved, then any graph of S contains at least one connected component C whose size is greater than t (|C| > t).

Theorem 2. Let S be an (n, t)-synchrony property. If there is a graph G ∈ S such that none of its connected components has more than t vertices, then S is ambiguous.

Proof. Let G ∈ S be such that no connected component of G has more than t vertices. It follows from the t-resilience property of S that there is a graph G′ ∈ S included in G (i.e., both have the same vertices and the edges of G′ are also edges of G) that has at least t isolated vertices. Let us observe that G′ can be decomposed into m + t connected components C1, ..., Cm, γ1, ..., γt, where each Ci contains at most t vertices and each γi is a single vertex (top of Figure 3). Let us construct a graph G′′ as follows. G′′ is made up of the m connected components C1, ..., Cm plus another connected component, denoted Cm+1, including the t vertices γ1, ..., γt (bottom of Figure 3). Moreover, G′′ contains all the edges of G′ plus

Fig. 3. Construction of the graph G′′. Top: G′ has connected components C1, ..., Cm and isolated vertices γ1, ..., γt. Bottom: G′′ has connected components C1, ..., Cm plus the additional component Cm+1 built on γ1, ..., γt.


the new edges needed so that the connected component Cm+1 is a clique (i.e., a graph in which every pair of distinct vertices is connected by an edge). As G′ ∈ S and G′ ⊆ G′′, it follows from the inclusion stability property of S that G′′ ∈ S. The rest of the proof consists in showing that G′′ is S-ambiguous (from which the ambiguity of S follows).
– Let us first observe that, due to its very construction, each connected component C of G′′ contains at most t vertices.
– Let us now show that, for any connected component C of G′′, we have G′′ \ C ∈ S. (Recall that G′′ \ C is G′′ from which all edges incident to vertices of C have been removed.) We consider two cases.
• Case C = Cm+1. We then have G′′ \ C = G′. The fact that G′ ∈ S concludes the proof of the case.

Fig. 4. Using a permutation (top: the graph G′′, showing the components C1, . . . , Ci, . . . , Cm, the vertices δ1, . . . , δd of Ci, and the vertices γ1, . . . , γd, γd+1, . . . , γt of Cm+1; bottom: the graph Gi, with components C1, . . . , Cm, Cm+1)

• Case C = Ci for 1 ≤ i ≤ m. Let δ1, . . . , δd be the vertices of Ci and let Gi = G′′ \ Ci. According to the permutation stability property of S, there is a permutation π of the vertices of Gi such that G′ ⊆ π(Gi) (Figure 4). It then follows from the fact that S is a synchrony property that π(Gi) ∈ S, and consequently Gi ∈ S, which concludes the proof of the case and the proof of the theorem.

Taking the contrapositive of Theorem 1 and Theorem 2, we obtain the following corollary.

Corollary 1. If an (n, t)-synchrony property S allows consensus to be solved, then any graph of S contains at least one connected component whose size is at least t + 1.

6 A Necessary and Sufficient Condition

This section introduces the notion of a virtual ◇[x+1]bi-source and shows that the existence of a virtual ◇[t+1]bi-source is a necessary and sufficient condition to solve

A Symmetric Synchrony Condition for Solving Byzantine Consensus

the consensus problem in a system with message signatures and where up to t processes can commit Byzantine failures.

Definition 11. A ◇[x+1]bi-source is a correct process that has an eventually synchronous bi-directional link with x correct processes (not including itself).

From a structural point of view, a ◇[x+1]bi-source is a star made up of correct processes. (As already noticed, this definition differs from the usual one, in the sense that it considers only correct processes.)

Lemma 3. If a graph G has a connected component C of size at least x + 1, a ◇[x+1]bi-source can be built inside C.

Proof. Given a graph G that represents the eventually synchronous bi-directional links connecting correct processes, let us assume that G has a connected component C such that |C| ≥ x + 1. A star (◇[x+1]bi-source) can easily be built as follows. When a process p receives a message for the first time, it forwards it to all. Let us remember that, as messages are signed, a faulty process cannot corrupt the content of the messages it forwards; it can only omit to forward them. Let λ be the diameter of C and δ the eventual synchrony bound for message transfer delays. This means that, when we consider any two processes p, q ∈ C, λ × δ is an eventual synchrony bound for any message communicated inside the component C. Moreover, considering any process p ∈ C, the processes of C define a star structure centered at p such that, for any q ∈ C \ {p}, there is a virtual eventually synchronous link (bound λ × δ) made up of eventually synchronous links and correct processes of C, which concludes the proof of the lemma.
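The λ × δ bound used in the proof can be illustrated with a small sketch (the function names and the representation of the graph as a vertex → neighbour-list dict are ours, not the paper's): the diameter λ of C is the largest BFS eccentricity over its vertices, and λ × δ then bounds the delay of any virtual link routed through correct processes of C.

```python
from collections import deque

def eccentricity(adj, src):
    """Longest shortest-path distance (in hops) from src within its component."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

def virtual_link_bound(adj, component, delta):
    """Eventual synchrony bound lambda * delta for any virtual link inside C,
    where lambda is the diameter of the component C."""
    lam = max(eccentricity(adj, p) for p in component)
    return lam * delta
```

For instance, on a path of four correct processes the diameter is 3, so every virtual link is eventually synchronous with bound 3δ.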

The following definition generalizes the notion of a ◇[x+1]bi-source.

Definition 12. A communication graph G has a virtual ◇[x+1]bi-source if it has a connected component C of size at least x + 1.

Theorem 3. An (n, t)-synchrony property S allows consensus to be solved in an asynchronous system with message authentication, despite up to t Byzantine processes, if and only if any graph of S contains a virtual ◇[t+1]bi-source.

Proof. The sufficiency side follows from the algorithm described in [10], which presents and proves correct a consensus algorithm for asynchronous systems made up of n processes where (a) up to t processes may be Byzantine, (b) messages are signed, and (c) there is a ◇[t+1]bi-source (a ◇[t+1]bi-source in our terminology is a ◇[2t]bi-source in the parlance of [1,10]). For the necessity side, let S be a synchrony property such that at least one of its graphs does not contain a virtual ◇[t+1]bi-source. It follows from the contrapositive of Corollary 1 that S does not allow Byzantine consensus to be solved.

The following corollary is an immediate consequence of the previous theorem.

Corollary 2. The existence of a virtual ◇[t+1]bi-source is a necessary and sufficient condition to solve consensus (with message authentication) in the presence of up to t Byzantine processes.
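By Definition 12, the condition of Corollary 2 is a purely structural check: does the graph of eventually synchronous links between correct processes have a connected component with at least t + 1 vertices? A minimal sketch (hypothetical helper name; graphs as adjacency dicts):

```python
def has_large_component(adj, t):
    """True iff some connected component of the graph has at least t + 1
    vertices, i.e. the graph contains a virtual [t+1]bi-source (Definition 12)."""
    seen = set()
    for s in adj:
        if s in seen:
            continue
        # iterative DFS over one component, counting its vertices
        stack, size = [s], 0
        seen.add(s)
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if size >= t + 1:
            return True
    return False
```

A star of 4 correct processes satisfies the condition for t = 3, whereas two disjoint pairs do not for t = 2.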


7 Conclusion

This paper has presented a synchrony condition that is necessary and sufficient for solving consensus despite Byzantine processes in systems equipped with message authentication. This synchrony condition is symmetric in the sense that some links have to be eventually timely in both directions. Last but not least, finding necessary and sufficient synchrony conditions when links are timely in one direction only, or when processes cannot sign messages, remains an open (and very challenging) problem.

References

1. Aguilera, M.K., Delporte-Gallet, C., Fauconnier, H., Toueg, S.: Consensus with Byzantine Failures and Little System Synchrony. In: Int'l Conference on Dependable Systems and Networks (DSN 2006), pp. 147–155. IEEE Computer Society Press, Los Alamitos (2006)
2. Attiya, H., Welch, J.: Distributed Computing: Fundamentals, Simulations and Advanced Topics, 2nd edn., 414 pages. Wiley-Interscience, Hoboken (2004)
3. Cachin, C., Kursawe, K., Shoup, V.: Random Oracles in Constantinople: Practical Asynchronous Byzantine Agreement Using Cryptography. In: Proc. 19th ACM Symposium on Principles of Distributed Computing (PODC 2000), pp. 123–132 (2000)
4. Chandra, T., Toueg, S.: Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM 43(2), 225–267 (1996)
5. Delporte-Gallet, C., Devismes, S., Fauconnier, H., Larrea, M.: Algorithms for Extracting Timeliness Graphs. In: Patt-Shamir, B., Ekim, T. (eds.) SIROCCO 2010. LNCS, vol. 6058, pp. 127–141. Springer, Heidelberg (2010)
6. Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the Presence of Partial Synchrony. Journal of the ACM 35(2), 288–323 (1988)
7. Fischer, M.J., Lynch, N.A., Paterson, M.S.: Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM 32(2), 374–382 (1985)
8. Friedman, R., Mostéfaoui, A., Raynal, M.: Pmute-Based Consensus for Asynchronous Byzantine Systems. Parallel Processing Letters 15(1-2), 162–182 (2005)
9. Friedman, R., Mostéfaoui, A., Raynal, M.: Simple and Efficient Oracle-Based Consensus Protocols for Asynchronous Byzantine Systems. IEEE Transactions on Dependable and Secure Computing 2(1), 46–56 (2005)
10. Hamouma, M., Mostéfaoui, A., Trédan, G.: Byzantine Consensus with Few Synchronous Links. In: Tovar, E., Tsigas, P., Fouchal, H. (eds.) OPODIS 2007. LNCS, vol. 4878, pp. 76–89. Springer, Heidelberg (2007)
11. Hamouma, M., Mostéfaoui, A., Trédan, G.: Byzantine Consensus in Signature-free Systems. Submitted for publication
12. Lamport, L., Shostak, R., Pease, M.: The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems 4(3), 382–401 (1982)
13. Lynch, N.A.: Distributed Algorithms, 872 pages. Morgan Kaufmann, San Francisco (1996)
14. Okun, M.: Byzantine Agreement. In: Springer Encyclopedia of Algorithms, pp. 116–119. Springer (2008)
15. Pease, M., Shostak, R., Lamport, L.: Reaching Agreement in the Presence of Faults. Journal of the ACM 27(2), 228–234 (1980)
16. Raynal, M.: Fault-tolerant Agreement in Synchronous Message-passing Systems, 167 pages. Morgan & Claypool Publishers (September 2010)

GoDisco: Selective Gossip Based Dissemination of Information in Social Community Based Overlays

Anwitaman Datta and Rajesh Sharma

School of Computer Engineering, Nanyang Technological University, Singapore
{Anwitaman,raje0014}@ntu.edu.sg

Abstract. We propose and investigate a gossip based decentralized mechanism (GoDisco), inspired by social principles and behavior, to disseminate information in online social community networks, using exclusively social links and exploiting semantic context to keep the dissemination process selective to relevant nodes. Such a designed dissemination scheme, gossiping over an egocentric social network, emulates word of mouth behavior and is arguably a concept whose time has arrived; it can have interesting applications such as probabilistic publish/subscribe, decentralized recommendation and contextual advertisement systems, to name a few. Simulation based experiments show that despite using only local knowledge and contacts, the system has good global coverage and behavior.

Keywords: gossip algorithm, community networks, selective dissemination, social network, egocentric.

1 Introduction

Many modern internet-scale distributed systems are projections of real-world social relations, inheriting also the semantic and community contexts from the real world. This is explicit in some scenarios, such as online social networks, and implicit in others, such as networks derived from interactions among individuals (e.g., email exchanges) or traditional file-sharing peer-to-peer systems, where people with similar interests self-organize into a peer-to-peer overlay that is semantically clustered according to their tastes [6,13]. Recently, various efforts to build peer-to-peer online social networks (P2P OSNs) [3] are also underway. Likewise, virtual communities are formed in massively multiplayer online games (MMOGs). In many of these social information systems, it is often necessary to have mechanisms to disseminate information (i) effectively - reaching the relevant people who would be interested in the information while not bothering others who won't be, (ii) efficiently - avoiding duplication and latency, in a (iii) decentralized environment - scaling without global infrastructure, knowledge and coordination, and in a (iv) reliable manner - dealing with temporary failures of a subset of the population, and ensuring information quality.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 227–238, 2011. © Springer-Verlag Berlin Heidelberg 2011


Consider a hypothetical scenario involving circulation of a call for papers (CFP) to academic peers. One often posts such a CFP to relevant mailing lists. Such an approach is essentially analogous to a very simplistic publish/subscribe (pub/sub) system. If the scopes of the mailing lists are very restricted, then many such mailing lists would be necessary. However, people who have not explicitly subscribed to such a mailing list will miss out on the information. Likewise, depending on the scope of topics discussed in the mailing list, there may be too many posts which are actually not relevant. A CFP is also propagated by personal email among colleagues and collaborators one knows. The latter approach is unstructured and requires neither any specific infrastructure (unlike the mailing list) nor any explicit notion of subscription. Individuals make autonomous decisions based on their perception of what their friends' interests may be. Likewise, individuals may build local trust metrics to decide which friends typically forward useful or useless information, providing a personal, subjective context to determine the quality of information. On the downside, such an unstructured, word of mouth (gossip/epidemic) approach does not guarantee full coverage, and may also generate many duplicates. Such redundancy can however also make the dissemination process robust against failures. Consequently, epidemic algorithms are used in many distributed systems and applications [9,2]. Despite the well recognized role of word of mouth approaches in disseminating information in a selective manner - in real life or over the internet, such as by email or in online social networks - there has not been any algorithmic (designed) information dissemination mechanism leveraging the community structures and semantics available in social information systems. This is arguably partly because of the novelty of systems such as P2P OSNs [3], where the mechanism can be naturally applied.
We propose and investigate a gossip based decentralized mechanism, inspired by social principles and behavior, to disseminate information in a distributed setting, using exclusively social links and exploiting semantic context to keep the dissemination process selective to relevant nodes. We explore the trade-offs between coverage, spam and message duplicates, and evaluate our mechanisms over synthetic and real social communities. These designed mechanisms can be useful not only for the distributed social information systems for which we develop them, but may also have wider impact in the longer run - such as in engineering word of mouth marketing strategies, as well as in understanding the naturally occurring dissemination processes which inspired our algorithm designs in the first instance. Experiments on synthetic and real networks show that our proposed mechanism is effective and moderately efficient, though the performance deteriorates in sparse graphs. The rest of the paper is organized as follows. In Section 2, we present related work. We present our approach, including the various concepts and the algorithm, in Section 3. Section 4 evaluates the approach using synthetic and real networks. We conclude with several future directions in Section 5.

2 Related Works

Targeted dissemination of information can be carried out efficiently using a structured approach, as is the case with publish/subscribe systems [10] that rely on infrastructure like overlay based application layer multicasting [12] or gossiping techniques [2] to spread the information within a well defined group. Well defined groups are not always practical, and alternative approaches to propagate information in unstructured environments are appealing. Broadcasting information to everyone ensures that relevant people get it (high recall), but is undesirable, since it spams many others who are uninterested (poor precision). In mobile ad-hoc and delay tolerant networking scenarios, selective gossiping techniques relying on user profile and context to determine whether to propagate the information have been used. This is analogous to how multiple diseases can spread in a viral manner - affecting only a "susceptible" subset of the population. Autonomous gossiping [8] and SocialCast [7] are a few such approaches; they rely on the serendipity of interactions with like minded users, and depend on users' physical mobility that leads to such interactions. There are also many works studying naturally occurring (not specifically designed) information spread in community networks - including on the blogosphere [15,11] as well as in online social network communities [5] - and simulation based models to understand such dissemination mechanisms [1] and human behavior [19]. Our work, GoDisco, lies at the crossroads of these works - we specifically design algorithms to effectively (ideally high recall and precision, low latency) and efficiently (low spam and duplication) disseminate information in collaborative social network communities, using some simple and basic mechanisms motivated by naturally occurring human behaviors (word of mouth like) and local information only. Our work belongs to the family of epidemic and gossip algorithms [18].
However, unlike GoDisco, which carries out targeted and selective dissemination, traditional approaches [14,2,9,16] are designed to broadcast information within a well defined group, and typically assume a fully connected graph - so any node can communicate directly with any other node. In GoDisco, nodes can communicate directly only with other nodes with whom they have a social relation. While this constraint seems restrictive, it actually helps direct the propagation within a community of users with shared interest, as well as limit the spread to uninterested parties. We specifically aim to use only locally available social links, rather than forming semantic clusters, because we want to leverage the natural trust and disincentives to propagate bogus messages that exist among neighbors in an acquaintance/social graph. Similar to GoDisco, the design of JetStream [16] is also motivated by social behavior - but of a different kind, that of reciprocity. JetStream is designed to enhance the reliability and relative determinism of information propagation using gossip, but it is not designed for selective dissemination of information to an interest sub-group. A recent work, GO [20], also performs selective dissemination, but in explicit and well defined subgroups, and again assuming a fully connected underlying graph. The emphasis of GO is to piggyback multiple


messages (if possible) in order to reduce the overall bandwidth consumption of the system. Some of the ideas from these works [16,20] may be applicable as optimizations in GoDisco.

3 Approach

We first summarize the various concepts which are an integral part of the proposed solution, followed by a detailed description of the GoDisco algorithm.

3.1 Concepts

System Model: Let G(N, E) be a graph, where N represents the nodes (or users) of the network and E represents the set of edges between nodes. Edge eij ∈ E exists in the network if nodes ni and nj know each other. We define the neighborhood of node ni as the set ℵi, where nj ∈ ℵi iff eij ∈ E.

Community and Information Agent: Let I be the set of interests, representing a collection of all the different interests of all the nodes in the network. We assume each node has at least one interest, and Inj represents the collection of all the different interests of node nj. We consider a community to be a set of people with some shared interest. The social links between these people define the connectivity graph of the community. This graph may be connected, or it may comprise several connected components. Such a graph, representing a specific community, is a subgraph G′(N′, E′) where N′ ⊆ N and E′ ⊆ E such that ∃x ∈ I : ∀nj ∈ N′, x ∈ Inj. According to this definition, a node can (and, real data sets show, often does) belong to different communities at the same instance if it has varied interests. Since the subgraph G′(N′, E′) may not be connected, it may be necessary to rely on some nodes from outside the community to pass messages between these mutually isolated subsets. We however need to minimize messages to such nodes outside the community. To that end, we should try to find potential forwarders who can help in forwarding the message to such isolated but relevant nodes. We identify suitable forwarders based on several factors: (i) interest similarity, (ii) the node's history as a forwarder, (iii) the degree of the node, and (iv) the activeness of the node. These forwarders, which we call information agents (IAs), help spread a message faster, even if they are not personally interested in the message.

History as Forwarder: This parameter captures how good a forwarder a node is. The rationale behind this approach is the principle of reciprocity from social sciences.
Reciprocity is also used in JetStream [16], and we reuse the same approach.

Activeness of a Node: An active node in the network can play a crucial role in the quick dissemination of information, making it a good potential information agent. One way to measure a user's activeness is in terms of the frequency of visits to an online social network, which can be inferred from any kind of activity the user performs, e.g., posting messages or updating status.
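The system model above can be sketched in a few lines (the representation and names are ours, not the paper's): the graph G as a dict from node to neighbor list, interests as a dict from node to interest set, and the community subgraph G′(N′, E′) induced by a shared interest x.

```python
def community(G, interests, x):
    """Return the community subgraph G'(N', E') for interest x:
    N' = nodes whose interest set contains x, E' = edges of G between them.
    Nodes are assumed orderable so each undirected edge appears once."""
    nodes = {n for n, I in interests.items() if x in I}
    edges = {(u, v) for u in nodes for v in G[u] if v in nodes and u < v}
    return nodes, edges
```

Note that the returned subgraph may well be disconnected, which is exactly why information agents are needed to bridge its isolated parts.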


Duplication avoidance: We use the social triads [17] concept to avoid message duplication. In a social network, nodes can typically be aware of all the neighbors of each of their neighboring nodes. This knowledge can be exploited to reduce duplication.

Interest classification and node categorization: Interests can typically be classified using a hierarchical taxonomy. For simplicity, as well as to avoid too sparse and poorly connected communities, we restrict it to two levels, named (i) Main Category (MC) and (ii) Subcategory (SC). For example, web mining and data mining are closer to machine learning than to communication networks. So data mining, web mining and machine learning would belong within one main category, while networks related topics would belong to a different main category. Within a main category, the similarity of interest among different topics can again vary by different degrees; for example, web mining and data mining are relatively similar compared to machine learning. So at the next level of categorization we put data mining and web mining under one subcategory, and machine learning under another subcategory. To summarize, two different main categories are more distant than two different subcategories within the same main category.

Information Agent Classification: Based on interest similarity, we categorize different levels of IAs with respect to their tolerance for spam messages. If a message's interest falls under subcategory SC11 of MC1, we classify the levels of IA in the following manner:

Level 1: Nodes having interest in a different subcategory under the same main category (e.g., a node having interest in SC12 of MC1). Ideally such nodes should not be considered as spammed nodes, as they have highly similar interests and might also be interested in this kind of message.

Level 2: For nodes having a good history as a forwarder, an irrelevant message can be considered as spam.
However, they have a high tolerance for such messages - that is why they have a good forwarding history. These nodes will typically cooperate with others based on the principle of reciprocity.

Level 3: Highly active nodes with no common interest can play an important role in quick dissemination, but their high level of activity does not imply that they would be tolerant of spam, and such information agents should be chosen with the lowest priority.

While selecting IAs from non-similar communities, most nodes should be chosen from Level 1, since they are more likely to be able to connect back to other interested users, and then from Levels 2 and 3.

3.2 GoDisco

There are two logically-independent modules in GoDisco - the control phase runs in the background to determine system parameters and self-tune the system, while the dissemination phase carries out the actual dissemination of the messages.


1. Control Phase. Each node regularly transmits its interest and degree information to its neighbors. Each node also monitors (infers) its neighbors' activeness and forwarding behavior. The latter two parameters are updated after each dissemination phase. Neighboring nodes that forward a message further to more neighbors are rewarded in the future (based on the reciprocity principle). Also, nodes that are more active are considered better IAs (potential forwarders) than less active nodes. Every node maintains a tuple ⟨h, d, a⟩ for each of its neighbors, reflecting that neighbor's history as forwarder, degree and activeness, respectively. During the dissemination phase, non-relevant nodes are ranked according to a weighted sum hα + dβ + aγ to determine potential information agents (IAs), where α, β, γ are parameters that set the priority of the three variables such that α + β + γ = 1.

2. Dissemination Phase. We assume that the originator of the message provides the necessary meta-information - a vector of interest categories that the message fits (msgprofile) - as well as a tuple of dissemination parameters (their use is explained below). The message payload is thus: <message, msgprofile, parameters, dampingflag>. dampingflag is a flag used to dampen the message propagation when no more relevant nodes are found. A time to live could also be used instead. Algorithm 1 illustrates the logic behind sending the message to relevant nodes, collecting non-relevant nodes, ranking their suitability and using them as information agents (IAs) for disseminating a message based on multiple criteria. We have adopted an optimistic approach for forwarding a message to neighboring nodes. A node forwards a message to all its neighbors with at least some matching interest in the message. For example, a node with interests in data mining and peer-to-peer systems will get a message intended for people with interests in data mining or web mining.
Of the nodes that are not directly relevant, we determine those with a common interest within the main categories of the message and the users' profiles (using MC(·)), helping identify Level 1 IAs, and forward the message to these Level 1 IAs probabilistically with probability p1, which is a free parameter. For Level 1 IAs the message, though possibly not directly relevant, is still somewhat relevant, and the "perception of spam" is expected to be weaker. If all the neighbors of a node are totally non-relevant, then the dissemination stops, since such a node is at the boundary of the community. However, the existence of some isolated islands, or even of a large community with the same interests, is possible. To alleviate this, some boundary nodes can probabilistically send random walkers. We implement such random walkers with preference to neighbors with a higher score (hα + dβ + aγ) and limit the random walks with a timeout (in the experiments we chose a timeout equal to the network's diameter). If a random walker revisits a specific node, it is forwarded to a different neighbor than in the past, with a ceiling on the number of times such revisits are forwarded (in the experiments, this was set to two).
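The forwarding decision described above - deliver to relevant neighbors, probabilistically to Level 1 IAs, and fall back on the top-ranked non-relevant neighbors only at the community boundary - can be sketched as follows. This is an illustrative reading of the scheme, not the authors' code; all names are ours, and the control-phase score is passed in as a function.

```python
def forward(neighbors, msg_profile, interests, mc, p1, score, X, rng):
    """One GoDisco forwarding step (sketch of the logic of Algorithm 1).
    msg_profile / interests[n]: sets of interest tags; mc: maps a tag set to
    its set of main categories; score(n): the weighted sum h*alpha + d*beta +
    a*gamma from the control phase; X: percentage of top-ranked non-relevant
    neighbors used as IAs at the community boundary."""
    non_relev = []
    deliver = []
    for n in neighbors:
        if msg_profile & interests[n]:
            deliver.append(n)                     # some matching interest
        elif mc(msg_profile) & mc(interests[n]):
            if rng.random() < p1:                 # Level 1 IA, probability p1
                deliver.append(n)
        else:
            non_relev.append(n)
    if len(non_relev) == len(neighbors):          # boundary of the community
        ranked = sorted(non_relev, key=score, reverse=True)
        deliver = ranked[: max(1, int(len(ranked) * X / 100))]
    return deliver
```

With p1 = 1 every Level 1 IA is reached; lowering p1 trades spam for coverage, which is the trade-off explored in the evaluation.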

4 Evaluation

We evaluate GoDisco based information dissemination on both real and synthetically generated social networks. User behavior vis-a-vis the degree of activeness


and history of forwarding were assigned uniformly at random on some arbitrary scale (and normalized to a value between 0 and 1).

1. Synthetic graphs. We used a preferential attachment based Barabási graph generator1 to generate a synthetic network of ∼20000 nodes and a diameter of 7 to 8 (calculated using ANF [4]).

Algorithm 1. GoDisco: Actions of node ni (which has a relevant profile, i.e., msgprofile ∩ Ini ≠ ∅) with neighbors ℵi upon receiving message payload <message, msgprofile, parameters, dampingflag> from node nk

1: for ∀nj s.t. nj ∈ ℵi ∧ nj ∉ ℵk do
2:   if msgprofile ∩ Inj ≠ ∅ then
3:     DelvMsgTo(nj); {Forward message to neighbors with some matching interest in the message}
4:   else
5:     if MC(msgprofile) ∩ MC(Inj) ≠ ∅ then
6:       With p1 probability DelvMsgTo(nj); {Message is forwarded with probability p1 to Level 1 IAs}
7:     else
8:       NonRelv ← nj; {Append to a list of nodes with no (apparent) common interest}
9:     end if
10:   end if
11: end for
12: if |NonRelv| == |ℵi| then
13:   if dampingflag == TRUE then
14:     Break; {Stop and do not execute any of the steps below. Note: in the experimental evaluation, the "no damping" scenario corresponds to not having this IF statement/not bothering about the damping flag at all}
15:   else
16:     dampingflag := TRUE; {Set the damping flag}
17:     Sort NonRelv in descending order of score hα + dβ + aγ; {⟨h, d, a⟩ obtained from the control phase; α, β, γ from the parameters set by the message originator}
18:     IANodes := top X% of the sorted nodes;
19:     for ∀nj ∈ IANodes do
20:       DelvMsgTo(nj); {Carry out control phase tasks in the background}
21:     end for
22:   end if
23: else
24:   dampingflag := FALSE {Reset the damping flag}
25: end if

Random cliques: People often share some common interests with their immediate friends and form cliques, but do not form a contiguous community. For example, all soccer fans in a city do not necessarily form one community; instead smaller

1 http://www.cs.ucr.edu/~ddreier/barabasi.html


bunches of buddies pursue the interest together. To emulate such behavior, we pick random nodes with a relatively small number of neighbors (between 50 and 200) and assign these cliques some common interest.

Associativity based interest assignment: People often form a contiguous subgraph within the population where all (or most) members of the subgraph share some common interest, and form a community. This is particularly true in professional social networks. To emulate such a scenario, we randomly picked a node (the center node), applied a restricted breadth first search covering ∼1000 nodes, and assigned the interest of these nodes to be similar to that of the center node, fading the associativity with distance from the center node.

Totally random: Interests of the nodes were assigned uniformly at random.

2. Real network - DBLP network: We use the giant connected component of the co-authorship graph from DBLP records of papers published in 1169 unique conferences from 2004 to 2008, which comprises 284528 unique authors. We classified the categories of the conferences (e.g., data mining, distributed systems, etc.) to determine the interest profiles of the authors.

4.1 Results

In this section we describe the various metrics we observed in our experiments. The default parameter values were p1 = 0.5, X = 10%, α = 0.50, β = 0.30 and γ = 0.20. Other choices provided similar qualitative results. We also expose results from a limited exploration of parameters in a brief discussion later (see Figure 3).

1. Message dissemination: Figure 1 shows the spread of dissemination over time in the various networks. The plots compare three different mechanisms of propagation - (i) with damping (D), (ii) without damping (ND), and (iii) the additional use of random walkers (RW) in the case with damping - and plot the number of relevant nodes (R) and total nodes including non-relevant ones (T) who receive the message. With the use of the damping mechanism, the number of non-relevant nodes dropped sharply, but with only a small loss in coverage of relevant nodes. This shows the effectiveness of the damping mechanism in reducing spam. Using random walkers provides better coverage of relevant nodes, while only marginally more non-relevant nodes receive the message. This effect is most pronounced in the case of the DBLP graph. The associativity based synthetic graph better resembles real networks. A gossiping based mechanism is also expected to work better in communities which are not heavily fragmented. So we will mostly confine our results to the associativity based and DBLP graphs due to space constraints, even though we used all the networks for all the experiments described next. If qualitatively different results are observed for the other graphs, we will mention them as and when necessary.

2. Recall: Recall is the ratio of the number of relevant nodes who get a message to the total number of relevant nodes in the graph. We compare the recall for the damping (D) vs non-damping (ND) mechanisms, shown in Figures 2(a) and 2(b)

Fig. 1. Message dissemination: (a) Random Cliques (Rand. Cliq.), (b) Total Random (Tot. Rand.), (c) Associativity (Asst.), (d) DBLP

Fig. 2. Recall (R) & Precision (P) for DBLP and Associativity (Asst): (a) Recall, Associativity; (b) Recall, DBLP; (c) Precision, Associativity; (d) Precision, DBLP

for the associativity based and DBLP networks respectively. Use of the damping mechanism leads to a slight decrease in recall. In the associativity based interest network, the recall value reaches very close to 1. Even in random cliques, the recall value reached very close to one, though relatively more slowly than in the associativity based network; but with totally random assignment of individuals' interests, a recall value of only up to 0.9 could be achieved (non-damping), while random walkers provide relatively more improvement (recall of around 0.8) than the case of using damping but no random walkers (recall of roughly 0.7), demonstrating the limitations of a gossip based approach if the audience is highly fragmented, as well as the effectiveness of random walkers in reaching some of such isolated groups at a low overhead. In the case of the DBLP network (Figure 2(b)) the recall value is reasonably good. Since the DBLP graph comprises an order of magnitude more nodes (around fourteen times more) than the synthetic graphs, the absolute numbers observed in the DBLP graph cannot directly be compared to the results observed in the synthetic graphs. The use of random walkers provides significant improvements in the dissemination process.

3. Precision: Precision is the ratio of the number of relevant nodes who get a message to the total number of nodes (relevant plus irrelevant) who get the message. Figures 2(c) and 2(d) show the precision value measured for the associativity based and DBLP networks respectively. We notice that in the DBLP network, with real semantic information, the precision is in fact relatively better than what is achieved in the synthetic associativity based network. From

A. Datta and R. Sharma

Fig. 3. α and γ comparison on the random clique based network, Eﬀect of Feedback on Recall (R) & Precision (P)

this we infer that in real networks, the associated semantic information in fact enables a more targeted dissemination. We also observe the usual trade-off with the achieved recall. 4. Parameter exploration: In the associativity-based network, because of a tightly coupled contiguous community, the quality of dissemination is not very sensitive to the parameter choice, while in scattered networks like random cliques it is. To evaluate the effect of γ in the network, we perform experiments on the random-cliques-based network with γ=0.60, β=0.30 and α=0.10, and compare these with the scenarios using the default values of α=0.50, β=0.30 and γ=0.20. A greater value of γ puts a greater emphasis on highly active users as potential IAs, who can help improve the speed of dissemination, but at the cost of spamming more uninterested users. The results shown in Figures 3(a) and 3(b) confirm this intuition. Figure 3(a) shows the total nodes (T) and relevant nodes (R) receiving a message. Figure 3(b) compares the recall and precision for the two choices of parameters. 5. Message duplication: Nodes may receive duplicates of the same message. We use proximity to reduce such duplicates. Figures 4(a) and 4(c) show, for the associativity-based and DBLP networks respectively, the number of duplicates avoided during the dissemination process, both for nodes for whom the message is relevant (Relv) and for nodes for whom it is irrelevant (IrRelv). It is interesting to note that with damping, the absolute number of irrelevant nodes getting the message is already low, so the savings in duplication are naturally low as well. The results show that the use of proximity is an effective mechanism to significantly reduce such unnecessary network traffic. 6. Volume of duplicates: Figures 4(b) and 4(d) measure the volume of duplicates received by individual nodes during the dissemination process, for the associativity-based and DBLP networks respectively.
The observed trade-offs of using random walkers with damping, or of not using damping at all, are intuitive. 7. Effect of Feedback: Figure 3(c) shows the effect of feedback on recall for the DBLP network. In the case of the associativity-based network, community members are very tightly coupled with few exceptions, so there is not much visible improvement with feedback (not shown). However, in the case of DBLP, for non-damping

GoDisco: Selective Gossip Based Dissemination of Information

Fig. 4. Duplication saved using proximity (DS) & Duplicates received (DR) for DBLP and Associativity (Asst)

as well as random-walk-based schemes, where non-relevant nodes are leveraged for disseminating the information, we observe a significant improvement in recall when comparing the spread of the first message in the system with that of the thousandth message (identified with an 'F' in the legends to indicate the scenarios with feedback). The feedback mechanism helps the individual nodes self-organize and choose better information agents over time, which accelerates the process of connecting fragmented communities. Interestingly, this improvement in recall does not compromise the precision, which in fact even improves slightly, further confirming that the feedback steers the dissemination process towards more relevant information agents.
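The recall and precision metrics used throughout this evaluation reduce to simple set arithmetic over node identifiers. The following sketch (our own illustration, not code from the GoDisco implementation) computes both for a single dissemination:

```python
def recall_precision(received, relevant):
    """Recall: fraction of relevant nodes that received the message.
    Precision: fraction of receivers for whom the message was relevant."""
    received, relevant = set(received), set(relevant)
    hits = received & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(received) if received else 0.0
    return recall, precision

# Toy example: 8 of 10 relevant nodes reached, 2 irrelevant nodes spammed.
r, p = recall_precision(received=range(10), relevant=range(2, 12))
print(r, p)  # 0.8 0.8
```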

5 Conclusion and Future Work

We have described and evaluated a selective gossiping based information dissemination mechanism, GoDisco, which constrains communication to nodes that are socially connected to each other, and leverages users' interests and the fact that users with similar interests often form a community, in order to perform directed dissemination of information. GoDisco is nevertheless a first work of its kind, leveraging interest communities for directed information dissemination using exclusively social links. We have a multidirectional plan for future extensions of this work. We plan to apply GoDisco in various kinds of applications, including information dissemination in peer-to-peer online social networks [3] - for example, for probabilistic publish/subscribe systems, or to contextually advertise products to other users of a social network. We also plan to improve the existing schemes in various ways, such as incorporating security mechanisms and disincentives for antisocial behavior.

References
1. Apolloni, A., Channakeshava, K., Durbeck, L., Khan, M., Kuhlman, C., Lewis, B., Swarup, S.: A study of information diffusion over a realistic social network model. In: Int. Conf. on Computational Science and Engineering (2009)
2. Birman, K.P., Hayden, M., Ozkasap, O., Xiao, Z., Budiu, M., Minsky, Y.: Bimodal multicast. ACM Trans. Comput. Syst. 17(2) (1999)
3. Buchegger, S., Schiöberg, D., Vu, L.H., Datta, A.: PeerSoN: P2P social networking - early experiences and insights. In: Proc. of the 2nd ACM Workshop on Social Network Systems (2009)
4. Palmer, C.R., Gibbons, P.B., Faloutsos, C.: ANF: A fast and scalable tool for data mining in massive graphs. In: KDD (2002)
5. Cha, M., Mislove, A., Gummadi, K.P.: A measurement-driven analysis of information propagation in the Flickr social network. In: WWW (2009)
6. Cholvi, V., Felber, P., Biersack, E.W.: Efficient search in unstructured peer-to-peer networks. In: SPAA (2004)
7. Costa, P., Mascolo, C., Musolesi, M., Picco, G.P.: Socially-aware routing for publish-subscribe in delay-tolerant mobile ad hoc networks. IEEE Journal on Selected Areas in Communications 26(5) (June 2008)
8. Datta, A., Quarteroni, S., Aberer, K.: Autonomous gossiping: A self-organizing epidemic algorithm for selective information dissemination in wireless mobile ad-hoc networks. Semantics of a Networked World (2004)
9. Demers, A., Greene, D., Hauser, C., Irish, W., Larson, J., Shenker, S., Sturgis, H., Swinehart, D., Terry, D.: Epidemic algorithms for replicated database maintenance. In: PODC. ACM, New York (1987)
10. Eugster, P.T., Felber, P.A., Guerraoui, R., Kermarrec, A.-M.: The many faces of publish/subscribe. ACM Comput. Surv. 35(2) (2003)
11. Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A.: Information diffusion through blogspace. In: WWW (2004)
12. Hosseini, M., Ahmed, D.T., Shirmohammadi, S., Georganas, N.D.: A survey of application-layer multicast protocols 9(3) (2007)
13. Iamnitchi, A., Ripeanu, M., Santos-Neto, E., Foster, I.: The small world of file sharing. IEEE Transactions on Parallel and Distributed Systems
14. Karp, R., Schindelhauer, C., Shenker, S., Vöcking, B.: Randomized rumor spreading. In: FOCS. IEEE Computer Society, Los Alamitos (2000)
15. Kumar, R., Novak, J., Raghavan, P., Tomkins, A.: On the bursty evolution of blogspace. World Wide Web 8(2) (2005)
16. Patel, J.A., Gupta, I., Contractor, N.: JetStream: Achieving predictable gossip dissemination by leveraging social network principles. In: Proceedings of the Fifth IEEE Int. Symposium on Network Computing and Applications (2006)
17. Rapoport, A.: Spread of information through a population with socio-structural bias: I. Assumption of transitivity. Bulletin of Mathematical Biophysics 15 (1953)
18. Shah, D.: Gossip algorithms. Found. Trends Netw. 3(1) (2009)
19. Song, X., Lin, C.-Y., Tseng, B.L., Sun, M.-T.: Modeling and predicting personal information dissemination behavior. In: KDD 2005: Proceedings of the Eleventh ACM SIGKDD Int. Conf. on Knowledge Discovery in Data Mining. ACM, New York (2005)
20. Vigfusson, Y., Birman, K., Huang, Q., Nataraj, D.P.: GO: Platform support for gossip applications. In: P2P (2009)

Mining Frequent Subgraphs to Extract Communication Patterns in Data-Centres

Maitreya Natu1, Vaishali Sadaphal1, Sangameshwar Patil1, and Ankit Mehrotra2

1 Tata Research Development and Design Centre, Pune, India
2 SAS R&D India Pvt Limited, Pune, India
{maitreya.natu,vaishali.sadaphal,sangameshwar.patil}@tcs.com, [email protected]

Abstract. In this paper, we propose to use graph-mining techniques to understand the communication pattern within a data-centre. We present techniques to identify frequently occurring sub-graphs within a temporal sequence of communication graphs. We argue that identification of such frequently occurring sub-graphs can provide many useful insights into the functioning of the system. We demonstrate how existing frequent sub-graph discovery algorithms can be modified for the domain of communication graphs in order to provide computationally light-weight and accurate solutions. We present two algorithms for extracting frequent communication sub-graphs and a detailed experimental evaluation to establish the correctness and efficiency of the proposed algorithms.

1 Introduction

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 239–250, 2011. © Springer-Verlag Berlin Heidelberg 2011

With the increasing scale and complexity of today's data-centers, it is becoming more and more difficult to analyze the as-is state of the system. Data-center operators observe a dire need for obtaining insights into the as-is state of the system, such as the inter-component dependencies, heavily used resources, heavily used communication patterns, occurrence of changes, etc. Modern enterprises support two broad types of workloads and applications: transactional and batch. The communication patterns in these systems are dynamic and keep changing over time. For instance, in a batch system, the set of jobs executed on a day depends on several factors, including the day of the week, the day of the month, the addition of new reporting requirements, etc. The communication patterns for both transactional and batch systems can be observed as a temporal sequence of communication graphs. We argue that there is a need to extract and analyze frequently occurring subgraphs derived from a sequence of communication graphs to answer various analysis questions. In scenarios where the discovered frequent subgraph is large in size, it provides a representative graph of the entire system. Such a representative graph becomes extremely useful in scenarios where the communication graphs change dynamically over time. The representative graph can

provide a good insight into the recurring communication pattern of the system. The discovered representative graph can be used in a variety of ways. For instance, (a) it can be used to predict future communication patterns; (b) it can be used to perform what-if analysis; (c) given a representative graph, various time-consuming analysis operations such as dependency discovery, building workload-resource utilization models, performing slack analysis, etc., can be done a priori in an off-line manner to aid quicker online analysis in the future. In scenarios where the discovered frequent subgraph is small in size, such a subgraph can be used to zoom into the heavily used communication patterns. The components in this graph can be identified as critical components and further analyzed for appropriate resource provisioning, load balancing, etc. In this paper we present techniques to identify frequently occurring subgraphs within a set of communication graphs. The existing graph-mining solutions [5,6,4] for frequent subgraph discovery address a more complex problem. Most of the existing techniques assume that the graph components can have non-unique identifiers; consider, for instance, the problem of mining chemical compounds to find recurrent substructures. The presence of non-unique identifiers results in an explosion in the number of possible combinations of subgraphs. Further, operations such as graph isomorphism checks become highly computation intensive. The problem of finding frequent subgraphs in a set of communication graphs is simpler, as each graph component can be assigned a unique identifier. In this paper, we argue that simpler graph-mining solutions can be developed for this problem. We present two techniques for discovering frequent subgraphs. (1) We first present a bottom-up approach where we incrementally take combinations of components and compute their support.
The key idea behind the proposed modification is that the support of subgraphs at the next level can be estimated from the support of the subgraphs at the lower levels using probability theory. (2) We then present a top-down approach where, instead of analyzing components in a graph, we consider the entire graph as a whole and analyze the set of graphs. We use simple matrix operations to mine frequent subgraphs. The key idea behind this algorithm is that the property of unique component identifiers can be used to assign a common structure to all graphs. This common structure can be used to process each graph as a whole in order to identify frequently occurring subgraphs. The main contributions of this paper are as follows: (1) We present a novel application of frequent subgraph discovery to extract communication patterns. (2) We present a modification to the existing bottom-up Apriori algorithm to improve efficiency. We also present a novel top-down approach for frequent subgraph discovery in communication graphs. (3) We apply the proposed algorithms on a real-world batch system. We also present a comprehensive experimental evaluation of the techniques and discuss the effective application areas of the proposed algorithms.

2 Related Work

Graph-mining techniques have been applied to varied domains, e.g., for extracting common patterns in chemical compounds and genetic formulae, and for extracting common structures in the Internet and social networks [3,2]. In this paper, we consider extracting commonality in communication patterns given that the communication patterns change dynamically. Our work is different in that we take advantage of the presence of unique node identifiers and propose new algorithms to identify frequent subgraphs. Frequent subgraph mining techniques proposed in the past [5], [6], [1] use a bottom-up approach. In this paper, we propose a modification to the traditional bottom-up approach and present a novel top-down approach for frequent subgraph discovery. Most of the past research assumes non-unique identifiers of components and is directed towards solving the issues related to the explosion of the candidate space due to the presence of multiple vertices and edges with the same identifier. When applied to the domain of communication graphs, the problem of mining frequent subgraphs translates to a much simpler problem due to the fact that no two nodes or links in a communication network have the same identifier. The authors of [7] propose algorithms for testing graph isomorphism and computing the largest common subgraph in trees or graphs with unique node labels. In this paper, we address the problem of mining frequent graphs in a set of graphs with unique node and link identifiers.

3 Problem Description

In this section, we first define various terms used in this paper. We then systematically map the addressed problem to the problem of frequent subgraph discovery. Finally, we present the problem definition and introduce the proposed solution approaches.

3.1 Terms Used

We first define various terms used in this paper.

Size(G): The size of a graph G is defined as the number of edges present in the graph.

Subgraph: A graph G'(V', E') is a subgraph of a graph G(V, E) if and only if V' ⊆ V and E' ⊆ E.

Support(G', G): Given a set of n graphs G = {G1, G2, ..., Gn}, a subgraph G' has a support s/n if G' is a subgraph of s graphs out of the set {G1, G2, ..., Gn}.

Frequent Subgraphs(min support, G): Given a set of graphs G = {G1, G2, ..., Gn} and the required minimum support min support, a subgraph G' is a frequent subgraph if and only if Support(G', G) ≥ min support. Note that the resulting frequent subgraph can be a disconnected graph.
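Because every node and edge carries a unique identifier, a communication graph can be modelled simply as a set of edges, and the definitions above reduce to set operations. A minimal sketch (function and variable names are ours, not the paper's):

```python
def support(sub_edges, graphs):
    """Support(G', G): fraction of the n graphs that contain every edge of G'."""
    return sum(sub_edges <= g for g in graphs) / len(graphs)

def is_frequent(sub_edges, graphs, min_support):
    """Frequent Subgraphs(min_support, G): Support(G', G) >= min_support."""
    return support(sub_edges, graphs) >= min_support

# Graphs as sets of edges; edges as pairs of unique node identifiers.
G = [{('a', 'b'), ('b', 'c')}, {('a', 'b'), ('c', 'd')}, {('a', 'b'), ('b', 'c')}]
print(support({('a', 'b')}, G))          # 1.0
print(is_frequent({('b', 'c')}, G, 2/3)) # True
```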

3.2 Problem Definition

We now map the problem of identifying communication patterns to the problem of frequent subgraph discovery. Consider a network topology T represented by a graph GT(VT, ET), where the vertices VT represent network nodes and the edges ET represent network links. Each vertex (and each edge) in the graph can be identified using a unique identifier. The problem of identifying communication patterns in the network from a set of communication traces can be mapped to the problem of frequent subgraph discovery as follows. 1. A communication trace C consists of the links (and nodes) being used by the system over time. 2. The communication trace Ct, at time t, can be represented as a graph Gt(Vt, Et), where the sets Vt and Et represent the network nodes and links being used in time-window t. (Note that Vt ⊆ VT and Et ⊆ ET.) 3. A set of communication traces C1, C2, ..., Cn can be represented by a set of graphs G1(V1, E1), G2(V2, E2), ..., Gn(Vn, En). 4. The problem of identifying frequent communication patterns in a trace C can then be mapped to the problem of identifying frequent subgraphs in the set of graphs G1(V1, E1), G2(V2, E2), ..., Gn(Vn, En). The problem can then be defined as follows: Given a set of graphs G1, G2, ..., Gn and a required support of min sup, compute the set F of all frequently occurring subgraphs F1, F2, ..., Fm.
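The trace-to-graph mapping in steps 1-3 can be sketched as follows; the record format (timestamp, source, destination) is our assumption about what a communication trace contains:

```python
from collections import defaultdict

def trace_to_graphs(trace, window):
    """Bucket (timestamp, src, dst) link-usage records into a temporal
    sequence of communication graphs G_t, each a set of directed edges."""
    buckets = defaultdict(set)
    for t, src, dst in trace:
        buckets[t // window].add((src, dst))
    return [buckets[w] for w in sorted(buckets)]

trace = [(1, 'a', 'b'), (3, 'b', 'c'), (12, 'a', 'b'), (14, 'c', 'd')]
graphs = trace_to_graphs(trace, window=10)
print([sorted(g) for g in graphs])  # [[('a', 'b'), ('b', 'c')], [('a', 'b'), ('c', 'd')]]
```

The resulting list of edge sets is exactly the input expected by the mining algorithms in Sections 4 and 5.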

Fig. 1. (a) Example graphs, (b) Frequent subgraphs with minimum support of 2/3

Fig. 2. (a) Execution of Apriori algorithm, (b) Execution of Approx-Apriori algorithm

4 Proposed Bottom-Up Approach: Algorithm Approx-Apriori

In the past, the problem of identifying frequent subgraphs has been solved using bottom-up approaches. Apriori [4] is a representative algorithm of this category. The traditional Apriori algorithm broadly involves two steps: Step 1: Generate subgraphs of size (k + 1) by identifying pairs of frequent subgraphs of size k that can be combined. Step 2: Compute the support of the subgraphs of size (k + 1) to identify frequent subgraphs. We use a running example to explain the Apriori algorithm and the other algorithms presented in the next sections. Running example: We present an example consisting of a set of three graphs shown in Figure 1a. We consider min support = 2/3. We show the execution of the traditional Apriori algorithm on this example. Figure 2a shows the subgraphs of size 1, 2, 3, and 4 generated by the iterations of the Apriori algorithm. Consider the subgraphs of size 1. The subgraph with the single edge (c − e) has a support less than 2/3 and is discarded from further analysis. The remaining subgraphs are used to build subgraphs of size 2. A similar process continues in subsequent iterations to identify subgraphs of size 3 and 4. This approach results in two kinds of overhead: (1) joining two size-k frequent subgraphs to generate one size-(k + 1) subgraph, and (2) counting the frequency of these subgraphs. Step (1) involves making all possible combinations of size-k subgraphs, resulting in a large number of candidates and making the approach computation intensive. The problem becomes more acute when the minimum support required for a subgraph to be identified as frequent is smaller. In this paper, we present a modification of the first step of the Apriori algorithm that prunes the search space. We modify the first step of the Apriori algorithm by intelligently selecting the size-k subgraphs. In this step, we propose to estimate the support of the generated subgraphs of size k using the support of the constituent size-(k − 1) subgraphs.
We use the support of a subgraph of size (k − 1) as the probability of its occurrence in the given set of graphs. Thus, given the probabilities of occurrence of two subgraphs of size (k − 1), their product is used as the probability of occurrence of the subgraph of size k. This is used as an estimate of the support of the size-k subgraph. We prune a size-k subgraph if the estimated support is less than the desired support, pruning min support. Note that the above computation assumes that the occurrences of the two size-(k − 1) subgraphs are independent. Thus the estimated support tends to move away from the actual support in situations where the independence assumption does not hold. Furthermore, the error propagates in subsequent iterations as the graphs grow in size, which may result in larger inaccuracy. We hence propose to relax the pruning min support with every iteration. The pruning min support is equal to min support in the first iteration. We decrease the pruning min support by a constant REDUCTION FACTOR in every iteration. The pruning thus performed narrows down the search space and decreases the execution time. In Section 7, we show through experimental evaluation that

with appropriate use of REDUCTION FACTOR, the Approx-Apriori algorithm gives reasonably accurate results. We next present the various steps involved in the proposed approach.

Input: (1) G = {G1(V1, E1), ..., Gn(Vn, En)}: set of graphs. (2) min support: Minimum support required to declare a subgraph as frequent.

Output: Set of frequently occurring subgraphs.

Initialization:
1. Generate a set GS1 of subgraphs of size 1, one for each edge in E, where E = E1 ∪ ... ∪ En.
2. Remove a subgraph Gi from GS1 if Support(Gi, G) < min support.
3. Set k = 2, where k is the size of the subgraphs; k = k + 1 with every iteration.
4. pruning min support = min support.

Step 1 - Generate subgraphs of size k by identifying pairs of frequent subgraphs of size k − 1 that can be combined: Two subgraphs Gi and Gj of size k − 1 are combined to create a subgraph Gij of size k if and only if they differ in exactly one edge, i.e., |Ei ∩ Ej| = k − 2.
- Estimate the support of the subgraph Gij: Estimated Support(Gij) = Prob(Gij), where Prob(Gij) = Prob(Gi) ∗ Prob(Gj), and the support of a size-(k − 1) subgraph is used as the probability of its occurrence.
- Prune the subgraph Gij if Estimated Support(Gij) < pruning min support.
- Decrease the pruning min support for the next iteration: pruning min support = pruning min support − REDUCTION FACTOR.

Step 2 - Compute the support of the subgraphs of size k to identify frequent subgraphs: Repeat Step 1 and Step 2 until subgraphs of all sizes are explored.

Running example: For the running example of Figure 1a, Figure 2b shows the subgraphs of size 1, 2, 3, and 4, together with their estimated supports, generated by the iterations of the Approx-Apriori algorithm. Unlike Apriori, Approx-Apriori performs an intelligent selection of pairs using the estimated support. If the estimated support of a size-k subgraph is less than 2/3, then the constituent pair of size-(k − 1) subgraphs is never combined. For example, consider the pair of size-3 subgraphs (a − b)(b − c)(c − d) and (a − b)(b − c)(b − e), both having a support of 2/3. The estimated support of the resulting size-4 subgraph is 4/9, which is less than 2/3. Hence the size-4 subgraph (a − b)(b − c)(c − d)(b − e) is not constructed for analysis.
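The estimation-and-pruning step of Approx-Apriori can be sketched as follows, reproducing the running example's numbers; function names are ours, and candidate subgraphs are represented as plain edge sets:

```python
def try_combine(sub_i, sub_j, supp_i, supp_j, pruning_min_support):
    """Combine two size-(k-1) candidates (edge sets differing in exactly one
    edge) into a size-k candidate, unless its estimated support, the product
    of the constituents' supports (independence assumption), is too low."""
    if len(sub_i & sub_j) != len(sub_i) - 1:
        return None                  # not combinable: need k-2 common edges
    estimated = supp_i * supp_j      # estimated support of the combination
    if estimated < pruning_min_support:
        return None                  # pruned without ever counting support
    return sub_i | sub_j

# Two size-3 subgraphs with support 2/3 each: estimate 4/9 < 2/3, so pruned.
a = {('a', 'b'), ('b', 'c'), ('c', 'd')}
b = {('a', 'b'), ('b', 'c'), ('b', 'e')}
print(try_combine(a, b, 2/3, 2/3, 2/3))  # None
```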

5 Proposed Top-Down Approach: Matrix-ANDing

While building the top-down approach, we use the entire graph as a whole and exploit the following two properties of communication graphs: (1) each node in a network can be identified with a unique identifier; (2) the network topology is known a priori. These properties can be exploited to assign a common structure

Fig. 3. (a) Matrix representations of graphs X, Y, and Z: Mx, My, Mz. (b) Consolidated matrix Mc.

to all graphs. This common structure can be used to process entire graphs as a whole in order to identify frequently occurring subgraphs. In the following algorithm, we first explain this structure and show that all graphs can be represented in a similar manner. We then present a technique to process these structures to extract frequent subgraphs.

Input: (1) G = {G1(V1, E1), ..., Gn(Vn, En)}: set of graphs. (2) min support: Minimum support required to declare a subgraph as frequent.

Output: Set of frequently occurring subgraphs.

Initialization: (1) We first identify the maximal set of vertices VT as discussed above and order these vertices lexicographically. (2) We then represent each graph Gt ∈ G as a |VT| × |VT| matrix Mt using the standard binary adjacency matrix representation. Note that the nodes in all matrices are in the same order. Thus, a cell Mt[i, j] represents the same edge Eij in all the matrices.

Processing:
1. Assign a unique prime number pt as an identifier of each graph Gt ∈ G and multiply all the values in its representative matrix Mt by pt.
2. Given the matrices M1, ..., Mn, compute a consolidated matrix Mc such that, for all i, j ∈ VT, Mc[i, j] is the product of the non-zero entries among M1[i, j], ..., Mn[i, j].
3. Given the set of n graphs G = {G1, ..., Gn} and min support, compute the C(n, n · min support) combinations of n · min support graphs. Compute an identifier for each combination as the product of the identifiers of its constituent graphs; thus, the identifier for a combination of graphs G1, ..., Gk is computed as p1 ∗ ... ∗ pk. The set Gcomb consists of the identifier of each of the C(n, n · min support) combinations. Given an identifier Gcomb[i] for a combination of graphs, we can identify whether an edge represented by Mc[i, j] is present in all the graphs of that combination: by the property of prime numbers, this can be done by simply checking whether Mc[i, j] is divisible by Gcomb[i].
4. For each identifier Gcomb[i], identify the cells Mc[i, j] in the consolidated matrix Mc that are divisible by Gcomb[i]. The edges identified by these cells represent a frequently occurring subgraph with a support greater than or equal to min support.

Note that each element in the consolidated matrix is a product of prime numbers. Even if only a small number of graphs in the database share an edge, the element corresponding to that edge in the consolidated matrix can have a very large value, resulting in overflow. We propose to use compression techniques to avoid such scenarios. Bit vectors can also be used in such cases. Running example: We assign the prime numbers 3, 5, and 7 to the graphs X, Y, and Z from Figure 1a. As the maximal set of nodes in these graphs is {a, b, c, d, e, f, g}, we represent all graphs in the form of a 7 × 7 matrix. Figure 3a shows the matrix representations Mx, My, and Mz of the graphs X, Y, and Z, where the matrix of each graph has been multiplied by its identifier prime number. The consolidated matrix built from these matrices is shown in Figure 3b. From the given set of 3 graphs X, Y, and Z, in order to identify frequent subgraphs with min support = 2/3, we compute C(3, 2) = 3 combinations, viz., (X, Y), (X, Z), (Y, Z). The identifiers for the combinations (X, Y), (X, Z), and (Y, Z) are computed as 3 ∗ 5 = 15, 3 ∗ 7 = 21, and 5 ∗ 7 = 35, respectively. With the identifier of (X, Y) equal to 15, the cells in Mc that are divisible by 15 indicate the edges that are present in both graphs X and Y. This in turn also provides a frequent subgraph with a support of at least 2/3. Similarly, other frequent subgraphs can be identified by checking cells for divisibility by 21 and 35.
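The prime-labelling and divisibility test can be sketched as follows. For brevity we keep the consolidated matrix as a dictionary keyed by edge rather than a full |VT| × |VT| matrix, and the edge sets for X, Y, and Z below are illustrative stand-ins, not the actual graphs of Figure 1a:

```python
from itertools import combinations
from math import prod

def matrix_anding(graphs, primes, k):
    """For each k-subset of graphs, the edges whose consolidated entry is
    divisible by the subset's identifier (product of its primes) occur in
    every graph of the subset, i.e. form a subgraph with support >= k/n."""
    edges = set().union(*graphs)
    consolidated = {e: prod(p for g, p in zip(graphs, primes) if e in g)
                    for e in edges}
    frequent = {}
    for combo in combinations(range(len(graphs)), k):
        ident = prod(primes[i] for i in combo)
        frequent[combo] = {e for e, v in consolidated.items() if v % ident == 0}
    return frequent

X = {('a', 'b'), ('b', 'c'), ('c', 'd')}
Y = {('a', 'b'), ('b', 'c'), ('b', 'e')}
Z = {('a', 'b'), ('c', 'd'), ('f', 'g')}
freq = matrix_anding([X, Y, Z], primes=[3, 5, 7], k=2)
print(sorted(freq[(0, 1)]))  # edges common to X and Y: [('a', 'b'), ('b', 'c')]
```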

Fig. 4. (a,b) Sample snapshots of the real-life mainframe batch processing system. (c) Identiﬁed frequent subgraph.

6 Application of Proposed Technique on a Real-Life Example

In this section, we apply the proposed technique on a real-life mainframe batch processing system. This system is used at a leading financial service provider for a variety of end-of-day trade result processing. We have 6 months of data about the per-day graph of dependence among these jobs. Over the period of 6 months, we observed a total of 516 different jobs and 72702 different paths. Analyzing the per-day job precedence graphs brings out the fact that the jobs and their dependencies change over time. Figures 4(a,b) show two sample

snapshots of this changing graph. Figure 4(c) shows one of the frequent subgraphs detected by our algorithm (min support = 0.7). On average, there are about 156 processes and 228 dependence links per graph, whereas the frequently occurring subgraph in the set of these graphs consists of 98 processes and 121 dependence links.

Fig. 5. Execution time for experiments with changing (a) number of nodes, (b) average node degree, (c) level of activity, (d) min support

The frequent subgraphs discovered on the system are found to be large in size, covering more than 65% of the graph at any time instance. This insight is typically true for most back-office batch processing systems, where a large portion of communication takes place on a regular basis and only a few things change seasonally. This communication graph can then be used as a representative graph, and various analyses can be performed a priori on this representative graph in an off-line manner. This off-line analysis on the representative graph can then be used to quickly answer various analysis questions on-line about the entire system.

7 Experimental Evaluation

In this section, we present the experiment design for systematic evaluation of the algorithms proposed in this paper. We simulate systems with temporally changing communication patterns and execute the proposed algorithms to identify frequently occurring communication patterns. We generate various network topologies based on the desired number of nodes and average node degree and model each topology as a graph. For each topology, we generate a set of graphs that represents temporally changing communication patterns. We model the level of change in system-activity, c, by controlling the amount of change in links across the graphs. The default values of various experiment parameters are set as follows: Number of nodes = 10; Average node degree = 5; Change in level of activity = 0.5; Number of graphs = 10; min support = 0.3; REDUCTION FACTOR = 0; Each point plotted on the graphs is an average of the results of 10 runs.
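A minimal generator for such synthetic graph sequences might look as follows. This is our own sketch using the default parameter values listed above; the authors' exact change model is not specified here, so the stable/volatile split below is an assumption:

```python
import random

def generate_graph_sequence(n_nodes=10, avg_degree=5, activity_change=0.5,
                            n_graphs=10, seed=0):
    """Generate a random topology with the requested average degree, then a
    temporal sequence of communication graphs in which a fraction
    `activity_change` of the topology's links is resampled per snapshot."""
    rng = random.Random(seed)
    n_links = n_nodes * avg_degree // 2
    all_pairs = [(i, j) for i in range(n_nodes) for j in range(n_nodes) if i < j]
    topology = rng.sample(all_pairs, n_links)
    n_stable = n_links - int(n_links * activity_change)
    stable = set(topology[:n_stable])   # links present in every snapshot
    volatile = topology[n_stable:]      # links that toggle over time
    return [stable | set(rng.sample(volatile, len(volatile) // 2))
            for _ in range(n_graphs)]

graphs = generate_graph_sequence()
print(len(graphs), all(len(g) >= 1 for g in graphs))  # 10 True
```

Feeding such a sequence to the mining algorithms, and averaging over 10 seeds, reproduces the shape of the experiment described above.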

Fig. 6. False negatives for experiments with changing (a) number of nodes, (b) average node degree, (c) min support

7.1 Comparative Study of Different Algorithms to Observe the Effect of Various System Properties

We performed experiments to observe the effect of four system properties, viz., the number of nodes, the average node degree, the level of system activity, and the desired min support. Figure 5 presents the effect on the time taken by the three algorithms. Figure 6 presents the effect on the accuracy of the Approx-Apriori algorithm. Number of nodes: We evaluate the algorithms on graphs with 10 to 18 nodes in Figure 5a and Figure 6a. It can be seen that the execution time of the Apriori algorithms is more sensitive to the number of nodes than that of the Matrix-ANDing algorithm. The execution time of the Apriori algorithm is significantly larger than that of the Approx-Apriori algorithm. The false negatives of the Approx-Apriori algorithm increase as the size of the graph increases. This is because, with the increase in the size of the graph, the number of candidates increases and the possibility of missing a frequent subgraph increases. Average node degree: We evaluate the algorithms on graphs with node degree 3 to 7 in Figure 5b and Figure 6b. The execution time of the Matrix-ANDing algorithm is independent of the average node degree of the network. This is because its execution time mainly depends on the matrix size (number of nodes) and depends only weakly on the node degree or the number of links. The execution time of the Apriori algorithms increases as the node degree increases because of the increased number of candidate subgraphs. A larger node degree results in a larger candidate space. As a result, an increase in node degree results in an increase in false negatives. Change in the level of activity in the system: Figure 5c shows the effect of the amount of change in the level of activity in the system, c, on the execution time of the algorithms. There is no significant effect of the amount of change in the level of activity on the false negatives. The execution time of the Matrix-ANDing algorithm is independent of the level of activity in the system.
The execution time of the Apriori algorithms decreases with increase in the level of activity. As c increases, the number of frequent subgraphs decreases, resulting in the decrease in execution time of the Apriori algorithms.

Mining Frequent Subgraphs to Extract Communication Patterns


Desired min support: Figure 5d shows the effect of min support on the execution time of the algorithms, and Figure 6c shows its effect on false negatives. The execution time of the Apriori algorithms decreases as min support increases: a small min support results in a large number of frequent graphs and hence a large execution time. The Matrix ANDing algorithm behaves in an interesting manner. It processes C(n, n · min support) combinations of graphs, and this binomial coefficient is maximum when min support = 0.5; as a result, Matrix-ANDing takes the maximum time to execute when min support = 0.5. The false negatives decrease with increasing min support, since the number of frequent graphs decreases.

These experiments bring out the strengths and weaknesses of the three algorithms. (1) The execution time of the Apriori and Approx. Apriori algorithms is controlled by graph properties such as the number of nodes, the average node degree, and the level of activity, whereas the behaviour of Matrix-ANDing depends mainly on application parameters such as min support. (2) Both Apriori and Matrix-ANDing provide optimal solutions; for large values of support the Apriori algorithm requires less execution time, and for small values of support Matrix-ANDing requires less execution time. (3) In cases where some false negatives can be tolerated, the Approx. Apriori algorithm is the fastest and gives near-optimal solutions for larger values of support.
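As a rough illustration of this combinatorial behaviour, the sketch below (our own illustrative code, not the authors' implementation; snapshot graphs are modelled simply as frozen edge sets) intersects every combination of ⌈s · m⌉ of m snapshot graphs, and shows that the number of combinations processed, C(m, ⌈s · m⌉), peaks at support s = 0.5.

```python
from itertools import combinations
from math import comb, ceil

def common_subgraph(snapshots):
    """Edge-wise AND of several snapshot graphs (each a set of edges)."""
    it = iter(snapshots)
    acc = set(next(it))
    for g in it:
        acc &= g
    return acc

def matrix_anding(snapshots, min_support):
    """Top-down sketch: intersect the edge sets of every combination of
    ceil(min_support * m) snapshots; each non-empty intersection is a
    subgraph present in at least that fraction of the snapshots."""
    m = len(snapshots)
    k = ceil(min_support * m)
    frequent = set()
    for combo in combinations(snapshots, k):   # C(m, k) combinations
        sub = frozenset(common_subgraph(combo))
        if sub:
            frequent.add(sub)
    return frequent

# The work is proportional to C(m, ceil(s*m)), which peaks at s = 0.5:
m = 10
work = {s: comb(m, ceil(s * m)) for s in (0.2, 0.5, 0.8)}
```

For m = 10 snapshots the combination counts at supports 0.2, 0.5, and 0.8 are 45, 252, and 45 respectively, matching the peak at 0.5.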

7.2 Properties of Identified Frequent Subgraphs

We study the effect of different parameters on the properties of the identified frequent subgraphs, namely the number and size of the subgraphs. The number and size of the frequent subgraphs identified in a given set of graphs provide insight into the dynamic nature of the system and into the size of critical components in the graphs.

Fig. 7. Number of frequent subgraphs identified in the network with changing (a) number of nodes, (b) average node degree, (c) level of activity, (d) min support

Figure 7a and Figure 7b show the effect of an increase in the number of nodes and in the node degree of the graph, respectively. As the number of nodes or the node degree increases, the number and size of the frequent subgraphs increase.


M. Natu et al.

Figure 7c shows the effect of the minimum support, min support, on the number and size of the identified frequent subgraphs. The number and size of the frequent subgraphs decrease as the minimum support increases: the larger the minimum support, the more stringent the requirements for a graph to be declared a frequent subgraph. Figure 7d shows the effect of the amount of change in the level of activity in the system on the number and size of the frequent subgraphs. As the level of activity of the system increases, the number and size of the frequent subgraphs decrease.

8 Conclusion

In this paper, we propose the use of graph-mining techniques to understand the communication patterns within a data centre. We present techniques to identify frequently occurring subgraphs within a temporal sequence of communication graphs. The main contributions of this paper are as follows: (1) we present a novel application of frequent subgraph discovery to extract communication patterns; (2) we present a modification of the existing bottom-up Apriori algorithm to improve efficiency, as well as a novel top-down approach for frequent subgraph discovery in communication graphs; (3) we apply the proposed algorithms to a real-world batch system, present a comprehensive experimental evaluation of the techniques, and discuss the effective application areas of the proposed algorithms.


On the Hardness of Topology Inference

H.B. Acharya1 and M.G. Gouda2

1 The University of Texas at Austin, USA
[email protected]
2 The National Science Foundation, USA
[email protected]

Abstract. Many systems require information about the topology of networks on the Internet, for purposes such as management, efficiency, and the testing of new protocols. However, ISPs usually do not share their actual topology maps with outsiders; thus, in order to obtain the topology of a network on the Internet, a system must reconstruct it from publicly observable data. The standard method employs traceroute to obtain paths between nodes; next, a topology is generated such that the observed paths occur in the graph. However, traceroute has the problem that some routers refuse to reveal their addresses, and appear as anonymous nodes in traces. Previous research on the problem of topology inference with anonymous nodes has demonstrated that it is at best NP-complete. In this paper, we improve upon this result. In our previous research, we showed that in the special case where nodes may be anonymous in some traces but not in all traces (so all node identifiers are known), there exist trace sets that are generable from multiple topologies. This paper extends our theory of network tracing to the general case (with strictly anonymous nodes), and shows that the problem of computing the network that generated a trace set, given the trace set, has no general solution. The weak version of the problem, which allows an algorithm to output a "small" set of networks (any one of which is the correct one), is also not solvable. Any algorithm guaranteed to output the correct topology outputs at least an exponential number of networks. Our results are surprisingly robust: they hold even when the network is known to have exactly two anonymous nodes, and every node as well as every edge in the network is guaranteed to occur in some trace. On the basis of this result, we suggest that exact reconstruction of network topology requires more powerful tools than traceroute.

1 Introduction

Knowledge of the topology of a network is important for many design decisions. For example, the architecture of an overlay network (how it allocates addresses, etc.) may be significantly optimized by knowledge of the distribution and connectivity of the nodes in the underlay network that actually carries the traffic. Several important systems, such as P4P [9] and RMTP [7], utilize information about the topology of the underlay network for optimization as well as management. Furthermore, knowledge of network topology is useful in research, for example in evaluating the performance of new protocols. Unfortunately, ISPs do not make maps of the true network topology publicly available. Consequently, a considerable amount of research effort has been devoted to the development of systems that reconstruct the topology of networks in the Internet from publicly available data [10], [6], [4].

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 251–262, 2011. © Springer-Verlag Berlin Heidelberg 2011

The usual mechanism for generating the topology of a network is the use of Traceroute [3]. Traceroute is executed on a node, called the source, by specifying the address of a destination node. This execution produces a sequence of identifiers, called a trace, corresponding to the route taken by packets traveling from the source to the destination. A trace set T is generated by repeatedly executing Traceroute over a network N, varying the terminal nodes, i.e. the source and destination. If T contains traces that identify every instance in which an edge is incident on a node, it is possible to reconstruct the network exactly. However, practical trace sets do not have this property. The most common problems are incomplete coverage, anonymity (where a node can be detected, but will not state its unique identifier, i.e. its address), and aliasing (nodes may have multiple unique identifiers). The situation is further complicated by load balancing, which may cause incorrect traces; tools such as Paris Traceroute [8] attempt to correct this problem.

In this paper, we deal with the problem of inferring the correct network topology in the presence of anonymous nodes. The problem posed by anonymous nodes in a trace is that a given anonymous node may or may not be identical to any other anonymous node. Clearly, a topology in which these nodes are distinct is not identical to one in which they are merged into a single node. Thus, there may be multiple topologies for the computed network.
Note that all these candidate topologies can generate the observed trace set; no algorithm can tell, given the trace set as input, which of these topologies is correct. To address this problem, Yao et al. [10] have suggested computing the minimal topology: the topology of the network with the smallest number of anonymous nodes (subject to some constraints, namely trace preservation and distance preservation) from which the given trace set is generable. They conclude that the problem of computing a minimal network topology from a given set of traces is NP-complete. Accordingly, most later research in the area, such as [6] and [4], has focused on heuristics for the problem.

We attack this problem from a different direction. In our earlier papers [1] and [2], we introduced a theory of network tracing, i.e. the reconstruction of network topology from trace sets. In these papers, we made the problem theoretically tractable by assuming that no node is strictly anonymous. In this theory, a node can be irregular, meaning it is anonymous in some traces, but there must exist at least one trace in which it is not anonymous. This simplifying assumption clearly does not hold in practice; in fact, an anonymous node is almost always consistently anonymous, not irregular. (In practical cases, anonymous nodes correspond to routers that do not respond to ping; irregular nodes are routers that drop ping due to excessive load. Clearly, the usual case is for nodes to be


consistently anonymous, rather than irregular.) However, it enabled us to develop a theory for the case when the number of nodes in the network is clearly known (equal to the number of unique identifiers).

In this paper, we develop our theory of network tracing for networks with strictly anonymous nodes. Our initial assumption was that, as irregular nodes are "partially" anonymous, the hardness results in [1] should hold for anonymous nodes. To our surprise, this turned out not to be true; in Theorem 3, we show that networks with one anonymous node are completely specified by their trace sets, while networks with one irregular node are not [1]. Consequently, we constructed a completely new proof for network tracing in the presence of strict anonymity, presented in Section 3.

We show that, even under the assumption that the minimal topology is correct, the network tracing problem with anonymous nodes is in fact much harder than NP-complete; it is not just intractable, but unsolvable. Even if we weaken the problem and allow an algorithm to return a "small" number of topologies (one of which is correct), the problem remains unsolvable: an algorithm guaranteed to return the correct topology returns a number of topologies that is at least exponential in the total number of nodes (anonymous and non-anonymous). Very surprisingly, this result holds even if the number of anonymous nodes is restricted to two. We demonstrate how to construct a trace set that is generable from an exponential number of networks with two anonymous nodes, but not generable from any network with one anonymous node or fewer.

(It is interesting to note that our results are derived under a network model with multiple strong assumptions: stable and symmetric routing, no aliasing, and complete coverage. The reason we choose such friendly conditions for our model is to demonstrate that the problem cannot be made easier using advanced network tracing techniques, such as Paris Traceroute to detect artifact paths, or inference of missing links [5]. We would like to thank Dr. Stefan Schmid for this observation.)

We would like to clarify our claim that the problem of identifying the network from which a trace set was generated, given only the trace set, is unsolvable. Our proof does not involve a reduction to a known uncomputable problem, such as the halting problem. Instead, we demonstrate that there are many minimal networks (an exponential number of them) that could have generated a given trace set; so, given only the trace set, it is impossible to state with certainty that one particular topology (or even one member of a small set of topologies) represents the network from which the trace set was in fact generated. The earlier proof of NP-completeness (by a reduction to graph coloring) provided by Yao et al. holds for constructing a minimal topology, not the minimal topology from which the trace set was generated; it is NP-complete to find a single member of the exponential-sized solution set. Thus, even under the assumption that the true network is minimal in the number of anonymous nodes, trying to reconstruct it is much harder than previously thought.

In the next section, we formally define terms such as network, trace, and trace set, so as to develop our mathematical treatment of the problem.


2 Minimal Network Tracing

In this section, we present formal definitions of the terms used in the paper. We also explain our network model and the reasoning underlying our assumptions. Finally, we provide a formal statement of the problem studied.

2.1 Term Definitions

A network N is a connected graph in which nodes have unique identifiers. However, a node may or may not be labeled with its unique identifier. If a node is labeled with its unique identifier, it is non-anonymous; otherwise, it is anonymous. Further, non-anonymous nodes are either terminal or non-terminal. (These terms are used below.)

A trace is a sequence of node identifiers. A trace t is said to be generable from a network N iff the following four conditions are satisfied:
1. t represents a simple path in N.
2. The first and last identifiers in t are the unique identifiers of terminal nodes in N.
3. If a non-anonymous node "a" in N appears in t, then it appears as "a".
4. If an anonymous node "∗" in N appears in t, then it appears as "∗i", where i is an integer unique within t, used to distinguish anonymous nodes from each other.

A trace set T is generable from a network N iff the following conditions are satisfied:
1. Every trace in T is generable from N.
2. For every pair of terminal nodes x, y in N, T has at least one trace between x and y.
3. Every edge in N occurs in at least one trace in T.
4. Every node in N occurs in at least one trace in T.
5. T is consistent: for every two distinct nodes x and y, exactly the same nodes must occur between x and y in every trace in T in which both x and y occur.

We now discuss why we assume the above conditions. The first condition is obviously necessary. The third and fourth conditions are also clearly necessary, as we are interested in the problem of node anonymity, not incomplete coverage. The second and fifth conditions, however, are non-trivial; we explain them as follows. Conditions such as inconsistent or asymmetric routing may or may not hold in practice. Furthermore, it is possible, using tools like source routing and public traceroute pages, to ensure that a trace set contains traces between every possible pair of terminals.
As our primary results are negative, we show their robustness by assuming the worst case: we develop our theory assuming the best possible conditions for the inference algorithm, and prove that the results are still valid.
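The consistency condition (condition 5) lends itself to a direct check. The sketch below is our own simplified illustration, not the authors' code: traces are tuples of identifier strings, anonymous identifiers start with "*" and are normalized to a single "*" before comparison, and paths may be traversed in either direction.

```python
from itertools import combinations

def between(trace, x, y):
    """Nodes strictly between x and y in a trace, or None if either is absent."""
    if x not in trace or y not in trace:
        return None
    i, j = sorted((trace.index(x), trace.index(y)))
    return trace[i + 1 : j]

def canon(seq):
    """Collapse anonymous labels to '*' and ignore traversal direction."""
    c = tuple("*" if s.startswith("*") else s for s in seq)
    return min(c, c[::-1])

def is_consistent(traces):
    """Condition 5: for every two distinct non-anonymous nodes x and y, the
    same intermediate nodes occur in every trace containing both."""
    nodes = {n for t in traces for n in t if not n.startswith("*")}
    for x, y in combinations(sorted(nodes), 2):
        seen = set()
        for t in traces:
            b = between(t, x, y)
            if b is not None:
                seen.add(canon(b))
        if len(seen) > 1:
            return False
    return True
```

On the trace set T1,0 from Section 3 this check passes; adding a direct trace (a, b1) would contradict the anonymous hop between a and b1 and make the set inconsistent.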


In our earlier work, [1] and [2], we developed our theory using another strong condition: no node was anonymous. For a trace set to be generable from a network, we required that the unique identifier of every node in the network appear in at least one trace. However, on further study we learned that routers in a network appear anonymous because they are configured either to never send ICMP responses, or to use the destination addresses of the traceroute packets instead of their real addresses [10]. Thus, if a node is anonymous in a single trace, it is usually anonymous in all traces in a trace set. This fact reduces our earlier study of network tracing to a theoretical exercise, as its assumptions clearly cannot be satisfied. Accordingly, in this paper, we have discarded this condition and updated our theory of network tracing to include networks with anonymous nodes.

The introduction of strictly anonymous nodes leads to a complication in our theory: we no longer have all unique identifiers, and cannot be sure of the total number of nodes in the network. Hence we adopt the same approach as Yao et al. in [10] and attempt to reconstruct a topology with the smallest possible number of anonymous nodes. Accordingly, we adopt a new definition. A minimal network N from which trace set T is generable is a network with the following properties:
1. T is generable from N.
2. T is not generable from any network N′ which has fewer nodes than N.

Note that, if there are multiple minimal networks from which a trace set T is generable, then they all have the same number of nodes. Further, as all such networks contain every non-anonymous node seen in T, it follows that all minimal networks from which a trace set T is generable also have the same number of anonymous nodes.

2.2 The Minimal Network Tracing Problem

We can now state a formal definition of the problem studied in this paper. The minimal network tracing problem can be stated as follows: "Design an algorithm that takes as input a trace set T that is generable from a network, and produces a network N such that T is generable from N and, for any network N′ ≠ N, at least one of the following conditions holds:
1. T is not generable from N′.
2. N′ has more anonymous nodes than N."

The weak minimal network tracing problem can be stated as follows: "Design an algorithm that takes as input a trace set T that is generable from a network, and produces a small set S = {N1, ..., Nk} of minimal networks such that T is generable from each network in this set and, for any network N′ ∉ S, at least one of the following conditions holds:
1. T is not generable from N′.
2. N′ has more anonymous nodes than any member of S."


The minimal network tracing problem is clearly a special case of the weak minimal network tracing problem, where we consider only singleton sets to be small. In Section 3, we show that the weak minimal network tracing problem is unsolvable in the presence of anonymous nodes, even if we consider only sets of exponential size to be “not small”; of course, this means that the minimal network tracing problem is also unsolvable.

3 The Hardness of Minimal Network Tracing

In this section, we begin by constructing a very simple trace set containing only one trace, T0,0 = {(a, ∗1, b1)}, which of course corresponds to the network in Figure 1.

Fig. 1. Minimal topology for T0,0

We now define two operations to grow this network, Op1 and Op2. Op1 introduces a new non-anonymous node and a new anonymous node; the non-anonymous nodes introduced by Op1 are b-nodes. Op2 introduces a non-anonymous node, and may or may not introduce an anonymous node; if we consider only minimal networks, then Op2 introduces only non-anonymous nodes.

To execute Op1, we introduce a new b-node (say bi) connected to a through a new anonymous node ∗i. We now explain how we ensure that ∗i is a new anonymous node. Our assumption of consistent routing ensures that there are no loops in traces. Thus, we can ensure that ∗i is a "new" anonymous node (and not an "old", i.e. previously seen, anonymous node) by making it occur on a trace with every old anonymous node. To achieve this, we add traces from bi to each pre-existing b-node bj, of the form (bi, ∗ii, a, ∗jj, bj). We then use consistent routing to show that ∗i = ∗ii and ∗j = ∗jj, and hence (as we intended) ∗i ≠ ∗j.

We denote the trace set produced by applying Op1 k times to T0,0 by Tk,0. For example, after one application of Op1 to T0,0, we obtain trace set T1,0:

T1,0 = {(a, ∗1, b1), (a, ∗2, b2), (b1, ∗3, a, ∗4, b2)}

As we assume consistent routing, ∗1 = ∗3 and ∗2 = ∗4. Furthermore, as ∗3 and ∗4 occur in the same (loop-free) trace, ∗1 ≠ ∗2.


Fig. 2. Minimal topology for T1,0

There is exactly one network from which this trace set is generable; we present it in Figure 2.

We now define operation Op2. In Op2, we introduce a new non-anonymous node (ci). We add traces such that ci is connected to a through an anonymous node, and is directly connected to all b- and c-nodes. We denote the trace set produced by applying Op2 l times to Tk,0 by Tk,l. For example, one application of Op2 to the trace set T1,0 produces the trace set T1,1 given below.

T1,1 = {(a, ∗1, b1), (a, ∗2, b2), (b1, ∗3, a, ∗4, b2), (a, ∗5, c1), (b1, c1), (b2, c1)}

From Figure 3 we see that three topologies are possible: (a) ∗5 is a new node, i.e. ∗1 ≠ ∗5 and ∗2 ≠ ∗5; (b) ∗1 = ∗5; (c) ∗2 = ∗5. But network N1,1.1 is not minimal; it has one more anonymous node than networks N1,1.2 and N1,1.3. Hence, in what follows we discard such topologies and consider only the cases where the anonymous nodes introduced by Op2 are "old" (previously seen) anonymous nodes.

Fig. 3. Topologies for T1,1: (a) Network N1,1.1, (b) Network N1,1.2, (c) Network N1,1.3
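The Op1/Op2 construction is mechanical enough to script. The following sketch is our own helper, with our naming; anonymous labels are minted fresh across the whole trace set, as in the examples above.

```python
def build_trace_set(k, l):
    """Construct T_{k,l}: start from T_{0,0} = {(a, *1, b1)}, apply Op1 k
    times, then Op2 l times."""
    counter = 0
    def anon():
        nonlocal counter
        counter += 1
        return f"*{counter}"

    traces = [("a", anon(), "b1")]            # T_{0,0}
    b_nodes = ["b1"]
    for _ in range(k):                        # Op1: new b-node via new anon node
        bi = f"b{len(b_nodes) + 1}"
        traces.append(("a", anon(), bi))
        for bj in b_nodes:                    # force the new anon node to
            traces.append((bi, anon(), "a", anon(), bj))  # co-occur with old ones
        b_nodes.append(bi)
    c_nodes = []
    for _ in range(l):                        # Op2: new c-node, direct b/c links
        ci = f"c{len(c_nodes) + 1}"
        traces.append(("a", anon(), ci))
        for x in b_nodes + c_nodes:
            traces.append((x, ci))
        c_nodes.append(ci)
    return traces
```

build_trace_set(1, 0) and build_trace_set(1, 1) reproduce the three traces of T1,0 and the six traces of T1,1 shown in the text (up to the orientation of the cross traces).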


We are now in a position to prove the following theorem.

Theorem 1. For every pair of natural numbers (k, l), there exists a trace set Tk,l that is generable from (k + 1)^l minimal networks, and the number of nodes in every such network is 2k + l + 3.

Proof. Consider the following construction. Starting with T0,0, apply Op1 k times successively. This constructs the trace set Tk,0, which has k + 1 distinct anonymous nodes. Finally, apply Op2 l times in succession to get Tk,l.

We now show that Op2 indeed has the properties claimed. Every time Op2 is applied, it introduces an anonymous identifier. This identifier can correspond to a new node or to a previously seen anonymous node; as we consider only minimal networks, it must correspond to a previously seen anonymous node. There are k + 1 distinct anonymous nodes, and the newly introduced identifier can correspond to any one of them; there is no information in the trace set to decide which. Furthermore, each of these nodes is distinct, being connected to a different (non-anonymous) b-node. In other words, each choice produces a distinct topology from which the constructed trace set is generable. Hence the number of minimal networks from which the trace set Tk,l is generable is (k + 1)^l.

Further, there are 3 nodes to begin with. Every execution of Op1 adds two new nodes (a b-node and a new ∗-node), and every execution of Op2 adds one new node (a c-node). Hence, if the total number of nodes in a minimal network is n, we have n = 3 + 2k + l.

We can see that n grows linearly with k and l, while the number of candidate networks from which Tk,l is generable grows as (k + 1)^l. So, for example, if we take k = l = (n − 3)/3, the number of candidate networks is (n/3)^(n/3 − 1), which is obviously exponential. In fact, this expression is so strongly exponential that it remains exponential even in the special case where we restrict the number of anonymous nodes to exactly two.
Note that, if we execute Op1 exactly once and Op2 l times, then by the formula above the number of minimal networks is 2^l = 2^(n−5), which is O(2^n), i.e. exponential. We have proved the following theorem:

Theorem 2. For any n ≥ 6, there exists a trace set T such that: (a) n is the number of nodes in a minimal network from which T is generable; (b) every such minimal network has exactly two anonymous nodes; (c) the number of such minimal networks is O(2^n).

As an example, Figure 4 shows all 2^3 = 8 possible networks from which the trace set T1,3 is generable. We are now in a position to state our result about the minimal network tracing problem.
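The counts used in Theorems 1 and 2 are easy to sanity-check numerically; a small sketch using the proof's formulas n = 3 + 2k + l and (k + 1)^l (helper names are ours):

```python
def nodes(k, l):
    """Total nodes in a minimal network for T_{k,l}: 3 + 2k + l."""
    return 3 + 2 * k + l

def num_minimal(k, l):
    """Number of candidate minimal networks for T_{k,l}: (k+1)^l."""
    return (k + 1) ** l

# Theorem 2's special case: Op1 once (two anonymous nodes), Op2 l times,
# so n = 5 + l and the count is 2**l = 2**(n - 5).
```

For T1,3 this gives 8 nodes and 2^3 = 8 minimal networks, matching the eight topologies shown in Figure 4; and for k = l = (n − 3)/3 with n = 9 it gives 3^2 = 9 = (n/3)^(n/3 − 1).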


Theorem 3. Both the minimal network tracing problem and the weak minimal network tracing problem are unsolvable in general, but solvable in the case where the minimal network N, from which trace set T is generable, has exactly one anonymous node.

Proof. Consider any algorithm that takes a trace set and returns the correct network. If the algorithm is given as input one of the trace sets constructed in Theorems 1 and 2, it must return an exponentially large number of networks in the worst case. (If it does not return all networks from which the trace set is generable, it may fail to return the topology of the actual network from which the trace set was generated.) In other words, no algorithm that always returns a "small" number of networks can be guaranteed to have computed the correct network from the trace set; the weak minimal network tracing problem is unsolvable in general. As the minimal network tracing problem is a stricter version of this problem, it is also unsolvable.

The case where the minimal network has only one anonymous node is special. If there is only one anonymous node, there is no need to distinguish between anonymous nodes. We assign it some identifier (say x) that is not the unique identifier of any non-anonymous node, and replace all instances of "∗" by this identifier. The problem then reduces to finding a network from a trace set with no anonymous (or irregular) nodes, which is of course solvable [1]. As the minimal network tracing problem is solvable in this case, the weak minimal network tracing problem (which is easier) is solvable as well.
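The reduction used in the single-anonymous-node case is mechanical; a sketch (our own code, with traces as tuples of identifier strings and anonymous identifiers written "*1", "*2", ...):

```python
def deanonymize_single(traces, fresh="x"):
    """If the minimal network has exactly one anonymous node, every anonymous
    identifier denotes that same node, so each may be replaced by one fresh
    identifier unused by any non-anonymous node.  This reduces the instance
    to network tracing without anonymous nodes."""
    named = {n for t in traces for n in t if not n.startswith("*")}
    assert fresh not in named, "fresh identifier must be unused"
    return [tuple(fresh if n.startswith("*") else n for n in t)
            for t in traces]
```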

4 Unsolvable, or NP-Complete?

In Section 3, we demonstrated the hardness of the minimal network tracing problem in the presence of anonymous nodes, and concluded that both the strict and the weak versions of the problem are unsolvable in general. It is natural to ask how we can claim a problem to be unsolvable without reducing it to the halting problem or some other uncomputable problem. Also, it seems at first observation that our findings conflict with the earlier results of Yao et al., who found the problem of minimal topology inference to be NP-complete; an NP-complete problem lies in the intersection of NP-hard and NP, so it lies in NP and is definitely not unsolvable! In this section, we answer these questions and resolve this apparent conflict.

The problem we study is whether it is possible to identify the true network from which a given trace set T was generated in practice; in other words, to find a single network N such that T is generable from N and only from N. As there is not enough information in T to uniquely identify N (because T is generable from many minimal networks), the minimal network tracing problem is not solvable. In fact, even the weak minimal network tracing problem is not solvable, as T only provides enough information for us to identify that N is one member of an exponential-sized set (which is clearly not a small set). Thus, our statement that the problem is not solvable does not depend on proving uncomputability, but on


Fig. 4. Minimal Topologies for T1,3 (with two anonymous nodes): (a) Network N1,3.1, (b) Network N1,3.2, (c) Network N1,3.3, (d) Network N1,3.4, (e) Network N1,3.5, (f) Network N1,3.6, (g) Network N1,3.7, (h) Network N1,3.8

the fact that no algorithm can identify the correct solution out of a large space of solutions, all of which are equally good.

We now consider how our work relates to the proof of Yao et al. [10]. The resolution of our apparent conflict is that Yao et al. claim NP-completeness for the decision problem TOP-INF-DEC, which asks: "Does there exist a network, from which trace set T is generable, which has at most k anonymous nodes?" This decision problem is equivalent to the problem of exhibiting any one network from which T is generable with k or fewer anonymous nodes.


Yao et al. implicitly assume that the space of networks from which a trace set T is generable is a search space: identifying the smallest network in this space will yield the true network from which T was generated in practice. This is simply not true. The number of minimal networks from which T is generable is at least exponentially large, and as these are all minimal networks we cannot search for an optimum among them; they are all equally good solutions. (In fact, they satisfy a stronger equivalence condition than having the same number of nodes: our construction produces networks with the same number of nodes and the same number of edges.) Finding one minimal network N from which T is generable does not guarantee that N is actually the network from which T was generated! We say nothing about the difficulty of finding some minimal network from which a trace set is generable (without regard to whether it is actually the network that generated the trace set). Hence, there is no conflict between our results and the results in [10].

5 Conclusion

In our previous work, we derived a theory of network tracing under the assumption that nodes were not consistently anonymous. As we later learned that this assumption is impossible to satisfy in practice, we updated our theory to include networks with strictly anonymous nodes, which we present in this paper.

As the introduction of irregularity, a limited form of anonymity, caused the problem to become hard in our previous study, we had expected that it would become even harder when we introduced strict anonymity. To our great surprise, we found a counterexample: networks with a single anonymous node are completely specified by their trace sets (Theorem 3), while networks with a single irregular node are not (Figure 1 of [1]). We find this example very interesting, as it disproves the intuition that anonymous nodes should cause more trouble to a network tracing algorithm than irregular (partly anonymous) nodes.

In the general case, however, we prove in this paper that both the strict and the weak versions of the minimal network tracing problem are unsolvable: no algorithm can do better than reporting that the required network is a member of an exponentially large set of networks. This result holds even when the number of anonymous nodes is restricted to two.

Identifying the particular classes of networks with the property that any such network can be uniquely identified from any trace set generable from it (even if the network contains anonymous nodes) is an open problem we will attack in future research.

References

1. Acharya, H.B., Gouda, M.G.: A theory of network tracing. In: 11th International Symposium on Stabilization, Safety, and Security of Distributed Systems (November 2009)


2. Acharya, H.B., Gouda, M.G.: The weak network tracing problem. In: International Conference on Distributed Computing and Networking (January 2010)
3. Cheswick, B., Burch, H., Branigan, S.: Mapping and visualizing the internet. In: Proceedings of the USENIX Annual Technical Conference, pp. 1–12. USENIX Association, Berkeley (2000)
4. Gunes, M., Sarac, K.: Resolving anonymous routers in internet topology measurement studies. In: INFOCOM 2008: The 27th Conference on Computer Communications, pp. 1076–1084. IEEE, Los Alamitos (April 2008)
5. Gunes, M.H., Sarac, K.: Inferring subnets in router-level topology collection studies. In: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pp. 203–208. ACM, New York (2007)
6. Jin, X., Yiu, W.-P.K., Chan, S.-H.G., Wang, Y.: Network topology inference based on end-to-end measurements. IEEE Journal on Selected Areas in Communications 24(12), 2182–2195 (2006)
7. Paul, S., Sabnani, K.K., Lin, J.C., Bhattacharyya, S.: Reliable multicast transport protocol (RMTP) (1996)
8. Viger, F., Augustin, B., Cuvellier, X., Magnien, C., Latapy, M., Friedman, T., Teixeira, R.: Detection, understanding, and prevention of traceroute measurement artifacts. Computer Networks 52(5), 998–1018 (2008)
9. Xie, H., Yang, Y.R., Krishnamurthy, A., Liu, Y.G., Silberschatz, A.: P4P: provider portal for applications. SIGCOMM Computer Communications Review 38(4), 351–362 (2008)
10. Yao, B., Viswanathan, R., Chang, F., Waddington, D.: Topology inference in the presence of anonymous routers. In: IEEE INFOCOM 2003: Twenty-Second Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 1, pp. 353–363. IEEE, Los Alamitos (March-April 2003)

An Algorithm for Traffic Grooming in WDM Mesh Networks Using Dynamic Path Selection Strategy

Sukanta Bhattacharya1, Tanmay De1, and Ajit Pal2

1 Department of Computer Science and Engineering, NIT Durgapur, India
2 Department of Computer Science and Engineering, IIT Kharagpur, India

Abstract. In wavelength-division multiplexing (WDM) optical networks, the bandwidth request of a traffic stream is generally much lower than the capacity of a lightpath. Therefore, to utilize the network resources (such as bandwidth and transceivers) effectively, several low-speed traffic streams can be efficiently groomed, or multiplexed, into high-speed lightpaths, thereby improving network throughput and reducing network cost. The traffic grooming problem for a static demand is treated as an optimization problem. In this work, we propose a traffic grooming algorithm that maximizes network throughput and reduces the number of transceivers used in wavelength-routed mesh networks, together with a dynamic path selection strategy for routing requests that selects paths so that the load is distributed across the network. The efficiency of our approach is established through extensive simulation on different sets of traffic demands with different bandwidth granularities and different network topologies, and by comparison with an existing algorithm. Keywords: Lightpath, WDM, Transceiver, Grooming.

1

Introduction

Wavelength division multiplexing (WDM) technology is now widely used to expand the capacity of optical networks. It provides vast bandwidth over optical fiber by allowing simultaneous transmission of traffic on many non-overlapping channels (wavelengths). In a wavelength-routed optical network, a lightpath may be established to carry traffic from a source node to a destination node. A lightpath is established by selecting a path of physical links between the source and destination nodes and assigning a particular wavelength on each of these links. If there is no wavelength converter at intermediate nodes, a lightpath must use the same wavelength on all of its links; this restriction is known as the wavelength continuity constraint [3], [5].

An essential functionality of WDM networks, referred to as traffic grooming, is to aggregate low-speed traffic connections onto high-speed wavelength channels in a resource-efficient way, that is, to maximize the network throughput when the resources are given, or to minimize the resource consumption when the traffic requests to be satisfied are given. Efficient traffic grooming techniques (algorithms) can reduce network cost by reducing the number of transceivers, increase throughput, and save time by accommodating more low-speed traffic streams on a single lightpath.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 263–268, 2011. © Springer-Verlag Berlin Heidelberg 2011

The work in [4] and [6] investigates the static traffic grooming problem with the objective of maximizing network throughput. Zhu and Mukherjee [6] investigate traffic grooming in WDM mesh networks with the objective of improving network throughput. They present an integer linear programming (ILP) formulation of the traffic grooming problem and propose two heuristics, Maximizing Single-Hop Traffic (MST) and Maximizing Resource Utilization (MRU), to solve the GRWA (grooming, routing, and wavelength assignment) problem. Subsequently, a global approach for designing reliable WDM networks and grooming the traffic was presented in [1], and an approach to traffic grooming, routing, and wavelength assignment in optical WDM mesh networks based on clique partitioning [2] motivated us to use the concept of reducing the total network cost to solve the GRWA problem, which is presented in the following sections.

The grooming problem consists of two interconnected parts: (a) designing lightpaths, which includes specifying the physical route of each path; and (b) assigning each packet stream to a sequence of lightpaths. The work proposed in this paper addresses the static GRWA problem in WDM mesh networks with a limited number of wavelengths and transceivers; the proposed approach allows single-hop and multi-hop grooming, similar to [6]. The objective of this work is to maximize the network throughput in terms of total successfully routed traffic and to reduce the number of transceivers used. The performance of our proposed approach is evaluated through extensive simulation on different sets of traffic demands with different granularities for different network topologies. The results show that the proposed approach performs better than the existing traffic grooming algorithm, Maximizing Single-Hop Traffic (MST). The problem formulation is presented in Section 2. Section 3 gives a detailed description of the proposed algorithm. Section 4 contains experimental results and a comparison with previous work. Finally, Section 5 concludes the paper.

2

General Problem Statement

Given a network topology represented as a directed connected graph G(V, E), where V and E are the sets of optical nodes and bi-directional links (edges) of the network, respectively, a number of transceivers at each node, a number of wavelengths on each fiber, the capacity of each wavelength, and a set of connection requests with different bandwidth granularities, our objective is to set up lightpaths and multiplex low-speed connection requests on the same lightpath such that the network throughput, in terms of total successfully routed low-speed traffic, is maximized and the number of transceivers used to satisfy the requests is minimized. Since the traffic grooming problem is NP-complete, an efficient heuristic approach is appropriate. In the next section we propose a heuristic approach to solve the traffic grooming problem.
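As a concrete illustration of grooming (a hypothetical sketch with made-up numbers, not the paper's algorithm): several OC-1/OC-3/OC-12 requests can share one OC-48 lightpath, and a simple first-fit-decreasing packing shows how many lightpaths, and hence transceiver pairs, a demand set needs.

```python
# Hypothetical illustration (not the paper's algorithm): groom low-speed
# requests onto lightpaths of capacity OC-48, first-fit, largest first.

OC48 = 48

def groom(requests):
    """requests: bandwidths in OC-1 units; returns the load of each lightpath."""
    lightpaths = []                       # current load per lightpath (OC-1 units)
    for r in sorted(requests, reverse=True):
        for i, load in enumerate(lightpaths):
            if load + r <= OC48:          # fits on an existing lightpath
                lightpaths[i] = load + r
                break
        else:
            lightpaths.append(r)          # open a new lightpath (a transceiver pair)
    return lightpaths

# Four OC-12, three OC-3 and three OC-1 requests (60 OC-1 units in total)
# fit on two lightpaths instead of ten.
print(groom([12, 12, 3, 3, 3, 1, 1, 12, 12, 1]))  # → [48, 12]
```

This sketch ignores routing and wavelength assignment; in the full problem the grooming decision is coupled with the path and wavelength chosen for each lightpath.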


3


Proposed Approach

In this section, we propose the Traffic Grooming 2 (TG2) algorithm, based on a dynamic path selection strategy, for the GRWA problem. Our proposed approach has two steps, similar to [6]. In the first step, we construct a virtual topology, trying to satisfy the given requests (in decreasing order of request size) in a single hop using a single lightpath. In the second step, we try to satisfy the leftover blocked requests through multiple hops (in decreasing order of request size), using the spare capacity of the virtual topology created in the first step. The leftover requests are sorted and we try to satisfy them one by one with a single hop. As soon as one request is satisfied by a single hop, we try to satisfy all leftover requests by multiple hops, since the new lightpath created in the single-hop step may allow some requests to be satisfied in multiple hops; this reduces the number of transceivers used. The process is repeated on the leftover requests until all resources are exhausted, all requests are satisfied, or no leftover request can be satisfied by a single hop.

3.1 Alternate Path Selection Strategy

In this work we have used a variant of adaptive routing. Each time a request between an SD pair is to be satisfied, we compute all possible paths between the source (S) and destination (D) and calculate the cost of each path using the cost function

C = (1/W) α + L β    (1)

where α and β are constants, and C, W, and L are the cost of the path, the total number of common free wavelengths along the physical path, and the length of the path (i.e., the distance between the SD pair), respectively. The first term dominates the second in determining the cost. This is done so that the traffic load is distributed over the network and no single path becomes congested.

3.2 Traffic Grooming Algorithm

The proposed algorithm selects the minimum cost path dynamically, such that the traffic load on the network gets distributed and no particular path gets congested.

Algorithm TG2
1. Sort the requests in descending order of OC-x.
2. Select the sorted requests one by one, build the virtual topology, and satisfy single-hop traffic:
(a) Find all possible paths from source (S) to destination (D) for a request R.
(b) Calculate the cost of each path using equation (1) and thus find the minimum cost path for the SD pair.

(c) Find the lowest-index wavelength (w) among the common free wavelengths on the edges of the minimum cost path.
(d) Update the virtual topology, assigning a lightpath from node S to node D using wavelength w.
(e) Update the request matrix:
if the request is fully satisfied then set the SD pair entry in the request matrix to zero (0);
else set the SD pair entry in the request matrix to the leftover (unsatisfied) request.
end if
(f) Update the wavelength status of the physical edges present in the lightpath.
(g) Update the transceiver-used status at the nodes:
if the node is the starting node then reduce the transmitter count by 1; end if
if the node is the ending node then reduce the receiver count by 1. end if
3. Repeat Step 2 for all SD pair requests.
4. Sort the blocked requests in descending order.
5. Select the sorted requests one by one and try to satisfy them using multiple lightpaths on the virtual topology (VT) created in Step 2:
(a) Update the request matrix:
if the request is fully satisfied then set the SD pair entry in the request matrix to zero (0);
else set the SD pair entry in the request matrix to the leftover (unsatisfied) request.
end if
(b) Update the wavelength status of the physical edges present in the lightpaths.
(c) Update the virtual topology.
6. Sort the blocked requests again in descending order.
7. Try to satisfy the requests one by one with a single hop, in descending order, until one request is satisfied.
8. Update the system as described in Steps 2(d) to 2(g).
9. When one request is satisfied with a single hop, try to satisfy all remaining requests with multiple hops.
10. Update the system as described in Steps 5(a) to 5(c).
11. Repeat Steps 6 to 10 until all resources are exhausted, all requests are satisfied, or no leftover request can be satisfied by a single hop.

End Algorithm TG2
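The cost function of equation (1) and the minimum-cost path selection of step 2(b) can be sketched as follows (an illustrative reimplementation, not the authors' code; the candidate-path representation is an assumption):

```python
ALPHA, BETA = 10, 0.3   # the constants used in the simulations of Section 4

def path_cost(w, l, alpha=ALPHA, beta=BETA):
    """Equation (1): C = (1/W) * alpha + L * beta, where W is the number of
    common free wavelengths on the path and L is the path length."""
    if w == 0:
        return float("inf")               # no common free wavelength: unusable
    return alpha / w + beta * l

def select_path(candidates):
    """candidates: list of (path, W, L) tuples; return the minimum-cost path."""
    return min(candidates, key=lambda c: path_cost(c[1], c[2]))[0]

# A short congested path loses to a longer, lightly loaded one, which is
# exactly how the load gets spread over the network.
best = select_path([(["S", "A", "D"], 1, 2),        # W = 1: cost 10.6
                    (["S", "B", "C", "D"], 5, 3)])  # W = 5: cost 2.9
print(best)  # → ['S', 'B', 'C', 'D']
```

Because the 1/W term dominates, a path loses attractiveness as its free wavelengths are consumed, which is the load-balancing effect described above.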


4


Experimental Results

We have evaluated the performance of our proposed heuristic TG2 for the GRWA problem using simulation and compared the results with the well-known MST algorithm [6]. We conducted our experiments on different network topologies, but due to page limitations we present results only for the 14-node NSFNET shown in Fig. 1. The values of α and β in equation (1) are taken to be 10 and 0.3, respectively. We assume that each physical link is bidirectional, with the same length in both directions. During simulation we assumed that the capacity of each wavelength is OC-48; the allowed traffic bandwidth requests were OC-1, OC-3, and OC-12, generated randomly.

Fig. 1. Node architectures used in simulation (14-node NSFNET topology)

Fig. 2. Relationship between throughput and requested bandwidth (TxRx = 6, W = 7)

Fig. 3. Relationship between throughput and number of wavelengths per fiber link (Req = OC-3000, TxRx = 7)

Fig. 4. Relationship between throughput and number of transceivers per node (Req = OC-3000, W = 7)

Figure 2 shows the relationship between network throughput and total requested bandwidth for the 14-node network (Fig. 1). Initially, the performance (in terms of throughput) of both algorithms is similar, but subsequently TG2 returns a better throughput than MST.


The relationship between network throughput and the number of wavelengths per link for the two algorithms is shown in Fig. 3. We observe that the proposed TG2 algorithm provides a higher network throughput than the existing MST algorithm. The throughput increases with the number of wavelengths, but due to the transceiver constraint there is no significant change in throughput after the number of wavelengths reaches a certain limit, for both algorithms. The relationship between network throughput and the number of transceivers per node for the proposed and existing algorithms is shown in Fig. 4. We observe that the throughput initially increases with the number of transceivers, and that there is no significant change in throughput once the number of transceivers is increased beyond a certain value, because the capacity of the wavelengths is exhausted. In both cases, the proposed TG2 algorithm performs better in terms of network throughput than the existing MST algorithm.

5

Conclusions

This study addressed the traffic grooming problem in WDM mesh networks. We have studied the static single-hop and multi-hop GRWA problem with the objective of maximizing network throughput for wavelength-routed mesh networks. We have proposed an algorithm, TG2, using the concept of single-hop and multi-hop grooming for the static GRWA problem [6]. The performance of our proposed algorithm is evaluated through extensive simulation on different sets of traffic demands with different bandwidth granularities under different network topologies.

References

1. Bahri, A., Chamberland, S.: A global approach for designing reliable WDM networks and grooming the traffic. Computers & Operations Research 35(12), 3822–3833 (2008)
2. De, T., Pal, A., Sengupta, I.: Traffic grooming, routing, and wavelength assignment in an optical WDM mesh networks based on clique partitioning. Photonic Network Communications (February 2010)
3. Mohan, G., Murthy, C.S.: WDM optical networks: concepts, design and algorithms. Prentice Hall, India (2001)
4. Yoon, Y., Lee, T., Chung, M., Choo, H.: Traffic grooming based on shortest path in optical WDM mesh networks. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3516, pp. 1120–1124. Springer, Heidelberg (2005)
5. Zang, H., Jue, J., Mukherjee, B.: A review of routing and wavelength assignment approaches for wavelength-routed optical WDM networks. SPIE Optical Networks Magazine 1(1), 47–60 (2000)
6. Zhu, K., Mukherjee, B.: Traffic grooming in an optical WDM mesh network. IEEE Journal on Selected Areas in Communications 20(1), 122–133 (2002)

Analysis of a Simple Randomized Protocol to Establish Communication in Bounded Degree Sensor Networks

Bala Kalyanasundaram and Mahe Velauthapillai

Department of Computer Science, Georgetown University, Washington DC, USA
[email protected], [email protected]

Abstract. Co-operative computations in a network of sensor nodes rely on established, interference-free, and repetitive communication between adjacent sensors. This paper analyzes a simple randomized and distributed protocol to establish a periodic communication schedule S in which each sensor broadcasts once to communicate with all of its neighbors during each period of S. The result obtained holds for any bounded degree network. The existence of such randomized protocols is not new. Our protocol reduces the number of random bits and the number of transmissions by individual sensors from Θ(log^2 n) to O(log n), where n is the number of sensor nodes. These reductions conserve power, which is a critical resource. Both protocols assume an upper bound on the number of nodes n and the maximum number of neighbors B. For a small multiplicative (i.e., a factor ω(1)) increase in the resources, our algorithm can operate without an upper bound on B.

1

Introduction

A wireless sensor network (WSN) is a network of devices called sensor nodes that communicate wirelessly. WSNs are used in many applications, including environment monitoring, traffic management, and wild-life monitoring [1,2,4,7,8,9,5]. Depending on the application, a WSN can consist of anywhere from a few nodes to millions of nodes. The goal of the network is to monitor the environment continuously in order to detect and/or react to certain predefined events or patterns. When an application requires millions of nodes, individually programming each node is impractical. Moreover, when the nodes are deployed, it is often difficult to control the exact location of each sensor. Even if we succeed in spreading the sensors evenly, it is inevitable that some nodes will fail, and the resulting topology is no longer uniform. It may be reasonable to assume that the nodes know an upper bound on the number of nodes in the network, but nothing else about the network. This paper analyzes the performance of a randomized and distributed protocol that establishes communication among neighbors in a bounded degree network of sensors. We assume that B is a constant.

Supported in part by Craves Family Professorship. Supported in part by McBride Chair.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 269–280, 2011. © Springer-Verlag Berlin Heidelberg 2011


The following wireless transmission model is considered in this paper. A node cannot transmit and receive information simultaneously. Each node has a transmission range r, and any node within that range can receive information from this node. Each node A also has an interference range r+: a transmission from a node C to any node B within the range r+ of A can interfere with a transmission from node A. In general r+ ≥ r. For ease of presentation, we assume r+ = r; however, the proofs extend easily to r+ ≥ r as long as the number of nodes in the interference range is big-oh of the number of nodes in the transmission range. In the literature on wireless/sensor networks, there are many different media access control (MAC) protocols. In general, the protocols fall into three categories [6]: fixed assignment, demand assignment, and random access. The protocol we present is a fixed assignment media access protocol; however, the protocol we use to derive it is a random access protocol, in which time is divided into slots of equal length. Intuitively, the sensors first use randomization to establish a schedule for pair-wise communication with their neighbors. The sensors then run the second phase of the protocol, in which the schedule is compressed so that each sensor broadcasts once to communicate with all of its neighbors. After the compression, the resultant protocol is a fixed assignment protocol that can be used by the sensor network to communicate and detect patterns. We consider a uniform transmission range for the sensors. One can view the network of sensors as a graph where each node is a sensor and there is an edge between two nodes if they are within transmission range of each other. The resultant graph is often called a disk graph (DG), and a unit-disk graph (UDG) when the transmission range is the same for all sensors.
The problem addressed in this paper can be thought of as the problem of finding an interference-free communication schedule for a given UDG, where the graph is unknown to the individual nodes. Gandhi and Parthasarathy [3] considered this problem and proposed a natural distance-2-coloring-based randomized and distributed algorithm to establish an interference-free transmission schedule. Each node in their network runs Θ(log^2 n) rounds of transmissions and uses Θ(log^2 n) random bits to establish the schedule. Comparing the two protocols, it is interesting to note that our protocol exhibits better performance. The major difference between the two approaches is in the way we split the protocol into two phases: pair-wise communication is established in the first phase, and compression takes place in the second. Our protocol reduces the number of transmissions as well as the number of random bits to O(log n). Moreover, the number of bits transmitted by our protocol is O(log n) per node, whereas the protocol by Gandhi and Parthasarathy uses O(log^2 n) bits per node. These reductions conserve power, a critical resource in sensor networks. It is worth noting that the running time of both protocols is O(log^2 n), where each transmission is considered an O(1) step. Let b be the maximum number of neighbors of any node. The CDSColor protocol explicitly uses an upper bound on b and on the number of nodes in the graph. After

Analysis of a Simple Randomized Protocol

271

Table 1. Comparing Algorithms

Case                         CDSColor (see [3])   Our Alg.
Random Bits                  O(log^2 n)           O(log n)
# of Transmissions           O(log^2 n)           O(log n)
Bits Transmitted per Node    O(log^2 n)           O(log n)
Number of Steps              O(log^2 n)           O(log^2 n)

the first phase of our protocol, each node will know the exact number of its neighbors with high probability. In order to increase the confidence/probability of total communication, the length of the first phase will be set to O(log n), where the constant in the big-oh depends on b. If we do not have a clear upper bound on b, the length of the first phase can be ω(log n) (e.g., O(log n log log n)). By increasing the number of transmissions in the first phase, our protocol establishes communication with high probability (i.e., 1 − 1/n^c for any given constant c). Our analysis of the first phase of the algorithm uses a recurrence relation to derive the exact probability of establishing pair-wise communication. One can write a simple program to calculate this probability accurately. Using the probability of pair-wise communication, we can find the expected number of transmissions needed to establish communication. The bound obtained from the recurrence relation closely matches the value observed in simulation. For instance, the number of transmissions needed to establish communication between every pair of neighbors in an entire network with a million nodes is around 400. For an incredibly large network, n = 10^50, the bound on the number of transmissions is less than 3000. From this observation, we can safely say that our protocol does not need to know n for any real-life sensor network.

Definition 1. Given a sensor network G = (V, E), we define H1(v) to be the set of nodes in V that are either an immediate neighbor or a neighbor's neighbor (i.e., 1 hop away from a neighbor) of node v. For ease of presentation, we refer to {v} ∪ H1(v) as the 1-hop neighborhood of v.
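Definition 1 can be made concrete with a small sketch (the adjacency-set representation is an assumption):

```python
# Sketch of Definition 1: H1(v) contains the immediate neighbors of v and
# the neighbors' neighbors, with v itself excluded.

def h1(adj, v):
    """adj maps each node to the set of its neighbors."""
    one_hop = set(adj[v])
    two_hop = set()
    for u in one_hop:
        two_hop |= adj[u]
    return (one_hop | two_hop) - {v}

adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}   # a 4-node line network
print(h1(adj, 1))        # → {2, 3}
print({2} | h1(adj, 2))  # the 1-hop neighborhood {v} ∪ H1(v) of node 2
```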

2

First Phase - Establishing Pair-Wise Communication

Each sensor node v selects an id that should be unique among the ids of the nodes in {v} ∪ H1(v). This can be accomplished with high probability by selecting c log n random bits as the id, where c ≥ 1 is a carefully chosen constant. This is captured in the following lemma, which is straightforward to establish.

Lemma 1. Suppose there are n nodes in the network and for each node v we have |H1(v)| ≤ c, a fixed constant. Each node chooses c1 log n random bits as its id, where c1 ≥ 1. The probability that every node in the network chooses an id that is unique in its 1-hop neighborhood is at least 1 − c/n^(c1−1).
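The id-selection step of Lemma 1 can be sketched as follows (an illustration; the constant c1 = 2 and the seed are arbitrary choices, and Lemma 1 only requires uniqueness inside each 1-hop neighborhood, not globally):

```python
import math
import random

def choose_ids(n, c1=2, seed=0):
    """Each of n nodes picks c1 * log2(n) random bits as its id (Lemma 1)."""
    rng = random.Random(seed)
    bits = max(1, math.ceil(c1 * math.log2(n)))
    return [rng.getrandbits(bits) for _ in range(n)]

ids = choose_ids(1000)               # 20-bit ids for n = 1000
print(len(ids), max(ids) < 2 ** 20)  # → 1000 True
```

With |H1(v)| ≤ c, a neighborhood holds at most c + 1 of these ids, so O(log n) bits keep the local collision probability polynomially small.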


After establishing an id, each node executes the following simple protocol for c2 log n steps, where c2 ≥ 1 is a constant. The choice of c2 depends on the confidence parameter in the high probability argument; this will become clear when we present the analysis of the protocol.

TorL(p): (one step) Toss a biased coin with probability p of heads and (1 − p) of tails. Transmit the node's id if the outcome is heads; listen if the outcome is tails.

We could present the analysis of this protocol for an arbitrary bounded degree network now, but we choose to consider both the line topology and the grid topology before presenting the arbitrary case. There are two reasons for this choice. First, the analysis is somewhat more exact in the simpler cases. Second, we ran simulations for the grid topology to see the effectiveness of the protocol for reasonably large networks; our simulations and calculations showed that c2 log n is only around 3000 even for a network of size n = 10^50.
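A minimal simulation of TorL(p) on a line topology (an illustrative sketch, not the authors' simulator; p = 1/3, which maximizes α = 2p(1 − p)^2 here, and the seed are arbitrary choices):

```python
import random

def torl_line(n, p, steps, seed=1):
    """Simulate TorL(p) on n nodes placed on a line; return the number of
    unsuccessful nodes, i.e. nodes that never heard from some neighbor."""
    rng = random.Random(seed)
    heard = [set() for _ in range(n)]                  # neighbors heard from
    for _ in range(steps):
        tx = [rng.random() < p for _ in range(n)]      # heads -> transmit
        for v in range(n):
            if tx[v]:
                continue                               # cannot listen while transmitting
            talkers = [u for u in (v - 1, v + 1) if 0 <= u < n and tx[u]]
            if len(talkers) == 1:                      # exactly one talker -> no collision
                heard[v].add(talkers[0])
    return sum(1 for v in range(n)
               if heard[v] != {u for u in (v - 1, v + 1) if 0 <= u < n})

# With p = 1/3, alpha = 2p(1-p)^2 = 8/27; a few hundred steps suffice for a
# thousand-node line (the paper quotes around 400 for a million-node network).
print(torl_line(1000, 1 / 3, 400))
```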

2.1 Line Topology

The analysis of the first phase of the protocol contains both a high probability argument and an expected case argument. The bounds we obtain from the two arguments do not differ in the asymptotic sense. However, our simulation results for the grid network show that the expected value given by the recurrence relation matches very closely the value observed in simulation. Hence, we choose to present both arguments.

Definition 2. After running a protocol to establish communication, we say a sensor node is unsuccessful if it failed to receive information (that is, the id of the neighbor) from one of its neighbor nodes.

Theorem 1. Let p be a positive real in the range (0, 1/2] and let b = 1/(1 − p(1 − p)^2). Suppose n sensor nodes are uniformly distributed on a line. Assume omnidirectional transmission, where the signal from each node reaches both neighbors on the line.
1. After c2 log n steps, the probability that there exists an unsuccessful sensor is at most 1/n^d, where d ≥ 1 is any fixed constant.
2. The number of steps of TorL(p) needed so that the expected number of unsuccessful sensor nodes is less than 1 is (1 + log_b(2n)).

Proof. Consider a sensor node with a neighbor on each side. For i = 0, 1, and 2, let R(i, k) be the probability that the sensor node has successfully received information from exactly i neighbors on or before k rounds. According to protocol TorL(p), p is the probability that a sensor node transmits at each time step. So, in order to receive information at time t, a sensor node must not transmit at t, and one neighbor must be transmitting while the other is not. Let α = 2p(1 − p)(1 − p) = 2p(1 − p)^2 be the probability that a sensor node successfully receives information from one of its neighbors in a given step. Since coin tosses are independent, we can express R(i, k) in the form of the recurrence relation shown below:


R(0, k) = (1 − α)^k
R(1, k) = R(0, k − 1) α + R(1, k − 1)(1 − α/2) = (1 − α)^(k−1) α + R(1, k − 1)(1 − α/2), with R(1, 0) = 0
R(2, k) = R(1, k − 1)(α/2) + R(2, k − 1), with R(2, 1) = 0.

We now show some steps in solving the recurrence relation:

R(1, k) = (1 − α)^(k−1) α + (1 − α/2)(1 − α)^(k−2) α + ... + (1 − α/2)^(k−1) α
        = α (1 − α)^(k−1) Σ_{i=0}^{k−1} [(1 − α/2)/(1 − α)]^i
        = 2 [(1 − α/2)^k − (1 − α)^k].

Expanding R(2, k) recursively and substituting for R(1, j) results in:

R(2, k) = (α/2) Σ_{j=1}^{k−1} R(1, j)
        = (α/2) Σ_{j=1}^{k−1} 2 [(1 − α/2)^j − (1 − α)^j]
        = α [Σ_{j=1}^{k−1} (1 − α/2)^j − Σ_{j=1}^{k−1} (1 − α)^j]
        = 2 [1 − (1 − α/2)^k] − [1 − (1 − α)^k]
        = 1 − [2 (1 − α/2)^k − (1 − α)^k].

The probability that a node is not successful after k steps is at most 1 − R(2, k) = 2(1 − α/2)^k − (1 − α)^k ≤ 2(1 − α/2)^k. We now find a bound on k such that 2(1 − α/2)^k ≤ 1/n^(d+1) for the given d. Simplifying, we get 2n^(d+1) ≤ (2/(2 − α))^k. Simple algebra shows that this holds for k ≥ [(d + 1)/log_2(2/(2 − α))] log_2(2n). Choose the smallest constant c2 such that c2 log n ≥ [(d + 1)/log_2(2/(2 − α))] log_2(2n). So, if each node runs the protocol for c2 log n steps, the probability that it will be unsuccessful is at most 1/n^(d+1). There are n nodes in the network. Therefore, the probability that there exists an unsuccessful node is at most n × 1/n^(d+1) = 1/n^d.

We define a random variable β(i, k) = 1 if node i received information from all of its neighbors on or before step k, and β(i, k) = 0 otherwise. Suppose there are n sensor nodes in the network. Using R(2, k), we get that the expected value of β(i, k) is E[β(i, k)] = R(2, k) for each 1 ≤ i ≤ n. Observe


that whether a sensor node receives information from a neighbor depends not only on the random bits of that sensor node but also on the random bits of its neighbor nodes. As a result, the random variables β(i, k) are not independent. But applying linearity of expectation we get

E[Σ_{i=1}^{n} β(i, k)] = Σ_{i=1}^{n} E[β(i, k)] = n R(2, k).

Therefore, the number of steps k needed to reach E[Σ_{i=1}^{n} β(i, k)] > n − 1 is given by the inequality n R(2, k) > n − 1. By substituting the bound on R(2, k), it suffices to satisfy the inequality 1 − [2(1 − α/2)^k − (1 − α)^k] > (n − 1)/n, that is, 2(1 − α/2)^k − (1 − α)^k < 1/n. For α ≥ 0 we have (1 − α/2)^k > (1 − α)^k, hence it suffices to satisfy (1 − α/2)^k < 1/(2n), or 2n < [2/(2 − α)]^k. This happens for k = 1 + log_b(2n), where b = 2/(2 − α). Observe that if α increases then k decreases.
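The recurrence for R(i, k) and the closed form derived in the proof can be checked numerically (a verification sketch, not part of the paper; p = 1/3 is an arbitrary choice in the allowed range):

```python
def r_closed(k, alpha):
    """Closed form from the proof: R(2,k) = 1 - [2(1 - alpha/2)^k - (1 - alpha)^k]."""
    return 1 - (2 * (1 - alpha / 2) ** k - (1 - alpha) ** k)

def r_recurrence(k, alpha):
    """R(2,k) computed directly from the recurrence of Theorem 1's proof."""
    r0, r1, r2 = 1.0, 0.0, 0.0          # at k = 0 nothing has been received
    for _ in range(k):
        r0, r1, r2 = (r0 * (1 - alpha),
                      r0 * alpha + r1 * (1 - alpha / 2),
                      r1 * (alpha / 2) + r2)
    return r2

p = 1 / 3
alpha = 2 * p * (1 - p) ** 2            # probability of hearing one neighbor in a step
for k in (1, 10, 100):
    assert abs(r_recurrence(k, alpha) - r_closed(k, alpha)) < 1e-12
print(r_recurrence(100, alpha) > 0.999999)  # → True
```

This is the kind of "simple program to calculate this probability accurately" mentioned in the introduction.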

2.2 Technical Lemmas

The following results and recurrence relation will help us establish the expected time needed to establish pair-wise communication between all adjacent sensors in a general bounded degree network.

Lemma 2. Let 0 < p < 1, integer b ≥ 1, and α = bp(1 − p)^b. The maximum value of α is (b/(b + 1))^(b+1) ≤ 1/2, and it occurs when p = 1/(b + 1).

Proof. Differentiating the function bp(1 − p)^b with respect to p and setting the derivative equal to 0, we get zeros at p = 0, 1, and 1/(b + 1). Differentiating again, it is not hard to verify that the maximum occurs at p = 1/(b + 1) and the maximum value is (b/(b + 1))^(b+1).

Lemma 3. Let 0 < p < 1, integer b ≥ 2, α = bp(1 − p)^b < 1, and integers i, k ≥ 0. Given the following recurrence relation and boundary conditions:

R(0, k) = (1 − α)^k                                                                  (k ≥ 0)
R(i, k) = 0                                                                          (i > k)
R(i, k) = [((b + 1) − i)/b] α R(i − 1, k − 1) + [1 − ((b − i)/b) α] R(i, k − 1)      ((0 ≤ i ≤ b − 1) ∧ (k ≥ i))

the following hypothesis I(i, k) is true for (0 ≤ i ≤ b − 1) ∧ (k ≥ i), where C(k, i) denotes the binomial coefficient:

R(i, k) ≤ C(k, i) [(b − 1)!/(b − i)!] [α^i / b^(i−1)] [1 − ((b − i)/b) α]^(k−i).

Proof. We will prove this by double induction (i.e., on i and k) where the desired inequality is the inductive hypothesis.

Analysis of a Simple Randomized Protocol

The base case, i = 0, is easy to verify. Observe that

C(k, 0) · (b−1)!/b! · (α^0/b^{−1}) · (1 − α)^k = (1 − α)^k,

and R(0, k) is defined to be equal to (1 − α)^k, so the base case holds. Assume that the hypothesis holds for all i such that i ≤ x ≤ b − 1 and for all k ≥ i. We will show that the hypothesis holds for i = x + 1 ≤ b − 1 and for all k ≥ i. This is again proved by induction on k ≥ i = x + 1. The base case of this induction is k = i = x + 1, where the claimed bound is

R(x+1, x+1) ≤ C(x+1, x+1) · (b−1)!/(b−(x+1))! · α^{x+1}/b^x · (1 − ((b−(x+1))/b) α)^{(x+1)−(x+1)} = (b−1)!/(b−(x+1))! · α^{x+1}/b^x.

Now from the recurrence relation:

R(x+1, x+1) = ((b+1)−(x+1))/b · α R(x, x) + (1 − ((b−(x+1))/b) α) R(x+1, x)
            = ((b−x)/b) α R(x, x) + (1 − ((b−(x+1))/b) α) R(x+1, x).

Substituting R(x, x) ≤ (b−1)!/(b−x)! · α^x/b^{x−1} and R(x+1, x) = 0, we get

R(x+1, x+1) ≤ ((b−x)/b) α · (b−1)!/(b−x)! · α^x/b^{x−1} = (b−1)!/(b−(x+1))! · α^{x+1}/b^x.

Hence the base case is true. For the inductive step, we assume that the hypothesis I(i, k) holds for (i ≤ x ≤ b − 1 and i ≤ k) or (i = x + 1 ≤ b − 1 and i ≤ k ≤ y). We will prove that the hypothesis holds for i = x + 1 and k = y + 1, that is, I(x+1, y+1) is also true. Hypothesis I(x+1, y) states that

R(x+1, y) ≤ C(y, x+1) · (b−1)!/(b−(x+1))! · α^{x+1}/b^x · (1 − ((b−(x+1))/b) α)^{y−(x+1)}.

Now consider the recurrence relation with i = x + 1 and k = y + 1:

R(x+1, y+1) = ((b+1)−(x+1))/b · α R(x, y) + (1 − ((b−(x+1))/b) α) R(x+1, y)
            = ((b−x)/b) α R(x, y) + (1 − ((b−(x+1))/b) α) R(x+1, y).

Substituting for R(x+1, y) from hypothesis I(x+1, y), we have

R(x+1, y+1) ≤ ((b−x)/b) α R(x, y) + C(y, x+1) · (b−1)!/(b−(x+1))! · α^{x+1}/b^x · (1 − ((b−(x+1))/b) α)^{(y+1)−(x+1)}.

Substituting for R(x, y) from hypothesis I(x, y), we have

((b−x)/b) α R(x, y) ≤ ((b−x)/b) α · C(y, x) · (b−1)!/(b−x)! · α^x/b^{x−1} · (1 − ((b−x)/b) α)^{y−x}
                    = C(y, x) · (b−1)!/(b−(x+1))! · α^{x+1}/b^x · (1 − ((b−x)/b) α)^{y−x}.

Note that (1 − ((b−x)/b) α) ≤ (1 − ((b−(x+1))/b) α); using this in the above expression:

R(x+1, y+1) ≤ [C(y, x) + C(y, x+1)] · (b−1)!/(b−(x+1))! · α^{x+1}/b^x · (1 − ((b−(x+1))/b) α)^{y−x}
            = C(y+1, x+1) · (b−1)!/(b−(x+1))! · α^{x+1}/b^x · (1 − ((b−(x+1))/b) α)^{(y+1)−(x+1)}.

Hence the result.

Lemma 4. Let 0 < p < 1, let b ≥ 2 be an integer, and let α = bp(1 − p)^b. Given the recursive definition of R(i, k) for 0 ≤ i < b and k ≥ 0, let R(b, k) = 1 − Σ_{i=0}^{b−1} R(i, k). There exist constants c > 0 and 1 − α/b < ε < 1 such that for every integer k ≥ c we have R(b, k) ≥ 1 − ε^k. That is, lim_{k→∞} R(b, k) = 1, and the convergence rate is exponential in k. The constants ε and c depend on the constant b.

Proof. From Lemma 2, we have α ≤ 1/2. Applying Lemma 3, for (0 ≤ i ≤ b − 1) ∧ (k ≥ i) we get

R(i, k) ≤ C(k, i) · (b−1)!/(b−i)! · α^i/b^{i−1} · (1 − ((b−i)/b) α)^{k−i}
        ≤ α^i k^i (1 − α/b)^{k−i}        (since α ≤ 1/2 and i ≤ b − 1)
        ≤ k^{b−1} (1 − α/b)^k
        = (1 − α/b)^{k − (b−1) log k/(log b − log(b−α))}.

Recall that R(0, k) = (1 − α)^k. Hence, lim_{k→∞} R(i, k) = 0 for 0 ≤ i ≤ b − 1. Substituting the bound for R(i, k), we get R(b, k) ≥ 1 − b(1 − α/b)^{k−O(log k)}, which converges to 1 exponentially. Observe that there exist constants 1 − α/b < ε < 1 and c > 0 such that for k ≥ c, we have b(1 − α/b)^{k−O(log k)} ≤ ε^k. As a consequence we get R(b, k) ≥ 1 − ε^k.
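The recurrence and the bounds of Lemmas 2-4 can be checked numerically. The following sketch (an illustration, not part of the paper) evaluates R(i, k) exactly by dynamic programming and verifies the hypothesis I(i, k) and the convergence of R(b, k) for one choice of b and p.

```python
from math import comb, factorial

def R_table(b, p, kmax):
    """Exact values of R(i, k), 0 <= i <= b-1, from the recurrence of Lemma 3."""
    alpha = b * p * (1 - p) ** b
    R = [[0.0] * (kmax + 1) for _ in range(b)]
    for k in range(kmax + 1):
        R[0][k] = (1 - alpha) ** k
    for i in range(1, b):
        for k in range(i, kmax + 1):   # R(i, k) = 0 for k < i stays at 0.0
            R[i][k] = ((b + 1 - i) / b) * alpha * R[i - 1][k - 1] \
                      + (1 - ((b - i) / b) * alpha) * R[i][k - 1]
    return alpha, R

b, p, kmax = 4, 1 / 5, 200                 # p = 1/(b+1) maximizes alpha (Lemma 2)
alpha, R = R_table(b, p, kmax)
assert abs(alpha - (b / (b + 1)) ** (b + 1)) < 1e-12 and alpha <= 0.5   # Lemma 2

# Lemma 3: R(i,k) <= C(k,i) (b-1)!/(b-i)! alpha^i / b^(i-1) (1 - (b-i)alpha/b)^(k-i)
for i in range(b):
    for k in range(i, kmax + 1):
        bound = (comb(k, i) * factorial(b - 1) / factorial(b - i)
                 * alpha ** i / b ** (i - 1)
                 * (1 - (b - i) / b * alpha) ** (k - i))
        assert R[i][k] <= bound + 1e-9     # tolerance for float rounding

# Lemma 4: R(b,k) = 1 - sum_i R(i,k) increases towards 1
Rb = lambda k: 1 - sum(R[i][k] for i in range(b))
assert Rb(10) < Rb(50) < Rb(kmax) and Rb(kmax) > 0.999
```

The dynamic program is exact, so any violation of the assertions would indicate a transcription error in the recurrence rather than numerical noise.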

2.3 Arbitrary Bounded Degree Topology

Looking carefully at the protocol and proof techniques of the previous section, it becomes clear that they can be extended to an arbitrary bounded-degree sensor network.

Theorem 2. Let G be an arbitrary sensor network with n nodes in which each node has at most b neighbors, where b is a constant. Let p be a positive real in the range (0, 1/2]. Each node repeatedly runs TorL(p).
1. After c3 log n steps, the probability that there exists an unsuccessful sensor is at most 1/n^d, where d ≥ 1 is any fixed constant and c3 is a positive constant that depends on d.
2. The number of steps needed so that the expected number of unsuccessful sensors is less than 1 is c4 log n, where c4 is another constant.
3. The number of bits transmitted by a node is O(log^2 n), and the number of random bits used by a node is O(log n).


Proof. Suppose a node x has a neighbors, where 1 ≤ a ≤ b. Let R(i, k) for 0 ≤ i ≤ a be the probability that the sensor node has received information from exactly i neighbors at the end of k rounds. This probability obeys the following recurrence relation:

R(0, k) = (1 − α)^k                                                     (k ≥ 0)
R(i, k) = 0                                                             (i > k)
R(i, k) = ((a+1−i)/a) α R(i−1, k−1) + (1 − ((a−i)/a) α) R(i, k−1)       when (0 ≤ i ≤ a − 1) ∧ (k ≥ i).

Applying Lemma 4, we know R(a, k) ≥ 1 − ε^k, where ε is a positive constant less than 1. Therefore, the probability that the sensor is unsuccessful after k rounds is at most ε^k = 2^{−k log_2(1/ε)}. Substituting k = c3 log n, we get ε^k = 1/n^{c3 log_2(1/ε)}. Choose c3 such that c3 log_2(1/ε) ≥ d + 1. For this choice of c3, the probability that the node is unsuccessful is at most 1/n^{d+1}. Since there are n nodes, the probability that any node is unsuccessful is at most 1/n^d.

Assume that we number the nodes i = 1 through n, and let a_i ≤ b be the number of neighbors of node i. In order to calculate the expected number of steps of TorL(p) needed to have fewer than one unsuccessful node, we define a random variable

β_p(i, k) = 1 if node i received info from all of its neighbors on or before step k, and 0 otherwise.

The expected value of β_p(i, k), denoted E[β_p(i, k)], is equal to R(a_i, k). Observe that R(a, k) ≥ R(b, k) for all a ≤ b. The expected number of nodes in the entire network that receive communication from all of their neighbors after k rounds is E[Σ_{i=1}^n β_p(i, k)]. Applying linearity of expectation, we get

E[Σ_{i=1}^n β_p(i, k)] = Σ_{i=1}^n E[β_p(i, k)] = Σ_{i=1}^n R(a_i, k).

Applying Lemma 4, the number of steps k needed to reach the bound Σ_{i=1}^n E[β_p(i, k)] > n − 1 is given by the inequality n(1 − ε^k) > n − 1. Simplifying, we get ε^k < 1/n, or 2^{k log_2(1/ε)} > n. The result follows if we choose k = c4 log_2 n, where c4 ≥ 1/log_2(1/ε).

Finally, observe that each node uses random bits to select its id and O(1) random bits per step of TorL(p). Since the id is O(log n) bits long, and the number of steps of TorL(p) is also O(log n), the total number of random bits is O(log n). It is easy to see that the number of transmissions per node is O(log n).

2.4 Simulation and Practical Bounds for Grid Network

We ran simulations to estimate the number of steps needed for a large sensor network in practice. For a network with one million sensors, we needed approximately 373 time slots to establish communication between every pair of adjacent


B. Kalyanasundaram and M. Velauthapillai

nodes. This is an average over 20 random runs. For three million nodes, the number of rounds is approximately 395 time slots. Based on our recurrence relation, our calculations for a network of size 10^12 show that communication will be established within 650 time slots or steps. The table below gives the average number of rounds needed to establish communication for different network sizes. Here the probability of transmission is set to p = 1/(B + 1) = 1/(8 + 1) = 1/9, where B = 8 is the number of neighbors of any node.

Table 2. Simulation On Grid Network

Network Size:  360,000  640,000  1,000,000  1,440,000  1,960,000  2,560,000  3,240,000
Avg. # Steps:      342      359        373        385        394        392        395

Table 3. Probability Bounds for Grid - Based on Recurrence Relation

Steps:                        501       1001      1501      2001      2501      3001
Prob. of Failure of a Node:   1.86e-09  4.541e-19 1.10e-28  2.69e-38  6.57e-48  1.601e-57
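The Table 3 entries can be reproduced by evaluating the recurrence of Lemma 3 directly with a = b = 8 and p = 1/9, as in the following sketch (an illustration added here, not the authors' code):

```python
# b = 8 neighbours on the grid, p = 1/(b+1) = 1/9, as in the text above
b, p = 8, 1 / 9
alpha = b * p * (1 - p) ** b                 # = (8/9)^9, about 0.39
kmax = 501
R = [[0.0] * (kmax + 1) for _ in range(b)]
for k in range(kmax + 1):
    R[0][k] = (1 - alpha) ** k
for i in range(1, b):
    for k in range(i, kmax + 1):
        R[i][k] = ((b + 1 - i) / b) * alpha * R[i - 1][k - 1] \
                  + (1 - ((b - i) / b) * alpha) * R[i][k - 1]

# probability that a node has NOT yet heard all 8 neighbours after k rounds
fail = lambda k: sum(R[i][k] for i in range(b))
assert abs(fail(1) - 1.0) < 1e-12            # one round can never cover 8 neighbours
assert fail(501) < fail(101) < fail(51)      # failure probability decays with k
assert fail(501) < 1e-6                      # Table 3 lists ~1.9e-09 at 501 steps
```

The last assertion is deliberately loose; the exact magnitude depends on the recurrence as reconstructed above.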

However, when we set the probability of transmission to p = 1/2, the number of rounds needed to establish communication exceeds 3000 even for a small network of size one hundred. So it is critical to set p close to 1/(B + 1).
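A grid simulation of this kind can be sketched as follows. The collision model is an assumption on our part (a listener receives a message only when exactly one of its neighbors transmits in that slot); the paper's TorL(p) protocol details may differ, so treat this as an illustrative approximation rather than the authors' simulator.

```python
import random

def simulate(rows, cols, p, seed=1, max_rounds=5000):
    """Rounds until every node has heard each of its (up to 8) grid neighbours once."""
    rng = random.Random(seed)
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    nbrs = {v: [] for v in nodes}
    for r, c in nodes:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                    nbrs[(r, c)].append((r + dr, c + dc))
    pending = {(v, u) for v in nodes for u in nbrs[v]}    # v still has to hear u
    for t in range(1, max_rounds + 1):
        tx = {v for v in nodes if rng.random() < p}       # transmit w.p. p, else listen
        for v in nodes:
            if v in tx:
                continue                                  # transmitting, not listening
            talking = [u for u in nbrs[v] if u in tx]
            if len(talking) == 1:                         # no collision at v
                pending.discard((v, talking[0]))
        if not pending:
            return t
    return None

# p = 1/(B+1) = 1/9, the setting recommended by the analysis
assert simulate(10, 10, 1 / 9) is not None
```

On small grids the round counts are lower than Table 2's (which used up to millions of nodes), but the qualitative behaviour, including the blow-up when p = 1/2, is reproduced.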

3 Second Phase: Compression Protocol

Given any schedule that establishes pair-wise communication between neighbors (e.g., the schedule of length O(log n) from the last section), we will show how to compress the schedule to a small constant length in which each node broadcasts once to communicate with its neighbors. After the first run of the protocol TorL(p) for c3 log n steps, each node has already communicated its id to its neighbors with high probability. However, a node does not know when it succeeds in its communication attempts. Let T(x) (resp. L(x)) be the set of transmission (resp. listening) steps of sensor x. Each sensor x runs another c3 log n steps to communicate with its neighbors. In this case, no random bits are used and each sensor x transmits only during the steps in T(x). Each sensor x transmits the following two pieces of information when it transmits during this iteration: 1. The list of ids of its neighbors. 2. For each neighbor y of x, the pair (id of neighbor y, earliest time in L(x) at which x listens to a transmission from y). At the end of this second iteration, each sensor knows its (at most b) neighbors and its neighbors' neighbors. Each sensor also knows at most b transmission times that it must use from now on to communicate with its neighbors during c3 log n steps.

Analysis of a Simple Randomized Protocol

279

It is important to observe that no more random bits are used to run the transmission schedule after the first round. During each round, each sensor must listen at most b times and transmit at most b times, which conserves power. However, the biggest drawback is that the communication between neighbors takes place only once every c3 log n steps. Let us call such a long one-to-one communication schedule Long1to1. After the compression, each node transmits once and listens b times, and communication between neighbors takes place once every O(1) steps.

Compressor: Protocol for a Node
1. Let b be the number of neighbors and ℓ be the number of neighbors' neighbors.
2. Let T = {1, 2, 3, . . . , (b + ℓ + 1)}.
3. Maintain a set AV of available slots; initially AV = T.
4. Repeat the following until a slot is chosen for the node:
   (a) Choose a random number x in AV.
   (b) Run one round of schedule Long1to1 to communicate (id, x) to the neighbors. Let N be the set of pairs (id, x) received from the b neighbors.
   (c) Run one round of schedule Long1to1 to communicate (id, N) to the neighbors. Let M be the set of all random numbers chosen by the neighbors or neighbors' neighbors during this iteration. If x is not in M, then x is set to be the chosen slot for the sensor.
   (d) Run one round of schedule Long1to1 to communicate to the neighbors: transmit (id, x) if x is the chosen slot, and empty otherwise. Let C be the set of pairs (id, x) received from the neighbors.
   (e) Run again one round of schedule Long1to1 to communicate (id, C) to the neighbors. Let P be the set of slot numbers chosen during this round by a neighbor or a neighbors' neighbor. Update AV = AV − P.
End Compressor Protocol

Theorem 3. Suppose there are n nodes in the network. For any d > 0, the probability that a node does not choose a slot after c5 log_e n iterations of the loop at step 4 is at most 1/n^{d+1}, where c5 = 2^{b+ℓ}(d + 1). With probability at least 1 − 1/n^d, every node in the network successfully chooses a slot.
Proof. Consider an arbitrary node z and one iteration of the loop at step 4. Without loss of generality, let 0 ≤ k ≤ b + ℓ be the number of neighbors or neighbors' neighbors of z without a chosen slot for communication. Observe that if there is only one choice, then the node will choose the only remaining slot. Otherwise, each node has at least two choices; hence the probability that any one of them chooses a given number is at most 1/2. So, when a node chooses a slot, the probability that the k other nodes in the neighborhood do not choose that number is at least (1 − 1/2)^k = 1/2^k. Therefore, the probability that z succeeds in choosing a number in one iteration is at least 1/2^{b+ℓ}, since k ≤ b + ℓ. The probability that node z fails to choose a number after c5 log_e n iterations is at most (1 − 1/2^{b+ℓ})^{c5 log_e n}. Set c5 = 2^{b+ℓ}(d + 1), and observe that (1 − 1/2^{b+ℓ})^{2^{b+ℓ}} ≤ 1/e. Therefore, the probability that z fails to choose a number after c5 log_e n iterations is at most 1/e^{(d+1) log_e n} = 1/n^{d+1}. The result follows since there are at most n nodes.
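The slot-selection loop of the Compressor protocol can be sketched in a centralized simulation, with the Long1to1 message rounds replaced by direct reads of the neighbors' state (an assumption made to keep the sketch self-contained):

```python
import random

def compress(adj, seed=7, max_rounds=1000):
    """Distance-2 slot assignment following the Compressor loop."""
    rng = random.Random(seed)
    nodes = list(adj)
    # 2-hop neighbourhood: neighbours plus neighbours' neighbours (b + l nodes)
    two_hop = {v: {u for w in adj[v] for u in adj[w] | {w}} - {v} for v in nodes}
    avail = {v: set(range(1, len(two_hop[v]) + 2)) for v in nodes}  # b + l + 1 slots
    slot = {}
    for _ in range(max_rounds):
        if len(slot) == len(nodes):
            break
        pick = {v: rng.choice(sorted(avail[v])) for v in nodes if v not in slot}
        for v, x in pick.items():
            # keep x only if no 2-hop neighbour picked or already owns the same slot
            if all(pick.get(u) != x and slot.get(u) != x for u in two_hop[v]):
                slot[v] = x
        for v in nodes:
            avail[v] -= {slot[u] for u in two_hop[v] if u in slot}
    return slot

# a 3x3 grid with 8-neighbour (Moore) adjacency; every pair is within 2 hops
nodes = [(r, c) for r in range(3) for c in range(3)]
adj = {v: {u for u in nodes if u != v
           and abs(u[0] - v[0]) <= 1 and abs(u[1] - v[1]) <= 1} for v in nodes}
slot = compress(adj)
assert len(slot) == 9 and len(set(slot.values())) == 9   # interference-free schedule
```

Because a node keeps its pick only when no node within two hops picked or owns the same number, the resulting schedule is interference-free, mirroring the argument in the proof above.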

4 Conclusion

This paper provides a tight analysis of a randomized protocol that establishes a single interference-free broadcast schedule for the nodes of any bounded-degree network. Our protocol is simple, and it reduces the number of random bits and the number of broadcasts from O(log^2 n) to O(log n). Experimental results show that the bounds predicted by the analysis are reasonably accurate.


Reliable Networks with Unreliable Sensors

Srikanth Sastry^1, Tsvetomira Radeva^2, Jianer Chen^1, and Jennifer L. Welch^1

^1 Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77840, USA, {sastry,chen,welch}@cse.tamu.edu
^2 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA, [email protected]

Abstract. Wireless sensor networks (WSNs) deployed in hostile environments suffer from a high rate of node failure. We investigate the effect of such a failure rate on network connectivity. We provide a formal analysis that establishes the relationship between node density, network size, failure probability, and network connectivity. We show that as network size and density increase, the probability of network partitioning becomes arbitrarily small. We show that large networks can maintain connectivity despite a significantly high probability of node failure. We derive mathematical functions that provide lower bounds on network connectivity in WSNs. We compute these functions for some realistic values of node reliability, area covered by the network, and node density, to show that, for instance, networks with over a million nodes can maintain connectivity with a probability exceeding 99% despite a node failure probability exceeding 57%.

1 Introduction

Wireless Sensor Networks (WSNs) [2] are being used in a variety of applications ranging from volcanology [21] and habitat monitoring [18] to military surveillance [10]. Often, in such deployments, premature uncontrolled node crashes are common. The reasons for this include, but are not limited to, the hostility of the environment (such as extreme temperature, humidity, or soil acidity), node fragility (especially if the nodes are deployed from the air onto the ground), and the quality control in the manufacturing of the sensors. Consequently, crash fault tolerance becomes a necessity (not just a desirable feature) in WSNs. Typically, a sufficiently dense node distribution with redundancy in connectivity and coverage provides the necessary fault tolerance. In this paper, we analyze the connectivity fault tolerance of such large-scale sensor networks and show how, despite high unreliability, flaky sensors can build robust networks. The results in this paper address the following questions: Given a static WSN deployment (of up to a few million nodes) where (a) the node density is D nodes

This work was supported in part by NSF grant 0964696 and Texas Higher Education Coordinating Board grant NHARP 000512-0130-2007.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 281–292, 2011. © Springer-Verlag Berlin Heidelberg 2011


S. Sastry et al.

per unit area, (b) the area of the region is Z units, and (c) each node can fail¹ with an independent and uniform probability ρ: what is the probability P that the network is connected (that is, the network is not partitioned)? What is the relationship between P, ρ, D, and Z?

Motivation. The foregoing questions are of significant practical interest. A typical specification for designing a WSN is the area of coverage, an upper bound on the (financial) cost, and QoS guarantees on connectivity (and coverage). High-reliability sensor nodes offer better guarantees on connectivity but also increase the cost. An alternative is to reduce the costs by using less reliable nodes, but the requisite guarantees on connectivity might necessitate a greater node density (that is, a greater number of nodes per unit area), which again increases the cost. As a network designer, it is desirable to have a function that accepts, as input, the specifications of a WSN and outputs feasible and appropriate design choices. We derive the elements of such a function in Sect. 6 and demonstrate the use of the results from Sect. 6 in Sect. 7.

Contribution. This paper has three main contributions. First, we formalize and prove the intuitive conjecture that as the node reliability and/or node density of a WSN increases, the probability of connectivity also increases. We provide a probabilistic analysis of the relationship between node reliability (ρ), node density (D), the area of the WSN region (Z), and the probability of network connectivity (P); we provide lower bounds for P as a function of ρ, D, and Z. Second, we provide concrete lower bounds on the expected connectivity probability for various reasonable values of ρ, D, and Z. Third, we use a new technique of hierarchical network analysis to derive the lower bounds on a non-hierarchical WSN. To our knowledge, we are the first to utilize this approach in wireless sensor networks. The approach, model, and proof techniques themselves may be of independent interest.
Organization. The rest of this paper is organized as follows. Related work is described next, in Section 2. The system model assumptions are discussed in Section 3. Our methodology involves tiling the plane with regular hexagons; the analysis and results in this paper use a topological object called a level-z polyhex that is derived from a regular hexagon. The level-z polyhex is introduced in Section 4. Section 5 introduces the notion of level-z connectedness of an arbitrary WSN region. Section 6 uses this notion to formally establish the relationship between P, ρ, D, and Z. Finally, Section 7 provides lower bounds on connectivity for various values of ρ, D, and Z.

2 Related Work

There is a significant body of work on static analysis of topological issues associated with WSNs [12]. These issues are discussed in the context of coverage [13], connectivity [19], and routing [1].

Node is said to fail if it crashes prior to its intended lifetime. See Sect. 3 for details.


The results in [19] focus on characterizing the fault tolerance of sensor networks by establishing the k-connectivity of a WSN. However, such a characterization results in a poor lower bound of k − 1 on the fault tolerance, which corresponds to the worst-case behavior of faults; it fails to characterize the expected probability of network partitioning in practical deployments. In other related results, Bhandari et al. [5] focus on the optimal node density (or degree) for a WSN to be connected w.h.p., and Kim et al. [11] consider connectivity in randomly duty-cycled WSNs in which nodes take turns being active to conserve power. A variant of network connectivity, called partial connectivity, is explored in [6], which derives the relationship between node density and the percentage f of the network expected to be connected. Our research addresses a different but related question: given a fixed WSN region with a fixed initial node density (and hence, degree) and a fixed failure probability, what is the probability that the WSN will remain connected? The results in [16,4,22,20,3] establish and explore the relationship between coverage and connectivity. The results in [22] and [20] show that in large sensor networks, if the communication radius rc is at least twice the coverage radius rs, then coverage of a convex area implies connectivity among the non-faulty nodes. In [4], Bai et al. establish optimal coverage and connectivity in regular patterns, including square grids and hexagonal lattices where rc/rs < 2, by deploying additional sensors at specific locations. Results from [16] show that even if rc = rs, large networks in a square region can maintain connectivity despite a high failure probability; however, connectivity does not imply coverage. Ammari et al. extend these results in [3] to show that if rc/rs = 1 in a k-covered WSN, then the network fault tolerance is given by 4rc(rc + rs)k/rs^2 − 1 for a sparse distribution of node crashes.
Another related result [17] shows that in a uniform random deployment of sensors in a WSN covering the entire region, the probability of maintaining connectivity approaches 1 as rc/rs approaches 2. Our work differs from the works cited above in three aspects: (a) we focus exclusively on maintaining total connectivity; (b) while the results in [16,4,22,20] apply to specific deployment patterns or shapes of a region, our results and methodology can be applied to any arbitrary region and any constant node density; and (c) our analysis is probabilistic insofar as node crashes are assumed to be independent random events, and we focus on the probability of network connectivity in the average case instead of the worst case. The tiling used in our model induces a hierarchical structure which can be used to decompose the connectivity property of a large network into connectivity properties of constituent smaller sub-networks of similar structure. This approach was first introduced in [9], and subsequently used to analyze the fault tolerance of hypercube networks [7] and mesh networks [8]. Our approach differs from those in [7] and [8] in that we construct higher-order polyhex tilings from the underlying hexagons to derive a recursive function that establishes a lower bound on network connectivity as a function of ρ and D.


3 System Model

We make the following simplifying assumptions:
– Node. The WSN has a finite fixed set of n nodes. Each node has a communication radius R.
– Region and tiles. A WSN region is assumed to be a finite plane tiled by regular hexagons whose sides are of length l, such that nodes located in a given hexagon can communicate reliably² with all the nodes in the same hexagon and in adjacent hexagons. We assume that each hexagon contains at least D nodes.
– Faults. A node can fail only by crashing before the end of its intended lifetime. Faults are independent, and each node has a constant probability ρ of failing.
– Empty tile. A hexagon is said to be empty if it contains only faulty nodes.
We say that two non-faulty nodes p and p′ are connected if either p and p′ are in the same or neighboring hexagons, or there exists some sequence of non-faulty nodes p_i, p_{i+1}, . . . , p_j such that p (respectively, p′) and p_i (respectively, p_j) are in adjacent hexagons, and p_k and p_{k+1} are in adjacent hexagons for i ≤ k < j. We say that a region is connected if every pair of non-faulty nodes p and p′ in the region is connected.

4 Higher Level Tilings: Polyhexes

For the analysis of WSNs in an arbitrary region, we use the notion of higher-level tilings, obtained by grouping sets of contiguous hexagons into 'super tiles' such that some specific properties (like the ability to tile the Euclidean plane) are preserved. Such 'super tiles' are called level-z polyhexes; different values of z specify different level-z polyhexes. In this section we define a level-z polyhex and specify its properties. The following definitions are borrowed from [14]: A tiling of the Euclidean plane is a countable family of closed sets called tiles, such that the union of the sets is the entire plane and the interiors of the sets are pairwise disjoint. We are concerned only with monohedral tilings, that is, tilings in which every tile is congruent to a single fixed tile called the prototile. In our case, a regular hexagon is a prototile. We say that the prototile admits the tiling. A patch is a finite collection of non-overlapping tiles such that their union is a closed topological disk³. A translational patch is a patch such that the tiling consists entirely of a lattice of translations of that patch.

² We assume that collision resolution techniques are always successful in ensuring reliable communication.
³ A closed topological disk is the image of a closed circular disk under a homeomorphism. Roughly speaking, a homeomorphism is a continuous stretching and bending of the object into a new shape (you are not allowed to tear or 'cut holes' into the object). Thus, any two-dimensional shape that has a closed boundary, finite area, and no 'holes' is a closed topological disk. This includes squares, circles, ellipses, hexagons, and polyhexes.


(a) The gray tiles form a level-2 polyhex.


(b) A level-3 polyhex formed by 7 level-2 polyhexes A–F.

Fig. 1. Examples of Polyhexes

We now define a translational patch of regular hexagons called a level-z polyhex, for z ∈ N, as follows:
– A level-1 polyhex is a regular hexagon: a prototile.
– A level-z polyhex, for z > 1, is a translational patch of seven level-(z−1) polyhexes that admits a hexagonal tiling.
Note that each level-z polyhex is made of seven level-(z−1) polyhexes. Therefore, the total number of tiles in a level-z polyhex is size(z) = 7^{z−1}. Figure 1(a) illustrates the formation of a level-2 polyhex from seven regular hexagons, and Fig. 1(b) illustrates how seven level-2 polyhexes form a level-3 polyhex. A formal proof that such level-z polyhexes exist for arbitrary values of z (in an infinite plane tessellated by regular hexagons) is available at [15].

5 Level-z Polyhexes and Connectivity

The analysis in Section 6 is based on the notion of level-z connectedness that is introduced here. First, we define a 'side' of each level-z polyhex. Second, we introduce the concepts of connected level-z polyhexes and level-z connectedness in a WSN region. Finally, we show how level-z connectedness implies that all non-faulty nodes in a level-z polyhex of a WSN are connected. We use this result and the definition of level-z connectedness to derive a lower bound on the probability of network connectivity in Section 6.

Side. The set of boundary hexagons that are adjacent to a given level-z polyhex is said to be a 'side' of the level-z polyhex. Since a level-z polyhex can have 6 neighboring level-z polyhexes, every level-z polyhex has 6 'sides'. The number of hexagons along each 'side' (also called the 'length of the side') is given by sidelen(z) = 1 + Σ_{i=0}^{z−2} 3^i, where z ≥ 2.⁴

⁴ The proof of this equation is a straightforward induction on z and is omitted.
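For reference, size(z) and sidelen(z) are easy to tabulate (a small illustrative sketch, using the two formulas above):

```python
def size(z):
    """Number of hexagons in a level-z polyhex: 7^(z-1)."""
    return 7 ** (z - 1)

def sidelen(z):
    """Hexagons along one 'side' of a level-z polyhex, z >= 2: 1 + sum_{i=0}^{z-2} 3^i."""
    return 1 + sum(3 ** i for i in range(z - 1))

assert [size(z) for z in (1, 2, 3, 4)] == [1, 7, 49, 343]
assert [sidelen(z) for z in (2, 3, 4)] == [2, 5, 14]
```

For example, a level-3 polyhex contains 49 hexagons and each of its six 'sides' is 5 hexagons long.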


We now define what it means for a level-z polyhex to be connected. Intuitively, we say that a level-z polyhex is connected if the network of nodes in the level-z polyhex is not partitioned.

Connected level-z polyhex. A level-z polyhex Tzi is said to be connected if, given the set Λ of all hexagons in Tzi that contain at least one non-faulty node, for every pair of hexagons p and q from Λ there exists some (possibly empty) sequence of hexagons t1, t2, . . . , tj such that {t1, t2, . . . , tj} ⊆ Λ, t1 is a neighbor of p, every ti is a neighbor of ti+1, and tj is a neighbor of q. Note that if a level-z polyhex is connected, then all the non-faulty nodes in the level-z polyhex are connected as well. We are now ready to define the notion of level-z connectedness in a WSN region.

Level-z connectedness. A WSN region W is said to be level-z connected if there exists some partitioning of W into disjoint level-z polyhexes such that each such level-z polyhex Tzi is connected, each 'side' of each Tzi has at least sidelen(z)/2 non-empty hexagons, and for every pair of such level-z polyhexes Tzp and Tzq there exists some (possibly empty) sequence of (connected) level-z polyhexes Tz1, Tz2, . . . , Tzj (from the partitioning of W) such that Tz1 is a neighbor of Tzp, every Tzi is a neighbor of Tz(i+1), and Tzj is a neighbor of Tzq.

We are now ready to prove the following theorem:

Theorem 1. Given a WSN region W, if W is level-z connected, then all non-faulty nodes in W are connected.

Proof. Suppose that the region W is level-z connected. It follows that there exists some partitioning Λ of W into disjoint level-z polyhexes such that each such level-z polyhex is connected and each of its 'sides' has at least sidelen(z)/2 non-empty hexagons, and for every pair of such level-z polyhexes Tzp and Tzq there exists some (possibly empty) sequence of (connected) level-z polyhexes Tz1, Tz2, . . . , Tzj (from the partitioning of W) such that Tz1 is a neighbor of Tzp, every Tzi is a neighbor of Tz(i+1), and Tzj is a neighbor of Tzq.

To prove the theorem, it is sufficient to show that for any two non-faulty nodes in W in hexagons p and q, respectively, the hexagons p and q are connected. Let hexagon p lie in a level-z polyhex Tzp (∈ Λ), and let q lie in a level-z polyhex Tzq (∈ Λ). Since Λ is a partitioning of W, either Tzp = Tzq or Tzp and Tzq are disjoint. If Tzp = Tzq, then since Tzp is connected it follows that p and q are connected; hence all non-faulty nodes in p are connected with all non-faulty nodes in q, and the theorem is satisfied. If Tzp and Tzq are disjoint, then it follows from the definition of level-z connectedness that there exists some sequence of connected level-z polyhexes Tz1, Tz2, . . . , Tzj such that Tz1 is a neighbor of Tzp, every Tzi is a neighbor of Tz(i+1), and Tzj is a neighbor of Tzq; additionally, each 'side' of each Tzi has at least sidelen(z)/2 non-empty hexagons.

Consider any two neighboring level-z polyhexes (Tzm, Tzn) ∈ Λ × Λ. Each 'side' of Tzm and Tzn has sidelen(z) hexagons. Therefore, Tzm and Tzn have


sidelen(z) boundary hexagons such that each such hexagon from Tzm (respectively, Tzn) is adjacent to two boundary hexagons in Tzn (respectively, Tzm), except for the two boundary hexagons at either end of the 'side' of Tzm (respectively, Tzn); these two hexagons are adjacent to just one hexagon in Tzn (respectively, Tzm). We know that at least sidelen(z)/2 of these boundary hexagons are non-empty. It follows that there exists at least one non-empty hexagon in Tzm that is adjacent to a non-empty hexagon in Tzn. Such a pair of non-empty hexagons (one in Tzm and the other in Tzn) forms a "bridge" between Tzm and Tzn, allowing nodes in Tzm to communicate with nodes in Tzn. Since Tzm and Tzn are connected level-z polyhexes, the nodes within Tzm and within Tzn are connected as well. Additionally, we have established that there exist at least two hexagons, one in Tzm and one in Tzn, that are connected. It follows that the nodes in Tzm and Tzn are connected with each other as well. Thus, Tzp and Tz1 are connected, every Tzi is connected with Tz(i+1), and Tzj is connected with Tzq. From the transitivity of connectedness, it follows that Tzp is connected with Tzq. That is, all non-faulty nodes in hexagon p are connected with all non-faulty nodes in q. Since p and q are arbitrary hexagons in W, it follows that all the nodes in W are connected.

Theorem 1 provides the following insight into the connectivity analysis of a WSN: for appropriate values of z, a level-z polyhex has fewer nodes than the entire region W; in fact, a level-z polyhex could have orders of magnitude fewer nodes than W. Consequently, the analysis of the connectedness of a level-z polyhex is simpler and easier than that of the connectedness of the entire region W. Using Theorem 1, we can leverage such an analysis of a level-z polyhex to derive a lower bound on the connectivity probability of W. This motivation is explored next.

6  On Fault Tolerance of WSN Regions

We are now ready to derive a lower bound on the connectivity probability of an arbitrarily-shaped WSN region. Let W be a WSN region with a node density of D nodes per hexagon, such that the region is approximated by a patch of x level-z polyhexes that constitute a set Λ. Let each node in the region fail independently with probability ρ. Let Conn_W denote the event that all the non-faulty nodes in the region W are connected. Let Conn_(T,z,side) denote the event that a level-z polyhex T is connected and each 'side' of T has at least sidelen(z)/2 non-empty hexagons. We know that if W is level-z connected, then all the non-faulty nodes in W are connected. Also, W is level-z connected if: ∀T ∈ Λ :: Conn_(T,z,side). Therefore, the probability that W is connected is bounded by: Pr[Conn_W] ≥ (Pr[Conn_(T,z,side)])^x. Thus, in order to find a lower bound on Pr[Conn_W], we have to find a lower bound on (Pr[Conn_(T,z,side)])^x.

Lemma 2. In a level-z polyhex T with a node density of D nodes per hexagon, suppose each node fails independently with probability ρ. Then the probability

288

S. Sastry et al.

that T is connected and each 'side' of T has at least sidelen(z)/2 non-empty hexagons is given by Pr[Conn_(T,z,side)] = Σ_{i=0}^{size(z)} N_{z,i} (1 − ρ^D)^{size(z)−i} ρ^{D×i}, where N_{z,i} is the number of ways by which we can have i empty hexagons and size(z) − i non-empty hexagons in a level-z polyhex such that the level-z polyhex is connected and each 'side' of the level-z polyhex has at least sidelen(z)/2 non-empty hexagons.

Proof. Fix i hexagons in T to be empty such that T is connected and each 'side' of T has at least sidelen(z)/2 non-empty hexagons. Since nodes fail independently with probability ρ, and there are D nodes per hexagon, the probability that a hexagon is empty is ρ^D. Therefore, the probability that exactly this choice of i hexagons is empty in T is given by (1 − ρ^D)^{size(z)−i} ρ^{D×i}. By assumption, there are N_{z,i} ways to fix i hexagons to be empty. Therefore, the probability that T is connected and each 'side' of T has at least sidelen(z)/2 non-empty hexagons despite i empty hexagons is given by N_{z,i} (1 − ρ^D)^{size(z)−i} ρ^{D×i}. However, note that we can set i (the number of empty hexagons) to be anything from 0 to size(z). Therefore, Pr[Conn_(T,z,side)] is given by Σ_{i=0}^{size(z)} N_{z,i} (1 − ρ^D)^{size(z)−i} ρ^{D×i}.

Given the probability of Conn_(T,z,side), we can now establish a lower bound for the probability that the region W is connected.

Theorem 3. Suppose each node in a WSN region W fails independently with probability ρ, W has a node density of D nodes per hexagon, and W is tiled by a patch of x level-z polyhexes. Then the probability that all non-faulty nodes in W are connected is at least (Pr[Conn_(T,z,side)])^x.

Proof. There are x level-z polyhexes in W. Note that if W is level-z connected, then all non-faulty nodes in W are connected. Moreover, W is level-z connected if each such level-z polyhex is connected and each 'side' of each such level-z polyhex has at least sidelen(z)/2 non-empty hexagons.
Recall from Lemma 2 that the probability of such an event for each polyhex is given by Pr[Conn_(T,z,side)]. Since there are x such level-z polyhexes, and the failure probability of nodes (and hence of disjoint level-z polyhexes) is independent, it follows that the probability of W being connected is at least (Pr[Conn_(T,z,side)])^x. Note that the lower bound we have established depends on the function N_{z,i} defined in Lemma 2. Unfortunately, to the best of our knowledge, there is no known algorithm that computes N_{z,i} in a reasonable amount of time. Since this is a potentially infeasible approach for large WSNs with millions of nodes, we provide an alternate lower bound for Pr[Conn_(T,z,side)].

Lemma 4. The value of Pr[Conn_(T,z,side)] from Lemma 2 is bounded below by: Pr[Conn_(T,z,side)] ≥ (Pr[Conn_(T,z−1,side)])^7 + (Pr[Conn_(T,z−1,side)])^6 × ρ^{D×size(z−1)}, where Pr[Conn_(T,1,side)] = 1 − ρ^D.

Proof. Recall that a level-z polyhex consists of seven level-(z−1) polyhexes: one internal level-(z−1) polyhex and six outer level-(z−1) polyhexes. Observe


that a level-z polyhex satisfies Conn_(T,z,side) if either (a) all seven level-(z−1) polyhexes satisfy Conn_(T,z−1,side), or (b) the internal level-(z−1) polyhex is empty and the six outer level-(z−1) polyhexes satisfy Conn_(T,z−1,side). From Lemma 2 we know that the probability of a level-(z−1) polyhex satisfying Conn_(T,z−1,side) is given by Pr[Conn_(T,z−1,side)], and the probability of a level-(z−1) polyhex being empty is ρ^{D×size(z−1)}. For a level-1 polyhex (which is a regular hexagonal tile), the probability that the hexagon is not empty is 1 − ρ^D. Therefore, for z > 1, the probability that case (a) or case (b) is satisfied is given by (Pr[Conn_(T,z−1,side)])^7 + (Pr[Conn_(T,z−1,side)])^6 × ρ^{D×size(z−1)}. Therefore, Pr[Conn_(T,z,side)] ≥ (Pr[Conn_(T,z−1,side)])^7 + (Pr[Conn_(T,z−1,side)])^6 × ρ^{D×size(z−1)}, where Pr[Conn_(T,1,side)] = 1 − ρ^D. Analyzing the connectivity probability of WSN regions that are level-z connected for large z can thus be simplified by invoking Lemma 4, reducing the computation to smaller values of z for which Pr[Conn_(T,z,side)] can be computed (by brute force) fairly quickly.
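As a concrete illustration (not from the paper), the recursion of Lemma 4 is straightforward to evaluate, and for a level-2 polyhex (seven hexagons) the summation of Lemma 2 can even be brute-forced by enumerating empty/non-empty patterns. The adjacency map and the omission of the 'side' condition below are simplifying assumptions:

```python
from itertools import product

def size(z):
    # A level-z polyhex contains 7^(z-1) unit hexagons.
    return 7 ** (z - 1)

def conn_lb(z, D, rho):
    """Lemma 4: recursive lower bound on Pr[Conn_(T,z,side)]."""
    if z == 1:
        return 1.0 - rho ** D                  # the single hexagon is non-empty
    p = conn_lb(z - 1, D, rho)
    # (a) all seven sub-polyhexes good, or (b) the internal one is
    # empty and the six outer ones are good.
    return p ** 7 + p ** 6 * rho ** (D * size(z - 1))

# Brute force in the spirit of Lemma 2 for a level-2 polyhex:
# hexagon 0 is the centre, 1..6 the surrounding ring (assumed adjacency;
# the 'side' condition is ignored for simplicity).
ADJ = {0: (1, 2, 3, 4, 5, 6), 1: (0, 2, 6), 2: (0, 1, 3),
       3: (0, 2, 4), 4: (0, 3, 5), 5: (0, 4, 6), 6: (0, 5, 1)}

def is_connected(cells):
    """True if the set of non-empty hexagons forms one connected component."""
    if not cells:
        return False
    seen, stack = set(), [next(iter(cells))]
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        stack.extend(n for n in ADJ[c] if n in cells)
    return seen == cells

def pr_conn_level2(D, rho):
    """Sum over all empty/non-empty patterns that leave the polyhex connected."""
    p_empty = rho ** D
    total = 0.0
    for pattern in product((False, True), repeat=7):
        cells = {h for h, full in enumerate(pattern) if full}
        if is_connected(cells):
            k = 7 - len(cells)                 # number of empty hexagons
            total += (1 - p_empty) ** (7 - k) * p_empty ** k
    return total
```

Because the 'side' condition is dropped, `pr_conn_level2` over-counts the event of Lemma 2, while `conn_lb` under-counts it, so the two bracket the true probability.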

7  Discussion

Choosing the size of the hexagon. For the results from the previous section to be of practical use, it is important that we choose the size of the hexagons in our system model carefully. On the one hand, choosing very large hexagons could violate the system-model assumption that nodes can communicate with nodes in neighboring hexagons; on the other hand, choosing small hexagons could result in poor lower bounds, and thus in over-engineered WSNs that incur high costs with incommensurate benefits. If we make no assumptions about the locations of nodes within hexagons, then the length l of the sides of a hexagon must be at most R/√13 to ensure connectivity between non-faulty nodes in neighboring hexagons. However, if the nodes are "evenly" placed within each hexagon, then l can be as large as R/2 while still ensuring connectivity between neighboring hexagons. In both cases, the requirement is that the distance between two non-faulty nodes in neighboring hexagons is at most R.

Computing N_{z,i} from Lemma 2. The function N_{z,i} does not have a closed-form solution; it needs to be computed through exhaustive enumeration. We computed N_{z,i} for some useful values of z and i and included them in Table 1. Using these values, we applied Theorem 3 and Lemma 4 to sensor networks of different sizes, node densities, and node failure probabilities. The results are presented in Table 2. Next, we demonstrate how to interpret and understand the entries in these tables through an illustrative example.

Practicality. Our results can be utilized in the following two practical scenarios. (1) Given an existing WSN with known node failure probability, node density, and area of coverage, we can estimate the probability of connectivity of the entire network. First, we decide on the size of a hexagon as discussed previously,


Table 1. Computed values of N_{z,i}

  z      i    N_{z,i}
  k > 2  1    size(k) = 7^{k−1}
  3      2    1176
  3      3    18346
  3      4    208372
  3      5    1830282
  3      6    12899198
  3      7    74729943
  3      8    361856172
  3      9    1481515771
  4      2    58653
  4      3    6666849
  5      2    2881200

Table 2. Various values of node failure probability ρ, node density D, and level-z polyhex that yield network connectivity probability exceeding 99%

  Node density D   No. of nodes   Node failure prob. ρ
  z = 2 (level-2 polyhex)
  3                21             35%
  5                35             53%
  10               70             70%
  z = 3 (level-3 polyhex)
  3                147            37%
  5                245            50%
  10               490            70%
  z = 4 (level-4 polyhex)
  3                1029           29%
  5                1715           47%
  10               3430           67%
  z = 5 (level-5 polyhex)
  3                7203           24%
  5                12005          40%
  10               24010          63%
  z = 6 (level-6 polyhex)
  3                50421          19%
  5                84035          36%
  10               24010          63%
  z = 7 (level-7 polyhex)
  3                352947         15%
  5                588245         31%
  10               1176490        57%

and then we consider level-z polyhexes that cover the region. Next, we apply Theorem 3 and Lemma 4 to compute the probability of connectivity of the network for the given values of ρ, D, and z, and the precomputed values of N_{z,i} in Table 1. (2) The results in this paper can be used to design a network with a specified probability of connectivity. In this case, we decide on a hexagon size that best suits the purposes of the sensor network and determine the level of the polyhex(es) needed to cover the desired area. As an example, consider a 200 sq. km region (approximately circular, so that there are no 'bottleneck' regions) that needs to be covered by a sensor network with a 99% connectivity probability. Let the communication radius of each sensor be 50 meters. The average-case value of the length l of the side of the hexagon is 25 meters, and the 200 sq. km region is tiled by a single level-7 polyhex. From Table 2, we see that if the network consists of 3 nodes per hexagon, then the region will require about 352947 nodes with a failure probability of 15% (85% reliability). However, if the node redundancy is increased to 5 nodes per hexagon, then the region will require about 588245 nodes with a failure probability of 31% (69% reliability). If the node density is


increased further to 10 nodes per hexagon, then the region will require about 1176490 nodes with a failure probability of 57% (43% reliability).

On the lower bounds. An important observation is that these values for node reliability are lower bounds, and definitely not tight bounds. This is largely because, in order to obtain tighter lower bounds, we need to compute the probability of network connectivity from Theorem 3. However, this requires us to compute the values of N_{z,i} for all values of i ranging from 0 to size(z), which is expensive for z exceeding 3. Consequently, we are forced to use the recursive function in Lemma 4 for computing the network connectivity of larger networks, which reduces the accuracy of the lower bound significantly. A side effect of this error is that in Table 2, for a given D, ρ decreases as z increases. If we were to invest the time and computing resources to compute N_{z,i} for higher values of z (5, 6, 7, and greater), then the computed values for ρ in Table 2 would be significantly larger.
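The sizing rule and the worked example above can be sanity-checked with a few lines of arithmetic; the regular-hexagon area formula is standard, and the figures below are the example's own (R = 50 m, D = 3, a single level-7 polyhex):

```python
from math import sqrt

R = 50.0                      # sensor communication radius (m)
l_worst = R / sqrt(13)        # no assumption on node placement
l_even = R / 2                # nodes "evenly" placed: l = 25 m
hex_area = 3 * sqrt(3) / 2 * l_even ** 2     # area of one hexagon, m^2
hexes = 7 ** (7 - 1)          # size(7) = 117649 hexagons in a level-7 polyhex
region_km2 = hexes * hex_area / 1e6          # roughly the example's 200 sq. km
nodes_needed = 3 * hexes      # D = 3 nodes/hexagon -> 352947 nodes
```

With l = 25 m the tiled area comes out to about 191 sq. km, which is consistent with the "approximately 200 sq. km" region, and 3 × 117649 reproduces the 352947-node figure from Table 2.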


Energy Aware Fault Tolerant Routing in Two-Tiered Sensor Networks Ataul Bari, Arunita Jaekel, and Subir Bandyopadhyay School of Computer Science, University of Windsor 401 Sunset Avenue, Windsor, ON, N9B 3P4, Canada {bari1,arunita,subir}@uwindsor.ca

Abstract. The design of fault-tolerant sensor networks is receiving increasing attention. In this paper we point out that simply ensuring that a sensor network can tolerate fault(s) is not sufficient; it is also important to ensure that the network remains viable for the longest possible time, even if a fault occurs. We have focussed on the problem of designing 2-tier sensor networks using relay nodes as cluster heads. Our objective is to ensure that the network has a communication strategy that extends, as much as possible, the period for which the network remains operational when there is a single relay node failure. We describe an Integer Linear Program (ILP) formulation and use it to study the effect of single faults. We have compared our results to those obtained using standard routing protocols: the Minimum Transmission Energy Model (MTEM) and the Minimum Hop Routing Model (MHRM). We show that our routing algorithm performs significantly better than both the MTEM and the MHRM.

1  Introduction

A wireless sensor network (WSN) is a network of battery-powered, multi-functional devices known as sensor nodes. Each sensor node typically consists of a micro-controller, a limited amount of memory, sensing device(s), and wireless transceiver(s) [2]. A sensor network performs its tasks through the collaborative efforts of a large number of sensor nodes that are densely deployed within the sensing field [2], [3], [4]. Data from each node in a sensor network are gathered at a central entity called the base station [2], [5]. Sensor nodes are powered by batteries, and recharging or replacing the batteries is usually not feasible due to economic reasons and/or environmental constraints [2]. Therefore, it is extremely important to design communication protocols and algorithms that are energy efficient, so that the duration of useful operation, often called the lifetime [6] of the network, can be extended as much as possible [3], [4], [5], [7], [24]. The lifetime of a sensor network is defined as the time interval from the inception of the operation of the network to the time when a number of critical nodes "die" [5], [6].

A. Jaekel and S. Bandyopadhyay have been supported by discovery grants from the Natural Sciences and Engineering Research Council of Canada.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 293–302, 2011. c Springer-Verlag Berlin Heidelberg 2011


Recently, special nodes called relay nodes have been proposed for sensor networks [8]-[17]. Relay nodes, provisioned with higher power, have been proposed as cluster heads in two-tiered sensor networks [10], [12], [17], [18], [19], where each relay node is responsible for collecting data from the sensor nodes belonging to its own cluster and for forwarding the collected data to the base station. The model for transmission of data from a relay node to the base station may be categorized either as the single-hop data transmission model (SHDTM) or the multi-hop data transmission model (MHDTM) [15], [16], [17], [20]. In MHDTM, each relay node, in general, uses some intermediate relay node(s) to forward the data to the base station. The MHDTM is considered in this paper since it is particularly suitable for larger networks. In the non-flow-splitting model, a relay node is not allowed to split its traffic: it forwards all its data to a single relay node (or to the base station), and there is always a single path from each relay node to the base station. This is a more appropriate model for 2-tier networks, with important technological advantages [8], [15], and has been used in this paper. In the periodic data-gathering model [8] considered in this paper, each period of data gathering (starting from sensing until all data reach the base station) is referred to as a round [20]. Although provisioned with higher power, the relay nodes are also battery operated and hence power constrained [16], [17]. In 2-tier networks, the lifetime is primarily determined by the duration for which the relay nodes are operational [10], [19]. It is therefore very important to allocate the sensor nodes to the relay nodes appropriately, and to find an efficient communication scheme that minimizes the energy dissipation of the relay nodes.
We have measured the lifetime of a 2-tier network, following the N-of-N metric [6], by the number of rounds the network operates from the start until the first relay node depletes its energy completely. In a 2-tier network using the N-of-N metric, assuming equal initial energy provisioning in each relay node, the lifetime of the network is given by the ratio of the initial energy to the maximum energy dissipated by any relay node in a round. Thus, maximizing the lifetime is equivalent to minimizing the maximum energy dissipated by any relay node in a round [8], [19]. In the first-order radio model [5], [6] used here, energy is dissipated at a rate of α1/bit (α2/bit) for receiving (transmitting) data. The transmit amplifier also dissipates β energy to transmit one bit of data over unit distance. The energy dissipated to receive b bits (transmit b bits over a distance d) is given by E_Rx = α1·b (E_Tx(b, d) = α2·b + β·b·d^q), where q is the path-loss exponent, 2 ≤ q ≤ 4, for free space using short- to medium-range radio communication. Due to the nature of the wireless medium, and based on the territory of deployment, nodes in a sensor network are prone to faults [23]. A sensor network should ideally be resilient to faults. In 2-tier networks, the failure of a single relay node may have a significant effect on the overall lifetime of the network [8]. In a fault-free environment, it is sufficient that each sensor node is able to send the data it collects to at least one relay node. To provide fault tolerance, we need a placement strategy that allows some redundancy of the relay nodes, so that, in the event of any failure(s) of relay node(s), each sensor
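The first-order radio model and the N-of-N lifetime ratio above can be written down directly. The constants are the ones used later in Section 3; the per-round energies passed to `lifetime_rounds` below are placeholder values for illustration:

```python
ALPHA1 = ALPHA2 = 50e-9       # reception / transmission energy, J/bit
BETA = 100e-12                # amplifier energy, J/bit/m^2
Q = 2                         # path-loss exponent (2 <= q <= 4)

def e_rx(bits):
    """Energy to receive `bits` bits: E_Rx = alpha1 * b."""
    return ALPHA1 * bits

def e_tx(bits, d, q=Q):
    """Energy to transmit `bits` bits over distance d metres:
    E_Tx = alpha2 * b + beta * b * d^q."""
    return ALPHA2 * bits + BETA * bits * d ** q

def lifetime_rounds(initial_energy, per_round_energies):
    """N-of-N lifetime: rounds until the most-loaded relay node dies."""
    return initial_energy / max(per_round_energies)
```

For instance, receiving 4000 bits costs e_rx(4000) = 0.2 mJ, while transmitting them over 100 m costs e_tx(4000, 100) = 4.2 mJ, so for q = 2 the amplifier term dominates at moderate distances.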


node belonging to the cluster of a failed relay node should be able to send its data to another fault-free relay node, and data from all fault-free relay nodes should still be able to reach the base station successfully. In [22], the authors have proposed an approximation algorithm to achieve single-connectivity and double-connectivity. In [25], the authors have presented a two-step approximation algorithm to obtain a 1-connected and a 2-connected network. In [17], a 2-tier architecture is considered and an optimal placement of relay nodes for each cell is computed to allow 2-connectivity. Even though a significant amount of work has focussed on extending the lifetime of a fault-free sensor network, including two-tier networks [5], [10], [15], [19], the primary objective of research on fault-tolerant sensor networks has been to ensure k-connectivity of the network, for some pre-specified k > 1. Such a design ensures that the network can handle k − 1 faults, since there exists at least one route of fault-free relay nodes from every sensor node to the base station. However, when a fault occurs, some relay nodes will be communicating more data compared to the fault-free case, and it is quite likely that the fault will significantly affect the lifetime of the network. To the best of our knowledge, no prior research has attempted to minimize the effect of faults on the lifetime of a sensor network. Our approach differs from other research in this area in that our objective is to guarantee that the network will be operational for the maximum possible period of time, even in the presence of faults. We have confined our work to the most likely scenario of a single relay node failure and have shown how to design the network to maximize its lifetime when a single relay node becomes faulty.
Obviously, to handle single faults in any relay node, all approaches, including ours, must design a network which guarantees that each sensor node has a fault-free path to the base station avoiding the faulty relay node. In our approach, for any case of single relay node failure, we select the paths from all sensor nodes to the base station in such a way that (a) we avoid the faulty relay node, and (b) the selected paths are such that the lifetime of the network is guaranteed to be as high as possible.

2  Fault Tolerant Routing Design

2.1  Network Model

We consider a two-tiered wireless sensor network model with n relay nodes and a base station. All data from the network are collected at the base station. For convenience, we assign labels 1, 2, 3, ..., n to the relay nodes and label n + 1 to the base station. If a sensor node i can send its data to relay node j, we say that j covers i. We assume that relay nodes are placed in such a way that each sensor node is covered by at least two relay nodes. This ensures that when a relay node fails, all sensor nodes in its cluster can be reassigned to other cluster(s), and the load (in terms of the number of bits) generated in the cluster of the failed node is redistributed among the neighboring relay nodes. A number of recent papers have addressed the issue of fault-tolerant placement of relay nodes to implement double (or multiple) coverage of each sensor node [17], [21], [22],


[25]. Such fault-tolerant placement schemes can also indicate the "backup" relay node for each sensor node, for use when its original cluster head fails. We assume that the initial placement of the relay nodes has been done according to one of these existing approaches, so that the necessary level of coverage is achieved. Based on the average amount of data generated by each cluster and the locations of the relay nodes, the Integer Linear Program (ILP) given below calculates the optimal routing schedule such that the worst-case lifetime under any single-fault scenario is maximized. In the worst-case situation, a relay node fails from the very beginning. We therefore consider all single relay node failures occurring when the network starts operating, and determine which failure has the worst effect on the lifetime, even if an optimal routing schedule is followed to handle the failure. This calculation is performed offline, so it is reasonable to use an ILP to compute the most energy-efficient routing schedule. The backup routing schedule for each possible fault can be stored either at the individual relay nodes or at the base station. In the second option, the base station, which is not power constrained, can transmit the updated schedule to the relay nodes when needed. For each relay node, the energy required for receiving the updated schedules is negligible compared to the energy required for data transmission, and hence is not expected to affect the overall lifetime significantly. In our model, applications are assumed to have long idle times and to be able to tolerate some latency [26], [27]. The nodes sleep during the idle time and transmit/receive when they are awake. Hence, energy is dissipated by a node only while it is either transmitting or receiving. We further assume that both sensor and relay nodes communicate through an ideal shared medium.
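As a small illustration of the coverage assumption (the positions and sensing radius below are hypothetical, not the paper's placement scheme), double coverage of every sensor is easy to verify:

```python
from math import dist

def covers(sensor, relay, r_sense):
    """Relay `relay` covers `sensor` if the sensor is within range."""
    return dist(sensor, relay) <= r_sense

def double_covered(sensors, relays, r_sense=80.0):
    """Every sensor must be covered by at least two relay nodes, so that a
    failed relay's cluster can be reassigned to a surviving neighbour."""
    return all(sum(covers(s, r, r_sense) for r in relays) >= 2
               for s in sensors)

# Hypothetical layout: both relays cover both sensors.
relays = [(0.0, 0.0), (60.0, 0.0)]
sensors = [(20.0, 10.0), (40.0, -10.0)]
```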
As in [12], [13], we assume that communication between nodes, including sleep/wake scheduling and the underlying synchronization protocol, is handled by appropriate state-of-the-art MAC protocols, such as those proposed in [26], [27], [28], [29].

2.2  Notation Used

In our formulation we are given the following data as input:
• α1 (α2): Energy coefficient for reception (transmission).
• β: Energy coefficient for the amplifier.
• q: Path-loss exponent.
• b_i: Number of bits generated per round by the sensor nodes belonging to cluster i, in the fault-free case.
• b_i^k: Number of bits per round, originally from cluster k, that are reassigned to cluster i when relay node k fails. Clearly, Σ_{i=1; i≠k}^{n} b_i^k = b_k.
• n: Total number of relay nodes.


• n + 1: Index of the base station.
• C: A large constant, greater than the total number of bits received by the base station in a round.
• d_max: Transmission range of each relay node.
• d_{i,j}: Euclidean distance from node i to node j.

We also define the following variables:
• X_{i,j}^k: A binary variable defined as follows: X_{i,j}^k = 1 if node i selects j to send its data when relay node k fails, and X_{i,j}^k = 0 otherwise.
• T_i^k: Number of bits transmitted by relay node i when relay node k fails.
• G_i^k: Amount of energy needed by the amplifier in relay node i to send its data to the next hop on its path to the base station when relay node k fails.
• R_i^k: Number of bits received by relay node i from other relay nodes when relay node k fails.
• f_{i,j}^k: Amount of flow from relay node i to relay node j when relay node k fails.
• F_max: The total energy spent per round by the relay node that is being depleted at the fastest rate when any one relay node fails.

2.3  ILP Formulation for Fault Tolerant Routing (ILP-FTR)

Minimize F_max   (1)

Subject to:

a) The range of transmission from a relay node is d_max:

   X_{i,j}^k · d_{i,j} ≤ d_max,   ∀i, k: 1 ≤ i, k ≤ n, k ≠ i, j, i ≠ j; ∀j: 1 ≤ j ≤ n + 1.   (2)

b) Ensure that the non-flow-splitting model is followed, so that all data from relay node i are forwarded to exactly one other node j:

   Σ_{j=1; j≠i,k}^{n+1} X_{i,j}^k = 1,   ∀i, k: 1 ≤ i, k ≤ n, k ≠ i.   (3)

c) Only one outgoing link from relay node i can have non-zero data flow:

   f_{i,j}^k ≤ C · X_{i,j}^k,   ∀i, k: 1 ≤ i, k ≤ n, k ≠ i, j, i ≠ j; ∀j: 1 ≤ j ≤ n + 1, j ≠ k.   (4)

d) Satisfy flow constraints:

   Σ_{j=1; j≠i,k}^{n+1} f_{i,j}^k − Σ_{j=1; j≠i,k}^{n} f_{j,i}^k = b_i + b_i^k,   ∀i, k: 1 ≤ i, k ≤ n, k ≠ i.   (5)

e) Calculate the total number of bits transmitted by relay node i:

   T_i^k = Σ_{j=1; j≠i,k}^{n+1} f_{i,j}^k,   ∀i, k: 1 ≤ i, k ≤ n, k ≠ i.   (6)

f) Calculate the amplifier energy dissipated by relay node i to transmit to the next hop:

   G_i^k = β Σ_{j=1; j≠i,k}^{n+1} f_{i,j}^k · (d_{i,j})^q,   ∀i, k: 1 ≤ i, k ≤ n, k ≠ i.   (7)

g) Calculate the number of bits received by node i from other relay node(s):

   R_i^k = Σ_{j=1; j≠i,k}^{n} f_{j,i}^k,   ∀i, k: 1 ≤ i, k ≤ n, k ≠ i.   (8)

h) The energy dissipated per round by relay node i, when node k has failed, must be at most F_max:

   α1 (R_i^k + b_i^k) + α2 T_i^k + G_i^k ≤ F_max,   ∀i, k: 1 ≤ i, k ≤ n, k ≠ i.   (9)

2.4  Justification of the ILP Equations

Equation (1) is the objective function of the formulation: it minimizes the maximum energy dissipated by any individual relay node in one round of data gathering, over all possible fault scenarios. Constraint (2) ensures that a relay node i cannot transmit to a node j if j is outside the transmission range of node i. Constraints (3) and (4) indicate that, for any given fault (e.g., a fault in node k), a non-faulty relay node i can only transmit data to exactly one other (non-faulty) node j. Constraint (5) is the standard flow constraint [1], used to find a route to the base station for the data originating in each cluster when node k fails. We note that the total data (number of bits) generated in cluster i, when node k fails, is given by the number of bits b_i originally generated in cluster i, plus the additional b_i^k bits reassigned from cluster k to cluster i due to the failure of relay node k. Constraint (6) specifies the total number of bits T_i^k transmitted by relay node i when node k has failed. Constraint (7) calculates G_i^k, the total amplifier energy needed at relay node i when node k fails, by directly applying the first-order radio model [5], [6]. Constraint (8) calculates the total number of bits R_i^k received at relay node i from other relay node(s) when node k fails. Finally, (9) gives the total energy dissipated by each relay node when node k fails. The total energy dissipated by a relay node, under any possible fault scenario (i.e., any value of k), cannot exceed F_max, which the formulation attempts to minimize.
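To make the formulation concrete, the objective and constraints can be checked on a toy instance by exhaustive search over non-flow-splitting routing trees. Every name, position, and load below is a hypothetical stand-in for illustration, not the authors' solver setup, and the even split of a failed cluster's bits is an assumption:

```python
from itertools import product
from math import dist, inf

# Toy instance: three relays (1..3) and a base station 'B'.
POS = {1: (0, 0), 2: (60, 0), 3: (30, 50), 'B': (30, 120)}
BITS = {1: 4000, 2: 4000, 3: 4000}                  # b_i, bits per round
REASSIGN = {(k, i): BITS[k] // 2
            for k in BITS for i in BITS if i != k}  # b_i^k, even split
A1 = A2 = 50e-9                                     # J/bit
BETA, Q, DMAX = 100e-12, 2, 200                     # amplifier, exponent, range

def path_to_base(i, parent):
    """Relays on the path from i to 'B', or None if the choice cycles."""
    seen = []
    while i != 'B':
        if i in seen:
            return None
        seen.append(i)
        i = parent[i]
    return seen

def min_fmax(k):
    """Minimum of F_max over all valid routing trees when relay k fails."""
    survivors = [i for i in BITS if i != k]
    best = inf
    for choice in product(survivors + ['B'], repeat=len(survivors)):
        parent = dict(zip(survivors, choice))
        # constraints (2)-(4): one in-range next hop, no self-loops
        if any(p == i or dist(POS[i], POS[p]) > DMAX
               for i, p in parent.items()):
            continue
        paths = {j: path_to_base(j, parent) for j in survivors}
        if any(p is None for p in paths.values()):
            continue                                 # cycle: no route to base
        load = {i: BITS[i] + REASSIGN[(k, i)] for i in survivors}
        # constraints (5)-(6): T_i aggregates every cluster routed through i
        T = {i: sum(load[j] for j in survivors if i in paths[j])
             for i in survivors}
        fmax = 0.0
        for i in survivors:
            R = T[i] - load[i]                       # bits from other relays
            G = BETA * T[i] * dist(POS[i], POS[parent[i]]) ** Q
            # constraint (9): per-round energy of relay i
            fmax = max(fmax, A1 * (R + REASSIGN[(k, i)]) + A2 * T[i] + G)
        best = min(best, fmax)
    return best

worst_case = max(min_fmax(k) for k in BITS)          # objective (1), all faults
```

On this instance the search already shows the point of the formulation: when relay 1 fails, routing relay 2 through relay 3 yields a lower F_max than sending both survivors directly to the base station.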


3  Experimental Results

In this section, we present simulation results for our fault-tolerant routing scheme. We have considered a 240×240 m networking area with seventeen relay nodes and sensor nodes randomly distributed in the area. The results were obtained using the ILOG CPLEX 9.1 solver [30]. For each fault scenario (i.e., a specified relay node becomes faulty), we have measured the achieved lifetime of the network by the number of rounds until the first fault-free relay node runs out of battery power. The lifetime in the presence of a fault can vary widely, depending on the location and load of the faulty node. When reporting the lifetime achieved in the presence of a fault, we have taken the worst-case value, i.e., the lowest lifetime obtained after considering all possible single node failures, with the node failure occurring immediately after the network has been deployed. For experimental purposes, we have considered a number of different sensor node distributions, with the number of sensor nodes in the network varying from 136 to 255. We have assumed that:
1. the communication energy dissipation is based on the first-order radio model described in Section 1;
2. the values of the constants are the same as in [5], so that (a) α1 = α2 = 50 nJ/bit, (b) β = 100 pJ/bit/m², and (c) the path-loss exponent q = 2;
3. the range of each sensor node is 80 m;
4. the range of each relay node is 200 m, as in [17]; and
5. the initial energy of each relay node is 5 J, as in [17].
We also assumed that a separate node placement and clustering scheme (as in [17], [21]) is used to ensure that each sensor and relay node has a valid path to the base station for all single-fault scenarios, and to pre-assign the sensor nodes to clusters. Under these assumptions, we have compared the performance of our scheme with two existing well-known schemes that are widely used in sensor networks.
i) Minimum transmission energy model (MTEM) [5], where each node i transmits to its nearest neighbor j such that node j is closer to the base station than node i.
ii) Minimum hop routing model (MHRM) [12], [14], where each node finds a path to the base station that minimizes the number of hops.
Figure 1 compares the network lifetime obtained using the ILP-FTR, MTEM and MHRM schemes. As shown in the figure, our formulation substantially outperforms both the MTEM and MHRM approaches under any single relay node failure. Furthermore, the ILP guarantees the "best" solution (with respect to the objective being optimized). The results show that, under any single relay node failure, our method can typically achieve an improvement of more than 2.7 times the

A. Bari, A. Jaekel, and S. Bandyopadhyay


Fig. 1. Comparison of the lifetimes in rounds, obtained using the ILP-FTR, MTEM and MHRM on networks with different numbers of sensor nodes


Fig. 2. Variation of the lifetimes in rounds, under the fault of diﬀerent relay nodes, obtained using the ILP-FTR, MTEM and MHRM on a network with 170 sensor nodes

network lifetime, compared to MTEM, and 2.3 times the network lifetime compared to MHRM. Figure 2 shows how the network lifetime varies with the failure of a relay node under the ILP-FTR, MTEM and MHRM schemes on a network with 170 sensor nodes. As the figure shows, our approach provides substantial improvement over the other approaches, in terms of network lifetime, considering all failure scenarios. Using our approach, it is also possible to identify the relay node(s) that are most critical and, possibly, to provide some additional protection (e.g., deployment of back-up node(s)) to guarantee the lifetime. Finally, we note that MTEM appears to be much more vulnerable to fluctuations in lifetime, depending on the particular node that failed, compared to the other two schemes.
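For concreteness, the two baseline schemes compared above can be sketched as simple next-hop and hop-count rules. This is a hedged sketch under our own conventions (positions as coordinate tuples, helper names ours), not code from [5] or [12]:

```python
from collections import deque
import math

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def mtem_next_hop(i, nodes, base):
    """MTEM: node i forwards to its nearest neighbor j that is closer to the
    base station than i; if no such relay exists, transmit directly to the base."""
    closer = [j for j in nodes if j != i and dist(j, base) < dist(i, base)]
    return min(closer, key=lambda j: dist(i, j), default=base)

def mhrm_hops(adj, base):
    """MHRM: minimum hop count to the base station for every reachable node,
    computed by breadth-first search over the connectivity graph `adj`."""
    hops = {base: 0}
    queue = deque([base])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in hops:
                hops[v] = hops[u] + 1
                queue.append(v)
    return hops
```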

4 Conclusions

In this paper we have addressed the problem of maximizing the lifetime of a two-tier sensor network in the presence of faults. Although many papers have considered energy-aware routing for the fault-free case, and have also proposed deploying redundant relay nodes to meet connectivity and coverage requirements, we believe this is the first paper to investigate energy-aware routing for different fault scenarios. Our approach optimizes the network lifetime that can be achieved, and provides the corresponding routing scheme to be followed to achieve this goal, for any single node fault. The simulation results show that the proposed approach can significantly improve network lifetime, compared to standard schemes such as MTEM and MHRM.

References
1. Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows: Theory, Algorithms, and Applications. Prentice Hall, Englewood Cliffs (1993)
2. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: a survey. Computer Networks 38, 393-422 (2002)
3. Akkaya, K., Younis, M.: A survey on routing protocols for wireless sensor networks. IEEE Transactions on Mobile Computing 3(3), 325-349 (2005)
4. Chong, C.-Y., Kumar, S.P.: Sensor Networks: Evolution, Opportunities, and Challenges. Proceedings of the IEEE 91(8), 1247-1256 (2003)
5. Heinzelman, W., Chandrakasan, A., Balakrishnan, H.: Energy efficient communication protocol for wireless micro-sensor networks. In: 33rd HICSS, pp. 3005-3014 (2000)
6. Pan, J., Hou, Y.T., Cai, L., Shi, Y., Shen, S.X.: Topology Control for Wireless Sensor Networks. In: Proceedings of the International Conference on Mobile Computing and Networking, pp. 286-299 (2003)
7. Duarte-Melo, E.J., Liu, M.: Analysis of energy consumption and lifetime of heterogeneous wireless sensor networks. In: Proceedings of the IEEE Global Telecommunications Conference, vol. 1, pp. 21-25 (2002)
8. Bari, A.: Energy Aware Design Strategies for Heterogeneous Sensor Networks. PhD thesis, University of Windsor (2010)
9. Bari, A., Jaekel, A., Bandyopadhyay, S.: Integrated Clustering and Routing Strategies for Large Scale Sensor Networks. In: Akyildiz, I.F., Sivakumar, R., Ekici, E., de Oliveira, J.C., McNair, J. (eds.) NETWORKING 2007. LNCS, vol. 4479, pp. 143-154. Springer, Heidelberg (2007)
10. Bari, A., Jaekel, A., Bandyopadhyay, S.: Optimal Placement and Routing Strategies for Resilient Two-Tiered Sensor Networks. Wireless Communications and Mobile Computing 9(7), 920-937 (2008), doi:10.1002/wcm.639
11. Cheng, X., Du, D.-Z., Wang, L., Xu, B.B.: Relay Sensor Placement in Wireless Sensor Networks. Wireless Networks 14(3), 347-355 (2008)
12. Gupta, G., Younis, M.: Load-balanced clustering of wireless sensor networks. In: IEEE International Conference on Communications, vol. 3, pp. 1848-1852 (2003)
13. Gupta, G., Younis, M.: Fault-tolerant clustering of wireless sensor networks. In: IEEE WCNC, pp. 1579-1584 (2003)
14. Gupta, G., Younis, M.: Performance evaluation of load-balanced clustering of wireless sensor networks. In: International Conference on Telecommunications, vol. 2, pp. 1577-1583 (2003)
15. Hou, Y.T., Shi, Y., Pan, J., Midkiff, S.F.: Maximizing the Lifetime of Wireless Sensor Networks through Optimal Single-Session Flow Routing. IEEE Transactions on Mobile Computing 5(9), 1255-1266 (2006)
16. Hou, Y.T., Shi, Y., Sherali, H.D., Midkiff, S.F.: On Energy Provisioning and Relay Node Placement for Wireless Sensor Networks. In: IEEE International Conference on Sensor and Ad Hoc Communications and Networks (SECON), vol. 32 (2005)
17. Tang, J., Hao, B., Sen, A.: Relay node placement in large scale wireless sensor networks. Computer Communications 29(4), 490-501 (2006)
18. Bari, A., Jaekel, A., Bandyopadhyay, S.: Clustering Strategies for Improving the Lifetime of Two-Tiered Sensor Networks. Computer Communications 31(14), 3451-3459 (2008)
19. Bari, A., Jaekel, A., Bandyopadhyay, S.: A Genetic Algorithm Based Approach for Energy Efficient Routing in Two-Tiered Sensor Networks. Ad Hoc Networks Journal, Special Issue: Bio-Inspired Computing 7(4), 665-676 (2009)
20. Kalpakis, K., Dasgupta, K., Namjoshi, P.: Efficient algorithms for maximum lifetime data gathering and aggregation in wireless sensor networks. Computer Networks 42(6), 697-716 (2003)
21. Bari, A., Wu, Y., Jaekel, A.: Integrated Placement and Routing of Relay Nodes for Fault-Tolerant Hierarchical Sensor Networks. In: IEEE ICCCN - SN, pp. 1-6 (2008)
22. Hao, B., Tang, J., Xue, G.: Fault-tolerant relay node placement in wireless sensor networks: formulation and approximation. In: Workshop on High Performance Switching and Routing (HPSR), pp. 246-250 (2004)
23. Alwan, H., Agarwal, A.: A Survey on Fault Tolerant Routing Techniques in Wireless Sensor Networks. In: SensorComm, pp. 366-371 (2009)
24. Wu, Y., Fahmy, S., Shroff, N.B.: On the Construction of a Maximum-Lifetime Data Gathering Tree in Sensor Networks: NP-Completeness and Approximation Algorithm. In: INFOCOM, pp. 356-360 (2008)
25. Liu, H., Wan, P.-J., Jia, X.: Fault-Tolerant Relay Node Placement in Wireless Sensor Networks. In: Wang, L. (ed.) COCOON 2005. LNCS, vol. 3595, pp. 230-239. Springer, Heidelberg (2005)
26. Ye, W., Heidemann, J., Estrin, D.: An Energy-Efficient MAC protocol for Wireless Sensor Networks. In: IEEE INFOCOM, pp. 1567-1576 (2002)
27. Ye, W., Heidemann, J., Estrin, D.: Medium access control with coordinated adaptive sleeping for wireless sensor networks. IEEE/ACM Transactions on Networking 12(3), 493-506 (2004)
28. Wu, Y., Fahmy, S., Shroff, N.B.: Optimal Sleep/Wake Scheduling for time-synchronized sensor networks with QoS guarantees. In: Proceedings of IEEE IWQoS, pp. 102-111 (2006)
29. Wu, Y., Fahmy, S., Shroff, N.B.: Energy Efficient Sleep/Wake Scheduling for Multi-Hop Sensor Networks: Non-Convexity and Approximation Algorithm. In: Proceedings of IEEE INFOCOM, pp. 1568-1576 (2007)
30. ILOG CPLEX 9.1 Documentation, http://www.columbia.edu/~dano/resources/cplex91_man/index.html

Scheduling Randomly-Deployed Heterogeneous Video Sensor Nodes for Reduced Intrusion Detection Time

Congduc Pham

University of Pau, LIUPPA Laboratory, Avenue de l'Université - BP 1155, 64013 PAU CEDEX, France
[email protected]
http://web.univ-pau.fr/~cpham

Abstract. This paper proposes to use video sensor nodes to provide an efficient intrusion detection system. We use a scheduling mechanism that takes into account the criticality of the surveillance application, and present a performance study of various cover set construction strategies that take into account cameras with heterogeneous angles of view, including very small angles of view. We show by simulation how a dynamic criticality management scheme can provide fast event detection for mission-critical surveillance applications by increasing the network lifetime and providing a low stealth time for intrusions.

Keywords: Sensor networks, video surveillance, coverage, mission-critical applications.

1 Introduction

The monitoring capability of Wireless Sensor Networks (WSN) makes them very suitable for large-scale surveillance systems. Most of these applications have a high level of criticality and cannot be deployed with the current state of technology. This article focuses on Wireless Video Sensor Networks (WVSN), where sensor nodes are equipped with miniaturized video cameras. We consider WVSN for mission-critical surveillance applications, where sensors can be thrown in mass when needed for intrusion detection or disaster relief applications. This article also focuses on taking into account cameras with heterogeneous angles of view and those with very small angles of view. Surveillance applications [1,2,3,4,5] have very specific needs due to their inherently critical nature associated with security. Early surveillance applications involving WSN have been applied to critical infrastructures such as production systems or oil/water pipeline systems [6,7]. There have also been some propositions for intrusion detection applications [8,9,10,11], but most of these studies focused on coverage and energy optimizations without explicitly having the application's criticality in the control loop, which is the main concern in our work.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 303-314, 2011. © Springer-Verlag Berlin Heidelberg 2011


For instance, with video sensors, the higher the capture rate is, the better relevant events can be detected and identified. However, even in the case of very mission-critical applications, it is not realistic to consider that video nodes should always capture at their maximum rate when in active mode. The notion of cover set has been introduced to define the redundancy level of the sensor nodes that monitor the same region. In [12] we developed the idea that when a node has several cover sets, it can increase its frame capture rate, because if it runs out of energy it can be replaced by one of its cover sets. Then, depending on the application's criticality, the frame capture rate of those nodes with a large number of cover sets can vary: a low criticality level indicates that the application does not require a high video frame capture rate, while a high criticality level does. According to the application's requirements, an r0 value that indicates the criticality level can be initialized into all sensor nodes prior to deployment. Based on the criticality model we developed previously in [12], this article has two contributions. The first contribution is an enhanced model for determining a sensor's cover sets that takes into account cameras with heterogeneous angles of view and those with very small angles of view. The performance of this approach is evaluated through simulation. The second contribution is to show the performance of the multiple-cover-set, criticality-based scheduling method proposed in [12] for fast event detection in mission-critical applications. The paper is organized as follows: Section 2 presents the coverage model and our approach for quickly building multiple cover sets per sensor. In Section 3 we briefly present the dynamic criticality management model, and we then present the main contribution of this paper, which focuses on fast event detection, in Section 4. We conclude in Section 5.

2 Video Sensor Model

A video sensor node v is represented by the FoV of its camera. In our approach, we consider a commonly used 2-D model of a video sensor node where the FoV is defined as a triangle (pbc), denoted by a 4-tuple v(P, d, V, α). Here P is the position of v, d is the distance pv (depth of view, DoV), V is the vector representing the line of sight of the camera's FoV, which determines the sensing direction, and α is the angle of the FoV on both sides of V (2α can be denoted as the angle of view, AoV).

Fig. 1. Coverage model: (a) coverage model, (b) heterogeneous AoV

The left side of figure 1(a) illustrates the FoV of a video sensor node in our model. The AoV (2α) is 30° and the distance bc is the linear FoV, which is usually expressed in ft/1000yd or in millimeters/meter. By using simple trigonometric relations we can link bc to pv with the relation bc = 2(sin α / cos α)·pv = 2 tan(α)·pv. We define a cover set Coi(v) of a video node v as a subset of video nodes such that the union of the FoV areas of the nodes v' ∈ Coi(v) covers v's FoV area. Co(v) is defined as the set of all the cover sets Coi(v) of node v. One of the first embedded cameras on wireless sensor hardware is the Cyclops board designed for the CrossBow Mica2 sensor [13], which is advertised to have an AoV of 52°. Recently, the IMB400 multimedia board has been designed for the Intel Mote2 sensor and has an AoV of about 20°, which is rather small. Obviously, the linear FoV and the AoV are important criteria in video sensor networks deployed for mission-critical surveillance applications. The DoV is a more subjective parameter. Technically, the DoV could be very large, but in practice it is limited by the fact that an observed object must be sufficiently big to be identified.
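A small sketch of this 2-D FoV model follows; the coordinate conventions (V as a unit vector, the base bc perpendicular to V at depth d) are our own reading of the figure:

```python
import math

def fov_triangle(P, d, V, alpha):
    """Vertices (p, b, c) of the triangular FoV v(P, d, V, alpha): P is the
    camera position, V a unit line-of-sight vector, d the depth of view and
    alpha the half angle of view on each side of V."""
    px, py = P
    vx, vy = V
    mx, my = px + d * vx, py + d * vy      # midpoint of segment [bc]
    half = math.tan(alpha) * d             # |bc| / 2 = tan(alpha) * pv
    nx, ny = -vy, vx                       # unit normal to V
    b = (mx + half * nx, my + half * ny)
    c = (mx - half * nx, my - half * ny)
    return P, b, c

def linear_fov(d, alpha):
    """Linear FoV |bc| = 2 (sin(alpha)/cos(alpha)) pv = 2 tan(alpha) d."""
    return 2 * math.tan(alpha) * d
```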

2.1 Determining Cover Sets

In the case of omnidirectional sensing, a node can simply determine which parts of its coverage disc are covered by its neighbors. For FoV coverage the task is more complex: determining whether a sensor's FoV is completely covered by a subset of neighbor sensors is a time-consuming task which is usually too resource-consuming for autonomous sensors. A simple approach, presented in [14], is to use significant points of a sensor's FoV to quickly determine cover sets that may not completely cover sensor v's FoV, but a high percentage of it. First, sensor v can classify its neighbors into 3 categories of nodes: (i) those that cover point p, (ii) those that cover point b and (iii) those that cover point c. Then, in order to avoid selecting neighbors that cover only a small portion of v's FoV, we add a fourth point, taken near the center of v's FoV, to construct a fourth set, and require that a candidate neighbor covers at least one of the 3 vertices and the fourth point. It is possible to use pbc's center of gravity, denoted point g, as depicted in figure 1(a)(right). In this case, a node v can practically compute Co(v) by finding the following sets, where N(v) represents the set of neighbors of node v:
– P/B/C/G = {v' ∈ N(v) : v' covers point p/b/c/g of the FoV}
– PG = P ∩ G, BG = B ∩ G, CG = C ∩ G
Then, Co(v) can be computed as the Cartesian product of the sets PG, BG and CG (PG × BG × CG). However, compared to the basic approach described in [14], point g may not be the best choice in the case of heterogeneous camera AoVs and very small AoVs, as will be explained in the next subsections.
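The set construction above can be sketched as follows; the FoV representation (a triangle of vertices) and the cross-product point-in-triangle test are our own assumptions, not prescribed by [14]:

```python
from itertools import product

def covers(tri, pt):
    """Point-in-triangle test using the signs of cross products."""
    (ax, ay), (bx, by), (cx, cy) = tri
    x, y = pt
    d1 = (bx - ax) * (y - ay) - (by - ay) * (x - ax)
    d2 = (cx - bx) * (y - by) - (cy - by) * (x - bx)
    d3 = (ax - cx) * (y - cy) - (ay - cy) * (x - cx)
    neg = d1 < 0 or d2 < 0 or d3 < 0
    pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (neg and pos)

def cover_sets(p, b, c, g, neighbours):
    """Co(v) via the significant-point heuristic: PG/BG/CG contain the
    neighbours whose FoV covers g together with p/b/c respectively, and
    Co(v) is the Cartesian product PG x BG x CG. `neighbours` maps a node
    id to its FoV triangle (a hypothetical representation)."""
    PG = [n for n, f in neighbours.items() if covers(f, p) and covers(f, g)]
    BG = [n for n, f in neighbours.items() if covers(f, b) and covers(f, g)]
    CG = [n for n, f in neighbours.items() if covers(f, c) and covers(f, g)]
    return [set(t) for t in product(PG, BG, CG)]
```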

2.2 The Case of Heterogeneous AoV

It is highly possible that video sensors with different angles of view are randomly deployed. In this case, a wide-angle FoV may have to be covered by narrow-angle FoV sensors and vice-versa. Figure 1(b) shows these cases; the left part of the figure shows the most problematic case, where a wide FoV (2α = 60°) has to be covered by a narrow FoV (2α = 30°). As we can see, it becomes very difficult for a narrow-angle node to cover pbc's center of gravity g and one of the vertices at the same time.

Fig. 2. Using more alternate points: (a) heterogeneous AoV, (b) very small AoV

The solution we propose in this paper is to use alternate points gp, gb and gc, set in figure 2(a)(left) as the mid-points of segments [pg], [bg] and [cg] respectively. It is also possible to give them different weights, as shown in the right part of the figure. When using these additional points, it is possible to require that a sensor vx either covers both c and gc, or gc and g (and similarly for b and gb, and for p and gp), depending on whether the edges or the center of sensor v's FoV are privileged. Generalizing this method by using different weights to set gc, gb and gp closer to or farther from their respective vertices can be useful to decide which parts of v's FoV have more priority, as depicted in figure 2(a)(right), where gc has moved closer to g, gb closer to b and gp closer to p.
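The alternate points can be computed as a simple linear interpolation; this small sketch uses a single weight w for all three points, which is one possible reading of the weighted variant:

```python
def alternate_points(p, b, c, w=0.5):
    """Alternate points g_p, g_b, g_c on segments [pg], [bg], [cg].
    w = 0.5 gives the mid-points of figure 2(a)(left); smaller weights move
    each point towards its vertex, larger weights towards g."""
    g = ((p[0] + b[0] + c[0]) / 3, (p[1] + b[1] + c[1]) / 3)  # center of gravity
    def towards_g(a):
        return (a[0] + w * (g[0] - a[0]), a[1] + w * (g[1] - a[1]))
    return g, towards_g(p), towards_g(b), towards_g(c)
```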

2.3 The Case of Very Small AoV

On some hardware, the AoV can be very small. This is the case, for instance, with the IMB400 multimedia board on the iMote2, which has an AoV of 2α = 20°. Figure 2(b)(left) shows that in this case the most difficult scenario is to cover both point p and point gp if gp is set too far from p. As it is not interesting to move gp closer to p with such a small AoV, the solution we propose is to discard point p and only consider point gp, which can move along segment [pg] as previously. Therefore, in the scenario depicted in figure 2(b)(right), we have PG = {v3, v6}, BG = {v1, v2, v5} and CG = {v4}, resulting in Co(v) = {{v3, v1, v4}, {v3, v2, v4}, {v3, v5, v4}, {v6, v1, v4}, {v6, v2, v4}, {v6, v5, v4}}.
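The enumeration above is just the Cartesian product of the three sets, e.g.:

```python
from itertools import product

# Sets from the scenario of figure 2(b)(right), with point p discarded
PG = ['v3', 'v6']
BG = ['v1', 'v2', 'v5']
CG = ['v4']

# Co(v) = PG x BG x CG: 6 cover sets, from {v3, v1, v4} to {v6, v5, v4}
Co_v = [set(t) for t in product(PG, BG, CG)]
```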

2.4 Accuracy of the Proposed Method

Using specific points is of course approximate, and a cover can satisfy the specific-point coverage conditions without ensuring coverage of the entire FoV. To evaluate the accuracy of our cover set construction technique, especially for very small AoVs, we conducted a series of simulations based on the discrete event simulator OMNeT++ (http://www.omnetpp.org/). The results were obtained from iterations with various node populations on a 75m × 75m area. Nodes have a random position P, a random line of sight V, equal communication ranges of 30m (which determine neighbor nodes), equal DoVs of 25m and an offset angle α. We test with 2α = 20° (α = π/18), 2α = 36° (α = π/10) and 2α = 60° (α = π/6). We ran each simulation 15 times to reduce the impact of randomness. The results (averaged over the 15 simulation runs) are summarized in table 1. We denote by COpbcG, COpbcApbc and CObcApbc the following respective strategies: (i) the triangle points are used with g, pbc's center of gravity, when determining eligible neighbors to be included in a sensor's cover sets, (ii) the alternate points gp, gb and gc are used along with the triangle points and, (iii) the same as (ii) except that point p is discarded. The "stddev of %coverage" column is the standard deviation over all the simulation runs. A small standard deviation value means that the various cover sets have percentages of coverage of the initial FoV close to each other. When "stddev of %coverage" is 0, each simulation run gives only 1 node with 1 cover set. This is usually the case when the strategy used to construct cover sets is too restrictive. Table 1 is divided in 3 parts. The first part shows the COpbcG strategy with 2α = 60°, 2α = 36° and 2α = 20°. We can see that using point g gives a very high percentage of coverage, but with 2α = 36° very few nodes have cover sets compared to the case when 2α = 60°.
With a very small AoV, the position of point g is not suitable, as no cover sets are found. The second part of table 1 shows the COpbcApbc strategy, where the alternate points gp, gb and gc are used along with the triangle vertices, with 2α = 36° and 2α = 20°. For 2α = 36°, this strategy succeeds in providing both a high percentage of coverage and a larger number of nodes with cover sets. When 2α = 20°, the percentage of coverage is over 70% but once again very few nodes have cover sets. This second part also shows CObcApbc (point p discarded) with 2α = 20°. This strategy is quite interesting, as the number of nodes with cover sets increases for a percentage of coverage very close to the previous case. In addition, the mean number of cover sets per node greatly increases, which is highly interesting since nodes with a high number of cover sets can act as sentry nodes in the network. The last part of table 1 uses a mixed-AoV scenario where 80% of the nodes have an AoV of 20° and 20% of the nodes an AoV of 36°. It shows the performance of the 3 strategies, and we can see that CObcApbc presents the best tradeoff in terms of percentage of coverage, number of nodes with cover sets and mean number of cover sets per node when many nodes have a small AoV.
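One way the "%coverage" figures of table 1 could be estimated is by Monte-Carlo sampling inside the FoV triangle; this is a hedged sketch of such an estimator, and the exact measurement procedure used in the paper may differ:

```python
import random

def in_triangle(tri, pt):
    """Cross-product sign test for point-in-triangle."""
    (ax, ay), (bx, by), (cx, cy) = tri
    x, y = pt
    d1 = (bx - ax) * (y - ay) - (by - ay) * (x - ax)
    d2 = (cx - bx) * (y - by) - (cy - by) * (x - bx)
    d3 = (ax - cx) * (y - cy) - (ay - cy) * (x - cx)
    return not ((d1 < 0 or d2 < 0 or d3 < 0) and (d1 > 0 or d2 > 0 or d3 > 0))

def coverage_percentage(fov, cover, samples=20000, seed=1):
    """Estimate the percentage of v's FoV covered by the union of the FoV
    triangles in `cover`, sampling uniformly in the triangle via barycentric
    coordinates."""
    rnd = random.Random(seed)
    (ax, ay), (bx, by), (cx, cy) = fov
    hit = 0
    for _ in range(samples):
        r1, r2 = rnd.random(), rnd.random()
        if r1 + r2 > 1:                    # fold back into the triangle
            r1, r2 = 1 - r1, 1 - r2
        pt = (ax + r1 * (bx - ax) + r2 * (cx - ax),
              ay + r1 * (by - ay) + r2 * (cy - ay))
        if any(in_triangle(t, pt) for t in cover):
            hit += 1
    return 100.0 * hit / samples
```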


Table 1. Results for COpbcG, COpbcApbc and CObcApbc. 2α = 20°, 2α = 36° and mixed AoV

(columns: #nodes | % nodes with cover set | mean %coverage | min,max %coverage/cover set | stddev of %coverage | min,max #coverset/node | mean #coverset/node)

COpbcG, 2α = 60°
75  |  4.89 | 94.04 | 90.16,98.15 | 3.67 | 1,5.66    |  2.10
100 |  7.13 | 94.63 | 86.99,98.49 | 4.40 | 1,6       |  2.99
125 | 11.73 | 95.06 | 85.10,99.52 | 4.12 | 1,13      |  3.53
150 | 17.11 | 95.44 | 84,99.82    | 3.98 | 1,16.13   |  4.15
175 | 26.19 | 94.64 | 83.57,99.89 | 4.01 | 1,35.66   |  6.40

COpbcG, 2α = 36°
75  |  0    |  0    | 0,0         | nan  | 0,0       |  0
100 |  1    | 92.03 | 89.78,98.64 | 0    | 1,1       |  1
125 |  1.87 | 91.45 | 88.83,93.15 | 2.97 | 1.13,2    |  1.56
150 |  1.78 | 95.06 | 91.47,98.19 | 4.06 | 1,3       |  1.94
175 |  3.43 | 94.42 | 87.60,99.03 | 4.40 | 1.13,2.66 |  1.92

COpbcG, 2α = 20°
all cases | 0 | 0 | 0,0 | nan | 0,0 | 0

COpbcApbc, 2α = 36°
75  | 12.44 | 77.48 | 56.46,91.81 | 13.13 | 1.13,9.13 |  3.62
100 | 20.13 | 79.62 | 53.65,98.98 | 12.05 | 1,10.66   |  3.94
125 | 30.67 | 76.89 | 50.53,97.92 | 11.58 | 1,34      |  5.40
150 | 35.11 | 78.47 | 52.07,96.09 | 10.60 | 1,31.13   |  6.90
175 | 48.57 | 77.76 | 49.97,98.10 | 10.54 | 1,50.13   | 11.57

COpbcApbc, 2α = 20°
75  |  1.13 | 70.61 | 57.60,91.54 |  0    | 1,1       |  1
100 |  2    | 73.89 | 69.45,79.80 |  9.50 | 1.13,2    |  1.58
125 |  2.67 | 71.78 | 58.67,84.98 | 12.45 | 1.13,2    |  1.75
150 |  4    | 71.67 | 54.18,92.19 | 14.10 | 1,3.66    |  1.91
175 |  7.43 | 75.50 | 54.69,94.01 | 12.87 | 1,8       |  2.74

CObcApbc, 2α = 20°
75  |  7.56 | 73.79 | 56.18,88.54 | 12.45 | 1,5       |  2.10
100 |  9.13 | 67.16 | 47.78,88.71 | 13.80 | 1,4.66    |  2.14
125 | 12.53 | 70.12 | 40.41,87.46 | 13.11 | 1,11.13   |  3.17
150 | 21.13 | 70.10 | 45.72,91.57 | 11.57 | 1,19.13   |  4.18
175 | 25.13 | 71.79 | 44.15,94.18 | 11.91 | 1,37      |  7.05

COpbcG, mixed AoV (80% at 20°, 20% at 36°)
75,100,125 | 0    |  0    | 0,0         | nan | 0,0 | 0
150        | 0.66 | 92.13 | 83.64,95.83 | 0   | 1,1 | 1
175        | 0.57 | 93.45 | 85.75,96.14 | 0   | 1,1 | 1

COpbcApbc, mixed AoV (80% at 20°, 20% at 36°)
75  |  3.11 | 81.89 | 78.13,89.02 |  8.15 | 1.13,2    |  1.58
100 |  3    | 69.83 | 65.50,74.55 |  8.18 | 1,3.66    |  1.89
125 |  4.80 | 78.58 | 69.52,90.92 |  8.03 | 1,3.13    |  1.56
150 |  8.67 | 78.12 | 56.41,97.59 | 13.71 | 1,5       |  1.95
175 | 10.19 | 76.60 | 50.4,95.47  | 13.48 | 1,8.66    |  2.62

CObcApbc, mixed AoV (80% at 20°, 20% at 36°)
75  |  9.13 | 81.48 | 69.18,93.72 |  9.72 | 1,5.66    |  2.06
100 |  6    | 80.10 | 62.82,90.16 | 11.81 | 1,3.66    |  1.94
125 | 10.93 | 73.15 | 47.14,92.14 | 14.43 | 1.13,9.13 |  3.65
150 | 20    | 72.12 | 45.53,95.94 | 12.19 | 1,16.66   |  4.83
175 | 20.95 | 75.15 | 43.01,97.57 | 12.59 | 1,18.13   |  5.15

3 Criticality-Based Scheduling of Randomly Deployed Nodes with Cover Sets

As said previously, the frame capture rate is an important parameter that defines the surveillance quality. In [12], we proposed to link a sensor's frame capture rate to the cardinality of its set of cover sets Co(v). In our approach we define two classes of applications, of low and high criticality. The criticality level can oscillate between a concave and a convex shape, as illustrated in Figure 3, with the following interesting properties:
– Class 1, "low criticality", does not need a high frame capture rate. This characteristic can be represented by a concave curve (figure 3(a), box A): most projections of x values are gathered close to 0.
– Class 2, "high criticality", needs a high frame capture rate. This characteristic can be represented by a convex curve (figure 3(a), box B): most projections of x values are gathered close to the maximum frame capture rate.

Fig. 3. Modeling criticality: (a) application classes, (b) the behavior curve functions

[12] proposes to use a Bezier curve to model the 2 application classes. The advantage of using Bezier curves is that with only three points we can easily define a ready-to-use convex (high criticality) or concave (low criticality) curve. In figure 3(b), P0(0, 0) is the origin point, P1(bx, by) is the behavior point and P2(hx, hy) is the threshold point, where hx is the highest cover-set cardinality and hy is the maximum frame capture rate determined by the sensor node hardware capabilities. As illustrated in Figure 3(b), by moving the behavior point P1 inside the rectangle defined by P0 and P2, we are able to adjust the curvature of the Bezier curve, therefore adjusting the risk (criticality) level r0 introduced in the introduction of this paper. According to this level, we define the risk function Rk, which operates on the behavior point P1 to control the curvature of the behavior (BV) function. Depending on the position of point P1, the Bezier curve morphs between a convex and a concave form. As illustrated in figure 3(b), the first and the last points delimit the curve frame. This frame is a rectangle and is defined by the source

Table 2. Capture rate in fps when P2 is at (12, 3)

r0 \ |Co(v)| |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  | 10  | 11  | 12
0            | .01 | .02 | .05 | 0.1 | .17 | .16 | .18 | .54 | .75 | 1.1 | 1.5 | 3
.1           | .07 | .15 | .15 | .17 | .51 | .67 | .86 | 1.1 | 1.4 | 1.7 | 2.1 | 3
.4           | .17 | .15 | .55 | .75 | .97 | 1.1 | 1.4 | 1.7 | 2.0 | 2.1 | 2.6 | 3
.6           | .16 | .69 | 1.0 | 1.1 | 1.5 | 1.8 | 2.0 | 2.1 | 2.4 | 2.6 | 2.8 | 3
.8           | .75 | 1.1 | 1.6 | 1.9 | 2.1 | 2.1 | 2.5 | 2.6 | 2.7 | 2.8 | 2.9 | 3
1            | 1.5 | 1.9 | 2.1 | 2.4 | 2.6 | 2.7 | 2.8 | 2.9 | 2.9 | 2.9 | 2   | 3

point P0(0, 0) and the threshold point P2(hx, hy). The middle point P1(bx, by) defines the risk level. We assume that this point moves along the second diagonal of the defined rectangle, i.e. by = hy − (hy/hx)·bx. Table 2 shows the corresponding capture rate for some relevant values of r0. The cover-set cardinality |Co(v)| ∈ [1, 12] and the maximum frame capture rate is set to 3 fps.
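The mapping from cover-set cardinality to capture rate can be sketched with a quadratic Bezier curve. Placing P1 at (hx(1 − r0), hy·r0) on the second diagonal is one plausible reading of how r0 selects P1; the exact mapping used in [12] may differ:

```python
def capture_rate(cardinality, r0, hx=12, hy=3.0):
    """Frame capture rate from |Co(v)| via the quadratic Bezier curve
    B(t) = (1-t)^2 P0 + 2t(1-t) P1 + t^2 P2, with P0 = (0, 0),
    P2 = (hx, hy) and the behavior point P1 on the second diagonal."""
    x = min(cardinality, hx)
    b1x, b1y = hx * (1 - r0), hy * r0     # assumed placement of P1
    # x(t) is non-decreasing on [0, 1], so invert it by bisection
    lo, hi = 0.0, 1.0
    for _ in range(60):
        t = (lo + hi) / 2
        if 2 * t * (1 - t) * b1x + t * t * hx < x:
            lo = t
        else:
            hi = t
    t = (lo + hi) / 2
    return 2 * t * (1 - t) * b1y + t * t * hy
```

As in table 2, nodes with the highest cardinality capture at the maximum rate (3 fps here), and a higher r0 yields a convex curve with higher rates at every cardinality.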

4 Fast Event Detection with Criticality Management

In this section we evaluate the performance of an intrusion detection system by investigating the stealth time of the system. For this set of simulations, 150 sensor nodes are randomly deployed in a 75m × 75m area. Unless specified otherwise, sensors have a 36° AoV and the COpbcApbc strategy is used to construct cover sets. Each sensor node captures at a given number of frames per second (between 0.01 fps and 3 fps) according to the model defined in figure 3(b). Nodes with 12 or more cover sets capture at the maximum speed. A simulation ends when there are no active nodes anymore.

4.1 Static Criticality-Based Scheduling

We ran simulations for 4 levels of criticality: r0 = 0.1, 0.4, 0.6 and 0.8. The corresponding capture rates are those shown in table 2. Nodes with a high capture rate use more battery power until they run out of battery (the initial battery level is 100 units and 1 captured image consumes 1 unit) but, according to the scheduling model, nodes with a high capture rate are also those with a large number of cover sets. Note that it is the number of valid cover sets that defines the capture rate, and not the number of cover sets found at the beginning of the cover set construction procedure. In order to show the benefit of the adaptive behavior, we computed the mean capture rate for each criticality level and then used that value as a fixed capture rate for all the sensor nodes in the simulation model: r0 = 0.1 gives a mean capture rate of 0.12 fps, r0 = 0.4 gives 0.56 fps, r0 = 0.6 gives 0.83 fps and r0 = 0.8 gives 1.18 fps. Table 3 shows the network lifetime for the various criticality and frame capture rate values. Using the adaptive frame rate is very efficient, as the network lifetime is 2900s for r0 = 0.1 while the 0.12 fps fixed capture rate lasts only 620s. In order to evaluate further the quality of surveillance, we show in figure 4(top) the mean


Table 3. Network lifetime

r0 = 0.1 | 0.12 fps | r0 = 0.4 | 0.56 fps | r0 = 0.6 | 0.83 fps | r0 = 0.8 | 1.18 fps
2900s    | 620s     | 1160s    | 360s     | 560s     | 240s     | 270s     | 170s

stealth time when r0 = 0.1, fps = 0.12, r0 = 0.4 and fps = 0.56, and in figure 4(bottom) the case when r0 = 0.6, fps = 0.83, r0 = 0.8 and fps = 1.18. The stealth time is the time during which an intruder can travel in the field without being seen. The first intrusion starts at time 10s at a random position in the field. The scan-line mobility model is then used, with a constant velocity of 5 m/s, to make the intruder move towards the right part of the field. When the intruder is seen for the first time by a sensor, the stealth time is recorded and the mean stealth time computed. Then a new intrusion appears at another random position. This process is repeated until the simulation ends.


Fig. 4. Mean stealth time. Top: r0 = 0.1, fps = 0.12, r0 = 0.4, fps = 0.56. Bottom: r0 = 0.6, fps = 0.83, r0 = 0.8, fps = 1.18.
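The intrusion loop described above can be sketched as follows. The `detects` callback is hypothetical: it stands in for the FoV test of the currently active nodes, since node scheduling and energy are not modelled in this sketch:

```python
import random

def simulate_stealth(detects, field=75.0, speed=5.0, dt=0.1,
                     sim_end=500.0, seed=1):
    """Scan-line intrusion model: an intruder appears at a random position,
    moves right at `speed` m/s, and the stealth time (time until `detects`
    first sees it) is recorded; then a new intrusion starts. Returns the
    mean stealth time, or None if no intrusion was ever detected."""
    rnd = random.Random(seed)
    t, stealth_times = 10.0, []   # first intrusion starts at t = 10s
    while t < sim_end:
        x, y = rnd.uniform(0, field), rnd.uniform(0, field)
        start = t
        while t < sim_end and x <= field:
            if detects((x, y)):
                stealth_times.append(t - start)
                break
            x += speed * dt
            t += dt
        t += dt                   # next intrusion appears
    return sum(stealth_times) / len(stealth_times) if stealth_times else None
```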

Figure 5(left) shows, for a criticality level r0 = 0.6, the special case of small-AoV sensor nodes. When 2α = 20°, we compare the stealth time under the COpbcGpbc and CObcGpbc strategies. Discarding point p in the cover set construction procedure gives a larger number of nodes with a larger number of cover sets, as shown previously in table 1. In figure 5(left) we can see that the stealth time is very close to the COpbcGpbc case, while the network lifetime almost doubles, reaching 420s instead of 212s. The explanation is as follows: as more nodes have cover sets, they can act as sentry nodes, allowing the other nodes to be in sleep mode while ensuring a high responsiveness of the network.



Fig. 5. Left: Stealth time, sliding winavg with a 20-sample batch, r0 = 0.6, AoV = 20°, COpbcGpbc and CObcGpbc. Right: Rectangle with 8 significant points, initial sensor v and 2 different cover sets.

In addition, for the particular case of disambiguation, we introduce an 8 m × 4 m rectangle at random positions in the field. COpbcGpbc is used and 2α = 36°. The rectangle has 8 significant points, as depicted in Figure 5(right), and moves at a velocity of 5 m/s in a scan-line mobility model (left to right). Each time a sensor node covers at least 1 significant point, or when the rectangle reaches the right boundary of the field, it reappears at another random position. This process starts at time t = 10 s and is repeated until the simulation ends. The purpose is to determine how many significant points are covered by the initial sensor v and how many can be covered by using one of v's cover sets. For instance, Figure 5(right) shows a scenario where v's FoV covers 3 points, the left cover set ({v3, v1, v4}) covers 5 points, while the right cover set ({v3, v2, v4}) covers 6 points. In the simulations, each time a sensor v covers at least 1 significant point of the intrusion rectangle, it determines how many significant points are covered by each of its cover sets. The minimum and maximum numbers of significant points covered by v's cover sets are recorded, along with the number of significant points v was able to cover initially. Figure 6 shows these results using a sliding-window averaging filter with a batch window of 10 samples. We can see that a node's cover sets always succeed in identifying more significant points. Figure 7 shows that with the rectangle intrusion (which could represent a group of intruders instead of a single intruder) the stealth time can be further reduced.
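As an illustration of how coverage of the significant points might be tested, the following sketch models a camera's FoV as a simple circular sector (position, viewing direction, half-angle α, depth of view). This sector model, the parameter names, and the choice of corners plus edge midpoints as the 8 significant points are assumptions of the sketch, not the paper's exact FoV model:

```python
import math

def points_in_fov(points, cam_xy, cam_dir, half_angle, dov):
    """Count significant points inside a camera's FoV, modeled here as a
    circular sector: within depth-of-view `dov` of the camera and within
    `half_angle` (alpha, radians) of the viewing direction `cam_dir`."""
    cx, cy = cam_xy
    n = 0
    for (px, py) in points:
        dx, dy = px - cx, py - cy
        if math.hypot(dx, dy) > dov:
            continue                         # beyond the depth of view
        ang = math.atan2(dy, dx)
        # signed angular difference wrapped to [-pi, pi)
        diff = (ang - cam_dir + math.pi) % (2 * math.pi) - math.pi
        if abs(diff) <= half_angle:
            n += 1
    return n

def rectangle_points(x, y, w=8.0, h=4.0):
    """8 significant points of a w x h rectangle anchored at (x, y):
    assumed here to be the 4 corners and the 4 edge midpoints."""
    return [(x, y), (x + w / 2, y), (x + w, y),
            (x, y + h / 2), (x + w, y + h / 2),
            (x, y + h), (x + w / 2, y + h), (x + w, y + h)]
```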


Fig. 6. Number of covered points of an intrusion rectangle. Sliding winavg of 10.

Scheduling Randomly-Deployed Heterogeneous Video Sensor Nodes

313

Fig. 7. Stealth time, winavg with 10-sample batch, r0 = 0.8, fps = 1.18, and r0 = 0.8 with rectangle intrusion

4.2 Dynamic Criticality-Based Scheduling

In this section we present preliminary results on dynamically varying the criticality level during the network lifetime. The purpose is to set the surveillance network in an alerted mode (high criticality value) only when needed, i.e., upon intrusions. With the same network topology as in the previous simulations, we set the initial criticality level of all the sensor nodes to r0 = 0.1. As shown in the previous simulations, some nodes with a large number of cover sets will act as sentries in the surveillance network. When a sensor node detects an intrusion, it sends an alert message to its neighbors and increases its criticality level to r0 = 0.8. Alerted nodes will then also increase their criticality level to r0 = 0.8. Both the node that detects the intrusion and the alerted nodes will run at a high criticality level for an alerted period, noted Ta, before going back to r0 = 0.1. Nodes may be alerted several times, but an already alerted node will not increase its Ta value any further in this simple scenario. As said previously, we do not attempt here to optimize the Ta value, nor to use several levels of criticality. Figure 8 shows the mean stealth time with this dynamic behavior. Ta is varied from 5s to 60s. We can see that this simple dynamic scenario already succeeds in reducing the mean stealth time while increasing the network lifetime when compared to a static scenario that provides the same level of service.
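The alert behavior described above can be sketched as follows; the class and method names are illustrative, and the neighbor list stands in for whatever one-hop alert dissemination the real nodes use:

```python
R_LOW, R_HIGH = 0.1, 0.8

class SensorNode:
    """Sketch of dynamic criticality: nodes idle at r0 = 0.1, jump to
    r0 = 0.8 on detecting (or being alerted of) an intrusion, and fall
    back after an alert period Ta. An already alerted node does not
    extend its alert period, as in the simple scenario of the text."""

    def __init__(self, node_id, neighbors=None, ta=20.0):
        self.node_id = node_id
        self.neighbors = neighbors if neighbors is not None else []
        self.ta = ta                    # alerted period Ta (seconds)
        self.r0 = R_LOW
        self.alert_until = 0.0

    def on_intrusion(self, now):
        """Detecting node: raise own criticality and alert neighbors."""
        self._raise(now)
        for nb in self.neighbors:       # alert message to 1-hop neighbors
            nb.on_alert(now)

    def on_alert(self, now):
        self._raise(now)

    def _raise(self, now):
        if self.r0 == R_HIGH:           # already alerted: do not extend Ta
            return
        self.r0 = R_HIGH
        self.alert_until = now + self.ta

    def tick(self, now):
        """Periodic check: revert to low criticality once Ta has elapsed."""
        if self.r0 == R_HIGH and now >= self.alert_until:
            self.r0 = R_LOW
```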


Fig. 8. Mean stealth time with dynamic criticality management

5 Conclusions

This paper presented the performance of cover-set construction strategies and dynamic criticality scheduling that enable fast event detection for mission-critical surveillance with video sensors. We focused on taking into account cameras with heterogeneous angles of view, including those with a very small angle of view. We showed that our approach improves the network lifetime while providing a low stealth time for intrusion detection. Preliminary results with dynamic criticality management also show that the network lifetime can be further increased. These results show that, besides providing a model for translating a subjective criticality level into a quantitative parameter, our approach for video sensor nodes also optimizes resource usage by dynamically adjusting the provided service level. Acknowledgment. This work is partially supported by the FEDER POCTEFA EFA35/08 PIREGRID project, the Aquitaine-Aragon OMNI-DATA project and by the PHC Tassili project 09MDU784.


An Integrated Routing and Medium Access Control Framework for Surveillance Networks of Mobile Devices Nicholas Martin1, Yamin Al-Mousa1, and Nirmala Shenoy2 1 College of Computing and Information Science, Networking Security and Systems Administration Dept, Rochester Institute of Technology, 1 Lomb Dr, Rochester, NY USA 14623 [email protected], {ysa49,nxsvks}@rit.edu 2

Abstract. In this paper we present an integrated solution that combines routing, clustering and medium access control operations while basing them on a common meshed tree algorithm. The aim is to achieve an efficient airborne surveillance network of unmanned aerial vehicles, wherein any loss of captured data is kept to the minimum while maintaining low latency in packet and data delivery. Surveillance networks of varying sizes were evaluated with varying numbers of senders, while the physical layer was maintained invariant. Keywords: meshed trees, burst forwarding medium access control, surveillance.

1 Introduction
Mobile Ad Hoc Networks (MANETs) of unmanned aerial vehicles (UAVs) face severe challenges in delivering surveillance data without loss of information to specific aggregation nodes. Depending on the time sensitivity of the captured data, the end-to-end packet and file delivery latency can also be critical metrics. Two major protocols from a networking perspective that can impact lossless and timely delivery are the medium access control (MAC) and routing protocols. Physical layer and transport layer protocols certainly play a major role as well; however, we limit the scope of this work to MAC and routing protocols. These types of surveillance networks require several UAVs to cover a wide area, while the UAVs normally travel at speeds of 300 to 400 km/h. These features pose additional significant challenges to the design of MANET routing and MAC protocols, as they must now be both scalable and resilient, able to handle the frequent route breaks due to node mobility. The predominant traffic pattern in surveillance networks is converge-cast, where data travels from several nodes to an aggregation node. We leverage this property in the proposed solution. We also integrate routing and MAC functions into a single protocol layer, where both routing and MAC operations are achieved with a single address. The routing protocol uses the inherent path information contained in the addresses, while the MAC uses the same addresses for hop-by-hop packet forwarding. Data aggregation or converge-cast types of traffic are best handled through multi-hop clustering, wherein a cluster head (CH) is a special type of node that aggregates

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 315–327, 2011. © Springer-Verlag Berlin Heidelberg 2011

316

N. Martin, Y. Al-Mousa, and N. Shenoy

the data and manages the cluster. The solution we propose uses one such clustering scheme that is based on a ‘meshed tree’ principle [1], where the root of the meshed tree is the CH. As the cluster is a tree, the branches connecting the cluster clients (CCs) to the CH provide a path to send the data from the CCs to the CH. Thus, a clustering mechanism is integrated into the routing and MAC framework. The ‘meshing’ of the tree branches allows one node to reside in multiple tree branches that originate from the root, namely the CH. The duration of residency on a branch depends on the movement pattern and speeds of the nodes. Thus, as nodes move, they may leave one or more branches and connect to new branches. Most importantly, even if a node loses one path to the CH, it likely remains connected to the CH via another branch and thus has an alternate path. The clustering scheme also allows for the creation of several overlapped multi hop clusters leading to the notion of multi meshed trees (MMT). The overlap is achieved by allowing the branches of one meshed tree to further mesh with the branches in neighboring clusters. This provides connectivity to cluster clients moving across clusters. It also helps extend the coverage area of the surveillance network to address scalability.

2 Related Work
The topic areas of major contribution in this article relate to routing protocols, clustering algorithms and MAC protocols for mobile ad hoc networks. The significance of this framework solution lies in the closely integrated operation of routing, clustering and MAC. To the best of our knowledge, no solution in the literature published thus far targets such an approach. Cross-layered approaches, which break down the limitations of inter-layer communication to facilitate more effective integration and coordination between protocol layers, are one class of approaches with similar goals. However, our solution is not a cross-layered approach. We felt that in a dedicated and critical MANET application, such as a surveillance network, one should not be constrained by the protocol layers or stacks, but should achieve the operations through efficient integration of the required functions. For the above reasons, it is difficult to cite and discuss related work with an approach similar to ours. However, as we use a multi-hop clustering scheme, we will highlight multi-hop clustering algorithms discussed in the literature. Our solution includes a routing scheme, so we will discuss some proactive, reactive and hybrid routing algorithms to highlight the differences in the proposed routing scheme. We will cite some framework solutions that combine clustering and routing to explain the difference in the approaches. The MAC adopted in this work is based on CSMA/CA, but uses the addresses adopted in our solution to achieve efficient data aggregation by sending several packets in a burst, i.e., a sequence of packets. Several survey articles published on MANET routing and clustering schemes from different perspectives indicate the continuing challenges in this topic area [2, 3].
Proactive routing protocols require periodic dissemination of link information so that a node can use standard algorithms such as Dijkstra's to compute routes to all other nodes in the network or in a given zone [4]. Link information dissemination requires flooding of messages that contain such information. In large networks such transmissions or control messages can consume significant amounts of the bandwidth,

An Integrated Routing and Medium Access Control Framework

317

making the proactive routing approach unscalable. Several proactive routing protocols thus target mechanisms to reduce this control overhead, i.e., the bandwidth used for control messages. Fisheye State Routing (FSR) [8], Fuzzy Sighted Link State/Hazy Sighted Link State [10], Optimized Link State Routing [6] and Topology Broadcast based on Reverse-Path Forwarding [9] are some such approaches. Reactive routing protocols avoid the periodic link information dissemination and let a node discover routes to a destination node only when it has data to send to that destination. The reactive route discovery process can result in the source node receiving several route responses, which it may cache. As mobility increases, route caching may become ineffective, as pre-discovered routes may become stale and unusable. Dynamic Source Routing (DSR) [5], Ad Hoc On-demand Distance Vector (AODV) [4], Temporally Ordered Routing Algorithm [3] and Light-Weight Mobile Routing [13] are some of the more popular reactive routing approaches. Partitioning a MANET physically or logically and introducing hierarchy has been used to limit message flooding and also to address scalability. Mobile Backbone Networks (MBNs) [14] use hierarchy to form a higher-level backbone network by utilizing special backbone nodes with low mobility that have an additional powerful radio to establish wireless links amongst themselves. LANMAR [13], Sharp Hybrid Adaptive Routing Protocol (SHARP) [15], Hybrid Routing for Path Optimality [11] and Zone Routing Protocol (ZRP) [12] are protocols in this category. Nodes physically close to each other form clusters, with a CH communicating to other nodes on behalf of the cluster. Different routing strategies can be used inside and outside a cluster. Several cluster-based routing protocols address the scalability issues faced in MANETs. Clusterhead Gateway Switch Routing (CGSR) [16] and Hierarchical State Routing (HSR) [17] are two popular cluster-based routing schemes.

3 Meshed Tree Clustering
It is important to understand the cluster formation in the clustering scheme under consideration and the routing capabilities within the cluster for data aggregation at the CH. The multi-hop clustering scheme and the cluster formation based on the 'meshed tree' algorithm are described with the aid of Figure 1. The dotted lines connect nodes that are in communication range of one another at the physical layer. The data aggregation node or cluster head is labeled 'CH'. Nodes A through G are the CCs.

Fig. 1. Cluster Formation Based on Meshed Trees


At each node several 'values' have been noted. These are the virtual IDs (VIDs) assigned to the node when it joins the cluster. In Figure 1, each arrow from the CH is a branch of connection to the CCs. Each branch is a sequence of VIDs assigned to the CCs connecting at different points of the branch. The branch denoted by VIDs 14, 142 and 1421 connects nodes C (via VID 14), F (via VID 142) and E (via VID 1421) respectively to the CH. Assuming that the CH has the VID '1', the CCs in this cluster will have '1' as the first prefix in their VIDs. Any CC that attaches to a branch is assigned a VID, which inherits its prefix from its parent node, followed by an integer indicating the child number under that parent. This pattern of inheriting the parent's VID will be clear if the reader follows the branches identified in Figure 1 by the arrows. The meshed tree cluster is formed in a distributed manner: a node listens to its neighbor nodes advertising their VIDs and decides to join any or all of the branches noted in the advertised VIDs. A VID contains information about the number of hops from the CH. This is inherent in the VID length, which can then be used by a node to decide the branch it would like to join if shortest hop count is a criterion. Once a node decides to join a branch, it has to inform the CH. The CH then registers the node as its CC, confirms its admittance to the cluster and accordingly updates a VID table of its CCs. A CH can restrict admittance to nodes that are within a certain number of hops, or stop admitting new nodes, to keep the number of CCs in the cluster under a certain value. This is useful to contain the data collection zone of a cluster. Routing in the Cluster: The branches of the meshed tree provide the routes to send and receive data and control packets between the CCs and the CH. As an example, consider packet routing where the CH has a packet to send to node E. The CH may decide to use the path given by VID 1421 to E.
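The VID assignment rules can be sketched as below; the single-digit child index is an assumption that matches the VIDs shown in Figure 1:

```python
def assign_vid(parent_vid, child_index):
    """A joining node's VID inherits its parent's VID as a prefix, followed
    by an integer giving the child number under that parent: the 2nd child
    of the node with VID '14' gets VID '142'. A single-digit child index is
    assumed, as in the figures."""
    return parent_vid + str(child_index)

def hops_to_ch(vid):
    """The hop count to the CH is inherent in the VID length: the CH's VID
    is a single digit, and each extra digit is one more hop down a branch."""
    return len(vid) - 1

def shortest_branch(advertised_vids):
    """A node hearing several neighbors advertise VIDs can pick the branch
    with the fewest hops to the CH, if shortest hop count is the joining
    criterion."""
    return min(advertised_vids, key=len)
```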
The CH will include its VID '1' as the source address and E's VID 1421 as the destination address, and broadcast the packet. The nodes that will perform the hop-by-hop forwarding are nodes C and F. This is so because, from the source VID and destination VID, C knows that it is the next hop en route: it has VID 14, and the packet came from VID '1' and is destined to 1421, i.e., it uses a path vector concept. When C subsequently broadcasts the packet, F will receive it and eventually forward it to E. The VID of a node thus provides a virtual path vector from the CH to itself. Note that the CH could also have used VIDs 143 or 131 for node E, in which case the path taken by the packet would have been CH-C-E or CH-D-E respectively. Thus between the CH and node E there are multiple routes, as identified by the multiple VIDs. The support for multiple routes through multiple VIDs allows for robust and dynamic route adaptability to topology changes in the cluster.
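The next-hop decision in this example can be sketched as a prefix test on VIDs. Here `prev_vid` is taken to be the VID of the node the packet was last heard from, which is an assumption of this sketch; `next_vid_upward` covers the converse converge-cast direction used later by the MAC:

```python
def is_next_hop(my_vid, prev_vid, dst_vid):
    """Path-vector check sketched from the text's example: a node with VID
    '14' overhearing a packet from VID '1' destined to '1421' knows it is
    next en route, because its VID extends the previous hop's VID by one
    level and is itself a prefix of the destination VID."""
    return (dst_vid.startswith(my_vid)
            and my_vid.startswith(prev_vid)
            and len(my_vid) == len(prev_vid) + 1)

def next_vid_upward(sender_vid):
    """For converge-cast toward the CH, the next VID up the branch is the
    sender's VID with its last digit dropped (e.g. '121' -> '12')."""
    return sender_vid[:-1]
```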

Fig. 2. Overlapped Cluster Formation Based on Meshed Trees


Route Failures: Capturing all data without loss is very important in surveillance networks used in tactical applications. Loss of data can be caused by route failures or by collisions at the MAC. There are two cases of route failures that can occur, yet be swiftly rectified, in the proposed solution. In the first case, a node may be in the process of sending data, and may even have sent part of the data using a particular VID, only to discover that said VID or path is no longer valid. In the second case, a node may be forwarding data for another node, but after collecting and forwarding a few data packets, this forwarding node also loses the VID that was being used. Case I: Source node loses a route: For example, node B in Figure 2 is sending a 1 MB file to the CH using its shortest VID '11'. Assume that node B was able to send ½ MB, at which time, due to its mobility, it lost its VID '11' but was still able to continue with VID '121' and send the remaining ½ MB of data using VID '121'. Case II: Intermediate node loses a route: Let us continue the above example. Node A is forwarding the data from node B on its VID 12 (the data comes from node B via its VID 121). After forwarding ¼ MB, assume that node A moves in the direction of node D, loses its VID 12 but gains a new VID '131' as it joins the branch under node D. Node A can continue sending the rest of the file using its new VID 131. As the knowledge about the destination node is consistent (i.e., it is the CH with VID '1'), any node is able to forward the collected data towards the CH. Disconnects: In a disconnect situation, a missing VID link may first be noticed by the parent or the child of the node with whom the link is shared. In such cases, the parent node will inform the CH of the missing child VID, so that the CH will not send any messages to it.
Meanwhile the child node, which is downstream on the branch, will notify its children about their lost VIDs (VIDs derived from the missing VID) so that they invalidate those VIDs and hence do not use them to send data to the CH. Inter-cluster Overlap and Scalability: As a surveillance network can have several tens of nodes, the proposed solution must be scalable. We assume that several data aggregation nodes (i.e., CHs) are uniformly distributed among the non-data-aggregation nodes during deployment of the surveillance network. Meshed tree clusters can be formed around each of the data aggregation nodes by taking them to be the CHs. Nodes bordering two or more clusters are allowed to join branches originating from different CHs, and will accordingly inform their respective CHs about their multiple VIDs under the different clusters. When a node moves away from one cluster, it can still be connected to other clusters, and the surveillance data collected by that node will not be lost. Also, by allowing nodes to belong to multiple clusters, the single meshed tree cluster-based data collection can be extended to multiple overlapping meshed tree clusters that collect data from several nodes deployed over a wider area with a very low probability of losing the captured data. Figure 2 shows two overlapped clusters and some border nodes that share multiple VIDs across the two clusters. The concept is extendable to several neighboring clusters. Nodes G and F have VIDs 142 and 132 under CH1 and VIDs 251 and 252 under CH2, respectively. Note that a node is aware of the cluster under which it has a VID, as the information is inherent in the VIDs it acquires; thus a node has some intelligence to decide which VIDs it would like to acquire, i.e., it can decide to have several VIDs under one cluster, or acquire VIDs that span several clusters, and so on.
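A minimal sketch of this VID bookkeeping, covering the failover of Cases I and II and the invalidation of all VIDs derived from a lost VID; the class structure and method names are illustrative:

```python
class ClusterClient:
    """Sketch of per-node VID state: a node holds several VIDs (one per
    branch it resides on); if the VID in use is lost mid-transfer, it
    simply continues sending on any remaining valid VID, as in Cases I
    and II of the text."""

    def __init__(self, vids):
        self.vids = set(vids)            # VIDs currently held, e.g. {'11', '121'}

    def sending_vid(self):
        # prefer the shortest VID (fewest hops to the CH), as in the example
        return min(self.vids, key=len) if self.vids else None

    def lose_vid(self, vid):
        # the lost VID, and all child VIDs derived from it, become invalid
        self.vids = {v for v in self.vids if not v.startswith(vid)}

    def gain_vid(self, vid):
        # acquired by joining a new branch (possibly in another cluster)
        self.vids.add(vid)
```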


Significance of the Approach: From the meshed tree based clustering and routing scheme described thus far, it should be clear that our scheme adopts a proactive routing approach, where the proactive routes between the CCs and the CH in a cluster are established as the meshed trees, or clusters, are formed around each CH. Thus, using a single algorithm, a node automatically acquires routes to the CH during the cluster joining process. There is flexibility in dimensioning the cluster in terms of the number of CCs in a cluster and the maximum number of hops a CC is allowed from a CH. The tree formation is different from the spanning trees discussed in the literature, as a node is allowed to reside simultaneously in several branches, thus allowing for dynamic adaptability to route changes as nodes move. This also enhances the robustness of connectivity to the CH. This approach is ideal for data aggregation from the CCs to the CH, and is very suitable for MANETs with highly mobile nodes.

4 Burst Forwarding Medium Access Control Protocol
The Burst Forwarding Medium Access Control (BF-MAC) protocol focuses primarily on reducing collisions while providing the capability of MAC forwarding of multiple data packets from one node to another node in the same cluster. Additionally, the MAC allows for sequential 'node' forwarding, where all intermediate nodes forward a burst of packets one after another in sequence between a source and a destination node through multiple hops. These capabilities are created through the careful creation of MAC data sessions, which encompass the time necessary to burst multiple packets across multiple hops. For non-data control packets, such as those from the routing and cluster formation process, the MAC uses a system based on Carrier Sense Multiple Access/Collision Avoidance (CSMA/CA).


Fig. 3. Illustration of traffic forwarding along a single tree branch

The above type of MAC forwarding is possible due to the VIDs, which carry information about a node's CH and the intermediate nodes, and which the MAC makes use of. As such, a node's data will physically travel up a VID branch to the CH in that tree. Therefore, by knowing which VID was used by a node to send a data packet, and that packet's intended destination (the CH), an overhearing node can determine the next VID in the path. This process is used by all overhearing nodes to forward, in their turn, a packet all the way to the CH. This is illustrated in Figure 3: when the node with VID 121 has data to send to CH1, the intermediate node with VID 12 will pick the packet up and forward it to the CH. The MAC process at a node that has data to send creates a MAC data session. A Request to Send (RTS) packet is sent by the node and is forwarded by the intermediate nodes till it reaches the CH. When a recipient node (i.e. a forwarding


node) along the path receives the RTS, it becomes part of the data session. A set of data packets may then be sent to the intended destination, in this case the CH, along the same path as the RTS packet. The final node in the path, the CH, will send an explicit acknowledgement (eACK) packet to the previous node as a reliability check. eACKs are not forwarded back to the initial sender. Nodes in the path of the data session, except for the penultimate node, instead listen for the packet just sent to the next node. This packet will be the same packet being forwarded by the next node in the data session path (be it either an RTS or a data packet). Receiving this packet is an implicit acknowledgment (iACK), as the next node must have received the sent packet if it is now attempting to forward it. Note that the iACK is really the forwarded RTS or data packet. Not receiving any type of acknowledgment will cause a node to use the MAC retry model, discussed below. During a data session, collisions from neighboring nodes are prevented in the same way as by the collision avoidance mechanism in CSMA/CA. Nodes that hear a session in progress keep silent. When a node overhears an RTS, eACK or data packet for which it is not the destination or the next node in line to forward, it will switch to a Not Clear to Send (NCTS) mode. This prevents a node from sending any control packets or joining a data session. If a node is already part of a separate data session, the node will continue with that data session. The NCTS mode lasts for a duration specified as the Session on Wait (SOW) time, noted in the packets being transmitted during the session. The SOW time is calculated by the initial sender within a data session and marks the amount of time left for that data session. At each hop, it is decremented by the transmission time of the current packet to send plus a guard time to account for propagation delay, as shown in Figure 4.
When SOW time has elapsed, the data session is over and all nodes return to a Clear to Send (CTS) mode. A node in CTS mode may start a new data session, join a data session via forwarding, or send control packets.
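The per-hop SOW decrement of Figure 4 can be sketched as follows; the function name and the flat per-hop decrement are assumptions of the sketch:

```python
def sow_schedule(p, n, hops):
    """Sketch of SOW dissemination along a path: the initial sender computes
    the session-on-wait time p (time for all remaining packets), and each
    hop re-advertises it decremented by n, the per-packet transmission time
    plus a guard time for propagation delay (p-n, p-2n, ... as in Figure 4)."""
    sow = p
    out = []
    for _ in range(hops):
        sow -= n                 # one packet transmitted and forwarded per hop
        out.append(sow)
    return out
```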

Fig. 4. Illustration of the dissemination of SOW timings. 'p' represents the time necessary for all remaining packets to be sent, while 'n' represents the time to transmit a single packet plus propagation delay.

Control packets from the routing and clustering process are queued and sent using CSMA/CA whenever a node is in CTS mode. To take further advantage of the MAC's data sessions in preventing possible collisions, nodes are also allowed to send control packets within a data session by extending the SOW time by a fixed amount. Retry Model: The MAC stores any RTS or data packet sent in a retry queue. Until an eACK or iACK is heard for that packet, the packet will be retried up to three times within a single data session. Nodes will continue to receive data and issue eACKs for


data packets while retrying the other packet. At the end of the data session, nodes will move any outstanding packets into their own data queues and send them subsequently, acting as the initial sender. If a packet fails to be sent in two separate data sessions, an error report is sent to the routing and clustering process for further action. The MAC thus brings the added capability of any node taking over and forwarding packets to the destination, the CH, using the VIDs to burst-forward packets from the CCs to the CH. This is the uniqueness of the proposed solution and the primary reason for integrating the different operations, given the natural dependency of all three schemes upon the one algorithm. Separating them into different layers would have resulted in suboptimal performance of the framework, which would not be an efficient solution for critical applications such as surveillance networks.
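A sketch of the retry model under the stated limits (three retries per session, error report after failing in two separate sessions); the class name and the packet-id bookkeeping are illustrative:

```python
MAX_RETRIES_PER_SESSION = 3
MAX_FAILED_SESSIONS = 2

class RetryQueue:
    """Sketch of the BF-MAC retry model: a sent packet stays queued until
    an eACK/iACK is heard, is retried up to three times within a session,
    and after failing in two separate sessions an error is reported to the
    routing and clustering process."""

    def __init__(self):
        self.pending = {}                    # packet id -> [retries, failed_sessions]

    def on_send(self, pkt_id):
        self.pending.setdefault(pkt_id, [0, 0])

    def on_ack(self, pkt_id):
        self.pending.pop(pkt_id, None)       # eACK or iACK heard: done

    def should_retry(self, pkt_id):
        state = self.pending[pkt_id]
        if state[0] < MAX_RETRIES_PER_SESSION:
            state[0] += 1
            return True
        return False

    def end_session(self):
        """Outstanding packets move to the node's own data queue for the
        next session; return packet ids that failed in two sessions."""
        errors = []
        for pkt_id, state in self.pending.items():
            state[1] += 1
            state[0] = 0                     # retry budget resets per session
            if state[1] >= MAX_FAILED_SESSIONS:
                errors.append(pkt_id)        # error report upward
        for pkt_id in errors:
            del self.pending[pkt_id]
        return errors
```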

5 Simulation and Performance
While there are numerous routing and cluster-based routing algorithms proposed in the literature, they have not been evaluated for the type of surveillance applications stated in this article, nor are their performance metrics the same as ours. Hence the results published for these algorithms cannot be compared with ours, nor would it be reasonable for us to model every solution we deem suitable for a comparative study with the proposed solution. We therefore decided to conduct our comparison with two well-known routing protocols, OLSR and AODV. The first is a proactive routing protocol and the second is a reactive routing protocol. We use the proactive routing protocol OLSR to evaluate and compare the performance of our solution in small networks of around 20 nodes. Furthermore, to make the studies comparable, we designated certain nodes as data collection nodes and as the destination for data-sending nodes in their vicinity, such that we overcome the cluster formation problem. We used the reactive routing protocol to evaluate and compare the performance from the control overhead perspective in networks of sizes 50 and 100. In this case also, the collection nodes were designated as the destination for nodes in their vicinity. For completeness, we evaluated OLSR, AODV and the MMT for all 20, 50 and 100 node scenarios, with varying numbers of senders. For OLSR and AODV we used the custom-developed 802.11 CSMA/CA models available with Opnet. This work was conducted as part of an ONR-funded project, where we were expected to use the ns2 simulation tool; we used ns-2.34. The OLSR and AODV models available in ns2 were not designed to operate in network scenarios such as those outlined above, hence we used the custom-developed models of OLSR and AODV in Opnet.
These Opnet models provide flexibility in selecting optimal parameters, and thus optimal operational conditions, through proper setting of retry times and of the intervals for sending 'hello', 'topology control' and other control messages for OLSR and AODV. The scenario setup for the MMT solution in ns2, however, faced constraints due to the random placement and selection of sending nodes, as compared to selecting the closest nodes to send to the closest designated destination (alias the CH) as in Opnet. We therefore recorded the average hops between a source and destination node in all our test scenarios, to serve as a baseline for comparison.

An Integrated Routing and Medium Access Control Framework

323

Simulation parameters: The transmission range was maintained at approximately 10 km. The data rate was set to 11 Mbps, a standard 802.11 data rate. No error correction was used for transmitted packets, and any packet with even a single bit error was dropped. Circular trajectories with radii of 10 km were used; the reason for using circular trajectories was to introduce more stress into the test scenarios, as they result in more route breaks than the elliptical trajectories that would normally be used. Some trajectories used clockwise movement and others anti-clockwise movement, again to stress the test scenarios. UAV node speeds varied between 300 and 400 km/h. The 'hello' interval was maintained at 10 seconds. These scenario parameters were kept consistent across all test scenarios. The performance metrics targeted were:

• Success rate: the percentage of packets successfully delivered to the destination node.
• Average end-to-end packet delivery latency, in seconds.
• Overhead: the ratio of control bits to the sum of control and data bits during data delivery, chosen for comparability with reactive routing.

All of the above performance metrics were recorded, along with the average number of hops between sender and receiver nodes, for 20, 50 and 100 nodes, with the number of sending nodes varied per test scenario. Each data session transferred a 1 MB file in 2 KB packets; in a session, all senders began sending their 1 MB files simultaneously towards the CH. We provide an in-depth explanation of the 20-node graphs; the 50- and 100-node graphs show similar trends, so we do not repeat the explanations.
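For concreteness, the three metrics can be computed from simple per-run counters, as in the following Python sketch (our own illustration, not part of the simulation code; all names are hypothetical):

```python
def success_rate(sent, delivered):
    """Percentage of sent packets that reached the destination."""
    return 100.0 * delivered / sent

def avg_latency(latencies):
    """Mean end-to-end delivery latency (seconds) over delivered packets only."""
    return sum(latencies) / len(latencies)

def overhead(control_bits, data_bits):
    """Control bits as a fraction of all bits carried during delivery,
    matching the definition used for comparability with reactive routing."""
    return control_bits / (control_bits + data_bits)

# Example: one 1 MB file sent as 2 KB packets gives 512 packets per sender.
packets_per_file = (1 * 1024 * 1024) // (2 * 1024)
```

A run in which every packet of one file is delivered then scores a 100% success rate, and overhead rises with every extra control bit spent per delivered data bit.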

Fig. 5. Performance Graphs for 20 Node Scenario

324

N. Martin, Y. Al-Mousa, and N. Shenoy

Analysis of results for the 20-node test scenario: Figure 5 shows the four performance graphs for the 20-node scenario. The number of senders was varied from 5 to 10 to 16; in the last case, since there were 4 data aggregation nodes, all remaining nodes (i.e., all CCs) were sending data to their respective CHs. The first graph plots success rate against the number of sending nodes. In the MMT-based framework the success rate remained 100% as the number of senders increased from 5 to 10 to 16. For AODV and OLSR the success rate was high with 5 senders but decreased as the number of senders grew: AODV dropped to 82%, whereas OLSR dropped only to 87%. The success rate for OLSR with 10 senders is lower than with 16 senders. This apparent discrepancy becomes clear from the average number of hops between sending and receiving nodes: 1.38 with 5 senders, 1.32 with 10 senders, and 1.22 with 16 senders. The first 5 senders selected were furthest from the designated destination node; the 5 senders added next were closer to it, and the final 6 senders were closer still, bringing down the average hop count and thereby raising the packet delivery success rate. Between 5 and 10 senders, however, although the average hop count dropped by 0.06, the increased traffic in the network still caused a decrease in the success rate. A similar explanation holds for the MMT framework, where the average hop count with 10 senders is lower than with 16 senders; in that case, though, the success rate was unaffected and all packets were delivered successfully. MMT and AODV show very low latency compared with OLSR.
Because of AODV's reduced success rate, fewer packets were delivered, producing a dip in the average latency at 10 sending nodes: there was less data traffic in the network, and the packets that would have taken longest never reached the destination. OLSR shows higher latency because its control traffic delays the data traffic. The MMT solution has very low overhead compared with OLSR and AODV in all three cases of 5, 10 and 16 senders. This can be attributed to MMT's local recovery from link failures, whereas OLSR must resend updated link information and AODV must rediscover routes when its cached routes become stale. A second reason is the reduced collisions and better throughput due to the BF-MAC. It is worth noting that although MMT adopts a proactive routing approach, its overhead is much lower than that of the reactive routing used in AODV, even with as few as 5 senders.

Validation of the comparison process: The reader may object that several improved variants of OLSR and AODV might have performed better than the base protocols. Note, however, that the proposed framework outperforms OLSR and AODV significantly in all performance aspects, especially for the type of surveillance applications considered in this work. This is despite the average number of hops between sending and receiving nodes in the MMT framework being significantly higher than in OLSR in all three cases of 5, 10 and 16 senders, and comparable with AODV for 10 and 16 senders (higher for 5 senders).


Fig. 6. Performance Graphs for 50 Node Scenario

Analysis of results for the 50-node test scenario: Figure 6 shows the four graphs for the 50-node scenario. The MMT-based solution continues to maintain a success rate very close to 100% as the number of senders increases to 40, at which point all CCs send to their respective CHs. OLSR and AODV show decreasing success rates, with AODV's drop at 40 senders larger than OLSR's; this can be attributed to the increased number of senders, a well-known phenomenon with reactive routing protocols. The average end-to-end packet delivery latency for OLSR is higher than for AODV, because of OLSR's higher average hop count with 20 senders and its larger number of successfully delivered packets at 40 senders. The end-to-end latency for MMT remains quite low and is comparable to that of AODV, in which 15% to 35% of the packets were not delivered. The overhead with MMT is now about 10%, compared with around 20% for OLSR and over 30% for AODV.

Analysis of results for the 100-node test scenario: Figure 7 shows the four graphs for the 100-node scenario. MMT consistently exhibits performance similar to the 20- and 50-node cases, with a slight increase in overhead and latency as the number of senders grows, while its average hop count remains greater than those of AODV and OLSR. OLSR shows a further drop in success rate compared with the 50-node scenario, due to the limitations it faces when flooding topology control messages. The AODV success rate starts at 75% and drops to 68% with 40 senders and 47.5% with 80 senders, as expected. AODV's overhead is higher than in the 50-node scenario because of the larger number of discovery messages, while OLSR maintains its overhead between 20% and 30%.


Fig. 7. Performance Graphs for 100 Node Scenario

6 Conclusion

In this paper we presented an integrated routing, clustering and MAC framework based on a meshed tree principle, in which all three operations exploit features of the meshed tree algorithm. The framework was designed specifically for airborne surveillance networks, to collect surveillance data with minimal loss and in a timely manner. We evaluated the framework against two standard protocols, OLSR and AODV, under comparable network settings in each case. The performance of the proposed solution indicates its high suitability for such surveillance applications.




Security in the Cache and Forward Architecture for the Next Generation Internet

G.C. Hadjichristofi¹, C.N. Hadjicostis¹, and D. Raychaudhuri²

¹ University of Cyprus, Cyprus
² WINLAB, Rutgers University, USA
[email protected], [email protected], [email protected]

Abstract. The future Internet architecture will be composed predominantly of wireless devices. It is evident at this stage that the TCP/IP protocol, developed decades ago, will not properly support the required network functionalities, since contemporary communication profiles tend to be data-driven rather than host-based. To address this paradigm shift in data propagation, a next generation architecture has been proposed: the Cache and Forward (CNF) architecture. This research investigates security aspects of this new Internet architecture. More specifically, we discuss content privacy, secure routing, key management and trust management. We identify security weaknesses of this architecture that need to be addressed, and derive security requirements that should guide future research directions. Aspects of this research can be adopted as a stepping stone as we build the future Internet.

Keywords: wireless networks, security, cache and forward, key management, trust management, next generation Internet.

1 Introduction

The number of wireless devices has increased exponentially in the last few years, indicating that wireless will be the key driver of future communication paradigms. This explosion of wireless devices has shifted the Internet architecture from one based mainly on wired communication to a hybrid of wired and wireless communication. Wireless devices are no longer merely the edge devices of the Internet; they are also shifting into the role of mobile routers that transmit data over multiple hops to other wireless devices. In the current Internet, TCP/IP was designed as the network protocol for transmitting information and has served the Internet well for several decades. However, wireless connections are characterized by intermittent, error-prone, and low-bandwidth connectivity, which causes TCP to fail [1]. The nature of the networking problem is therefore now different, requiring a drastic shift in the solution space and, with that, a new Internet architecture. Next-generation Internet architectures aim to shift away from TCP/IP-based communication, which assumes stable connectivity between end-hosts, and move instead to a paradigm where communication is content-driven.

M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 328–339, 2011. © Springer-Verlag Berlin Heidelberg 2011


A recently proposed next-generation Internet architecture is the Cache and Forward (CNF) architecture. Its objective is to move files or packages from source to destination over both wired and wireless hops as connectivity becomes available, i.e., to use opportunistic transport. The architecture is built on a subset of existing Internet routers and leverages the decreasing cost of memory storage. Because this architecture operates so differently, security aspects (such as key management) need to be revisited and augmented accordingly. This research is an investigation of security aspects of the future Internet architecture. We investigate ways in which the CNF architecture can provide the required security with respect to data privacy, secure routing of files at higher OSI layers (i.e., at the CNF layers), key management, and trust management. The aim of this paper is not to present complete solutions for these security areas, but rather to present the security strengths and weaknesses of the CNF architecture and to discuss possible solution scenarios, as a means of pointing out security vulnerabilities and of motivating and directing future research. Based on this discussion we extract key challenges that need to be addressed to provide a more complete system security solution. It is important to ensure that security is built into systems to allow secure and dynamic access to information. To the best of our knowledge, this is the first investigation of security issues in this architecture. Section 2 describes the CNF architecture. Section 3 provides the security analysis of the CNF architecture and extracts the security requirements for this new architecture; topics covered are content privacy, secure routing, key management, and trust management. Section 4 concludes the paper.

2 Cache and Forward Architecture

Existing and, even more so, future Internet routers will have higher processing power and storage. In the CNF architecture it is envisioned that the wired core network consists of such high-capacity routers [2]. Not all nodes in the network need high-capacity storage; we use the term CNF router to signify a router with these higher capabilities. In addition to CNF routers, the future Internet will have edge networks with access points called post offices (POs), and multi-hop wireless forwarding nodes called Cache and Carry (CNC) routers. POs are CNF routers that link mobile nodes to the wired backbone and act as post offices by holding files for mobile nodes that are disconnected or unavailable. CNC routers are mobile wireless routers with relatively smaller capacity than CNF routers. The storage cache of nodes in the CNF architecture is used to store packets in transit, as well as to offer in-network caching of popular content. The unit of transportation in the CNF architecture is a package. A package may represent an entire file, or a portion of a file when the file is very large (e.g., a couple of gigabytes); it is therefore expected that fragmentation of files will be performed by the CNF architecture. Fragmentation allows more flexibility in routing and Quality of Service (QoS), and makes data propagation more robust over single CNF hops, especially between wireless devices. In this paper we use the terms package and file interchangeably to denote the unit of transportation within the CNF architecture.
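As an illustration of the fragmentation just described, the following Python sketch (our own; this paper does not specify a package format) splits a file into fixed-size numbered packages and reassembles them even when they arrive out of order:

```python
def fragment(file_bytes: bytes, package_size: int) -> list:
    """Split a file into (sequence number, payload) packages of at most
    package_size bytes each."""
    return [(i, file_bytes[off:off + package_size])
            for i, off in enumerate(range(0, len(file_bytes), package_size))]

def reassemble(packages: list) -> bytes:
    """Rebuild the file from its packages, which may arrive out of order;
    sorting on the sequence number restores the original byte order."""
    return b"".join(payload for _, payload in sorted(packages))
```

Because each package carries its own sequence number, intermediate CNF routers can forward or cache packages independently and the destination can still reconstruct the file.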


The main service offered by this architecture is content delivery that overcomes the pitfalls of TCP/IP in the face of the intermittent connectivity that characterizes wireless networks. Files are transferred hop-by-hop in either “push” or “pull” mode, i.e., a mobile end user may request a specific piece of content, or the content provider may push the content to one (unicast) or more (multicast) end users. A performance evaluation of the CNF architecture is out of the scope of this paper and has been offered in [3]-[7]. Fig. 1 shows the concept of the CNF network. At the edge of the wired core network, POs such as CNF1 and CNF4 serve as holding and forwarding points for content (or pointers to content) intended for mobiles, which may be disconnected at times. The sender, Mobile Node 1 (MN1), forwards the file, or portions of the file, to the receiver’s PO (CNF4) using conventional point-to-point routing. CNF4 holds the file or pointer until it contacts MN3 to arrange delivery. Delivery from CNF4 could be by direct transmission if the mobile is in range, or by a series of wireless hops as determined by the routing protocol; in the latter case a CNC node, such as MN2, is used. A GUI-based demonstration of the operation of the CNF architecture, as opposed to the traditional TCP type of communication, has been developed and can be viewed at [8].
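The post-office behavior described above, holding packages for a disconnected mobile and handing them over once contact is arranged, can be sketched as follows (our own illustrative Python with a hypothetical API, not part of the CNF specification):

```python
from collections import defaultdict

class PostOffice:
    """Minimal sketch of a PO that holds packages (or pointers to them)
    for mobiles that may be disconnected."""

    def __init__(self):
        self.held = defaultdict(list)   # mobile id -> packages waiting

    def receive(self, mobile_id, package):
        # Hold the package until the mobile contacts us to arrange delivery.
        self.held[mobile_id].append(package)

    def contact(self, mobile_id):
        # The mobile is reachable, directly or via CNC hops:
        # deliver everything held for it and clear its queue.
        packages, self.held[mobile_id] = self.held[mobile_id], []
        return packages
```

In the Fig. 1 scenario, CNF4 would play this role for MN3, with MN2 acting as the CNC relay when MN3 is out of direct range.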

Fig. 1. The Cache and Forward Architecture (CNF: Cache and Forward; CNC: Cache and Carry; MN: Mobile Node). The sender MN1 reaches its post office CNF1; core routers CNF2 and CNF3 forward content towards the receiver's post office CNF4, which delivers it to the receiver MN3, possibly via the CNC node MN2.

The CNF architecture operates above the IP layer, and its operation is supported by a series of protocols that handle the propagation of packages (see Fig. 2). The role of the various CNF layers is similar to that of the OSI layer stack, but their scope is different, as they focus on the handling of packages. The CNF Transport Protocol (CNF TP) is responsible for sending content queries and receiving content; content fragmentation and reassembly are also implemented here. It also performs content error checking and maintains a content cache. The CNF Network Protocol (CNF NP) is responsible for content discovery and for routing content towards the destination once it has been located in the network. The CNF Link Protocol (CNF LP) is designed to reliably deliver the IP packets of a package to the next hop. The control plane of the CNF architecture is supported by three protocols. The routing protocol is responsible for establishing routing paths across CNF routers. The Content Name Resolution Service (CNRS) provides an indexing mechanism to map a content ID to multiple locations of that content. The location closest to the client can be chosen for content


retrieval. The Cache Management Protocol (CMP) is used to facilitate content discovery by maintaining and updating a summary cache containing all the contents that are cached within an autonomous system (AS). Nodes within an AS update the summary cache, and adjacent AS gateways exchange summary cache information.

Fig. 2. The Cache and Forward protocol stack (data plane: CNF TP, CNF NP, and CNF LP over 802.11/802.3 (IP and MAC) and the physical layer (RF); control plane: CNRS, routing protocol, and cache management protocol)
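The CNRS lookup just described, mapping a content ID to the multiple locations holding it and choosing the one closest to the client, can be sketched as follows (an illustrative Python sketch with hypothetical names; "closeness" is abstracted as a hop count supplied by the caller):

```python
class CNRS:
    """Illustrative sketch of the Content Name Resolution Service index."""

    def __init__(self):
        self.index = {}   # content id -> set of router ids holding a copy

    def register(self, content_id, router_id):
        """Record that router_id caches a copy of content_id."""
        self.index.setdefault(content_id, set()).add(router_id)

    def resolve(self, content_id, hop_count):
        """Return the registered location with the fewest hops to the client,
        where hop_count maps router id -> hops from the client;
        None if the content is unknown."""
        locations = self.index.get(content_id, set())
        return min(locations, key=lambda r: hop_count[r], default=None)
```

A client-side resolver would consult this index and then have the CNF NP route the request to the returned location.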

3 Security Analysis

In this section we look at security in the CNF architecture. Before proceeding with the security analysis, we describe some key aspects of the structure of the CNF architecture that are considered in our analysis. The objective of the CNF architecture is to overcome intermittent connectivity over wireless links and facilitate communication among a multitude of wireless devices. We thus envision the existing wired Internet as a sphere surrounded by an increasing number of wireless devices, as shown in Fig. 3. Outer layers represent wireless devices at different numbers of hops away from the wired Internet. For a specific flow, a specific number of outer layers, or hops, can represent the number of wireless routers required to provide service to nodes (see Fig. 3). This figure emphasizes that communication may come in different forms over this spherical representation of the future Internet. We classify the communication patterns of the CNF architecture into three generic variations: (1) communication strictly within the wireless Internet or strictly within the wired Internet; (2) communication from the wireless Internet to a node in the wired Internet, and vice versa; and (3) communication that links two wireless parts of the Internet through the existing wired infrastructure. The third pattern poses more of a challenge in this architecture than the first two, as it is more dynamic due to changes in connectivity. Furthermore, wireless nodes may move and connect to the Internet through different CNC routers that may belong to different, not even collocated, POs. The communication patterns therefore change dynamically as connectivity to POs and other wireless devices varies.
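The three communication patterns can be captured in a small classifier, sketched below in Python (our own illustrative encoding; the flag marking wireless-to-wireless flows that cross the wired core is an assumption):

```python
def communication_pattern(src_wired: bool, dst_wired: bool,
                          crosses_wired_core: bool = False) -> int:
    """Classify a flow into the three generic CNF communication patterns."""
    if src_wired and dst_wired:
        return 1          # strictly within the wired Internet
    if not src_wired and not dst_wired:
        # Two wireless parts linked through the wired core is pattern 3;
        # a flow staying strictly in the wireless Internet is pattern 1.
        return 3 if crosses_wired_core else 1
    return 2              # between the wireless and the wired Internet
```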
This architecture uses content caching, which introduces more complexity as the communication patterns for acquiring a specific content can vary with time due to the dynamic and changing number of CNF routers holding that specific file. Caching is vital in this architecture as it can decrease the bandwidth utilization in the Internet by increasing content availability. During content propagation over the Internet, CNF


routers may save a copy of a file prior to transmission, and nodes may obtain data from a cache that is closer than the original source. Although an investigation of cache optimization over the CNF architecture is offered in [3], security aspects were not taken into account there.

Fig. 3. Layering in the CNF architecture. Multiple layers (1 to n) representing intermediate wireless routers can exist around the wired core, with the end-user layer outermost; node types shown include end-users, traditional routers, CNF routers, post offices, and CNC routers.

In this research, we assume that the CNF architecture provides a methodology to name content. The development of supporting mechanisms for such a service has not been addressed, and no security issues have been taken into account.

3.1 Privacy and File Propagation

One of the key aspects of data propagation is that entire files can be located on CNF routers. This functionality enables the caching of content at specific CNF locations while in transit. In terms of security, this provides the benefit of stopping malicious activity early in the transmission stage, by having the CNF router check the content and validate that it is not a virus or spam. Furthermore, it can counteract attacks whose aim is to overload the network and consume its bandwidth, e.g., by checking for repeated transmissions of the same content. However, it concurrently enables access to sensitive content, since entire files reside on routers; it thus breaches the privacy policies for the specific file. Caching over this architecture further complicates privacy, as it increases the exposure of sensitive content. A CNF router can be dynamically configured, via a set of security policies, to execute selective caching and avoid caching sensitive content. However, there is a need to verify that such security policies have been correctly executed, i.e., that a file has not been cached. Such verification is complicated, as a control mechanism needs to be in place across CNF routers to check for possible propagation of specific content that should have been deleted by the routers. Furthermore, such a selective mechanism provides privacy from the perspective of


limiting the disclosure of information by limiting the number of cached copies of sensitive files. However, it does not prevent CNF routers from viewing the content prior to transmission. There thus remains the issue of how privacy can be guaranteed while harnessing the advantages of spam control and virus detection, which are inherent security strengths of the CNF architecture. To promote privacy, cryptographic methods need to be utilized so that the propagation of entire files from one CNF router to the next minimizes the disclosure of information. Typically, cryptography at the user level, i.e., between two end users, can hinder the disclosure of information and provide privacy. However, Internet users do not tend to utilize cryptographic methods, for a variety of reasons: lack of knowledge regarding security mechanisms, the tacit belief that the Internet is safe, or carelessness in handling their company’s data. Regardless, at the Internet architecture level, end-to-end cryptography is not desirable because it removes the benefits that can be obtained through this architecture in terms of spam and virus control: CNF routers can no longer analyze the content of encrypted files. A simple way around this is to check the content for spam or viruses prior to transmission and then encrypt it; more specifically, to have the CNF protocol that assigns content identifiers to files also verify the content and sign it with a public key signature certifying that it is spam- and virus-free. The assigned content identifier can also be checked by CNF routers for replay attacks that aim to consume network bandwidth. This approach can be a good preliminary solution, but it does not account for the vast amount of data generated every second in the Internet.
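The check-then-sign idea can be sketched as follows. This is our own illustrative Python under stated assumptions: the paper does not fix algorithms, so an HMAC tag under a key assumed to be distributed among CNF routers stands in for the public-key signature, and a trivial substring test stands in for a real spam/virus scanner. The ingress router scans the content once and binds a "scanned clean" attestation to the content identifier; downstream routers verify the attestation instead of re-scanning, and reject replayed identifiers.

```python
import hashlib, hmac

ATTESTATION_KEY = b"shared-cnf-key"   # assumption: distributed by the KMS

def content_id(content: bytes) -> str:
    """Content identifier derived from the content itself."""
    return hashlib.sha256(content).hexdigest()

def attest_clean(content: bytes) -> str:
    """Run the (placeholder) spam/virus check, then sign the content id."""
    assert b"SPAM" not in content     # stand-in for a real scanner
    return hmac.new(ATTESTATION_KEY, content_id(content).encode(),
                    hashlib.sha256).hexdigest()

def verify_attestation(content: bytes, tag: str, seen_ids: set) -> bool:
    """Downstream check: the tag must be valid and the identifier must
    not have been seen before (a crude replay defence)."""
    cid = content_id(content)
    expected = hmac.new(ATTESTATION_KEY, cid.encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, tag) or cid in seen_ids:
        return False
    seen_ids.add(cid)
    return True
```

The content can then be encrypted after attestation, so intermediate routers get the spam/virus guarantee without ever seeing the plaintext.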
It is virtually impossible to have a service verify the validity of content prior to assigning a content identifier, because vast amounts of data are generated at all possible locations of the Internet, from all types of nodes and networks. A 2008 International Data Corporation analysis estimated that 281 exabytes of data were produced in 2007, equivalent to 281 trillion digitized novels [9]. The challenge, therefore, is to have the distributed service that provides globally certifiable content identifiers check the content and guarantee virus- and spam-free content. A variation of the above solution would be to assign the responsibility of checking content to the CNF routers: CNF routers close to the source (i.e., at the initial steps of the data propagation) can check the data for spam and then carry out encryption and provide privacy. This methodology of handling content provides spam- and virus-free content as intended by this architecture. However, relating these mechanisms back to the original scope of this architecture, encrypting the files to provide confidentiality across CNF routers hinders caching. To provide privacy, content is encrypted at the first CNF router close to the source and decrypted by the CNF router closest to the destination. When symmetric key cryptography is used between these two CNF routers, an encrypted file cached at any intermediate CNF router cannot be accessed: the file can only be decrypted by the two CNF routers that hold the corresponding key, i.e., the key originally used to facilitate the communication. Thus, even though privacy is provided, caching is infeasible. Caching on intermediate CNF routers along the routing path cannot work, because an intermediate CNF router would need the symmetric key used by the two CNF routers in order to redistribute the encrypted cached file. Utilizing public keys still does not address this issue.
A package encrypted with the public key of the CNF router will provide privacy of the content as only the CNF router with the


corresponding private key can decrypt the content; thus, a file en route cannot be cached and redistributed as needed. It is evident that more complex content distribution schemes are needed among CNF routers to balance the requirements of privacy and caching, while preserving the inherent security features (such as spam or virus control) of this architecture. Such schemes should strategically cache files on intermediary CNF routers while allowing those routers to decrypt and redistribute files to multiple parties. To achieve this, symmetric keys could be shared among dynamically selected groups of CNF routers. These group selections need to be integrated with caching algorithms that optimize content availability, and the selection of groups of CNF routers to handle specific content needs to comply with the privacy requirements of that content. This is a topic of future work for this architecture. Summarizing: content in the CNF architecture must be handled at the initial stages of transmission to provide spam- and virus-free content, which can be done by checking the content when a content identifier is issued or at the first CNF router close to the source; privacy must be revisited with caching in mind, so that strategically chosen CNF routers can increase availability while complying with the content's security requirements; and the key management system should provide key distribution methodologies that support dynamic group keys for dynamically formed groups of CNF routers based on traffic patterns.

3.2 Secure Routing over the CNF Architecture

Our focus in terms of secure routing is to look at ways in which the selection of secure paths for the propagation of packages can be guided by the internal components of the CNF architecture.
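One way to realize such content-guided secure path selection is a shortest-path search over the CNF overlay that first prunes routers failing the content's security policy. The Python sketch below is our own illustration; the policy fields (minimum trust, banned regions) and all names are hypothetical.

```python
import heapq

def secure_path(overlay, src, dst, trust, region, policy):
    """Dijkstra over the CNF overlay, restricted to routers satisfying the
    content's policy: trust[r] >= policy['min_trust'] and region[r] not in
    policy['banned_regions'].  overlay maps router -> {neighbor: cost}."""
    def allowed(r):
        return (trust[r] >= policy["min_trust"]
                and region[r] not in policy["banned_regions"])

    dist = {src: 0}
    heap = [(0, src, [src])]
    while heap:
        d, node, path = heapq.heappop(heap)
        if node == dst:
            return path                   # cheapest policy-compliant path
        for nbr, cost in overlay[node].items():
            nd = d + cost
            if allowed(nbr) and nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr, path + [nbr]))
    return None                           # no path satisfies the policy
```

Time-bounded propagation and deletion deadlines could be folded into the same framework as additional per-package constraints checked at each hop.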
Even though packetization of data in the future Internet may still be handled by the IP protocol, the CNF architecture operates at higher layers, as shown in Fig. 2, through the CNF TP, CNF NP, and CNF LP. The CNF architecture deals with the propagation of data files or packages between CNF routers; the CNF routers thus form an overlay network whose topology differs from the actual connectivity of the underlying routers. What needs to be investigated for secure routing at the CNF layers is the security requirements of the content. The CNF architecture needs to support content-based security classification. More specifically, information may be restricted in how, when, and where it is propagated. Some content may need to traverse the Internet through specific routes: certain locations in the world may have more malicious activity than others, or specific data should not be disclosed to certain areas. Data may also need to propagate within specific time boundaries, or within specific periods after which it must be deleted. Moreover, exposure requirements in terms of visibility may differ by content. These content security requirements create a complex problem, as they must be taken into account while optimizing caching and addressing other functional aspects of the CNF architecture such as QoS. Another aspect of secure routing is trust management. Trust can be used to guide the selection of routing paths for packages. Over the years, several methods have been developed that enable the dynamic assessment of the trustworthiness of nodes. This assessment is done by observing certain functionalities,

Security in the CNF Architecture for the Next Generation Internet

335

such as packet forwarding, and by allowing routers to check and balance one another. CNF routers can be graded for trust at the IP layer by counting IP packet forwarding (or, generally, packet forwarding below the CNF layers). However, not every router on the Internet will have CNF capabilities, which implies that non-CNF routers within one-hop connectivity of CNF routers need to evaluate CNF routers. This mechanism requires communication between the overlay of CNF routers and the non-CNF routers, since reports need to be communicated to, and made believable by, the CNF routers. If such integration is not feasible, then reporting will need to be executed among CNF routers. In this case, reputation in the CNF overlay will provide granularity at the CNF layer, meaning that assessments have to be made in terms of packages and not IP packets (or lower-layer packets). More specifically, the control mechanism will need to exist among CNF routers. The checks executed to evaluate trust can be extended beyond forwarded packages. They can cover the correctness of the applied cryptographic algorithms, the correctness of the secure routing operation at the CNF layers, and the compliance with the security policies regarding content requirements. If there is an overwhelmingly large amount of content to be handled, the checks and balances may be confined to randomly chosen packages or to packages with highly sensitive content.
Summarizing, there is a need to classify content based on its security and functional requirements, such as QoS, to effectively execute secure path selection over the CNF overlay. Trustworthiness information derived from internal characteristics of the operation of the CNF architecture can further guide routing decisions among CNF routers.

3.3 Key and Trust Management

A Key Management System (KMS) creates, distributes, and manages the identification credentials used during authentication.
The role of key management is important as it provides a means of user authentication. Authentication is needed for two main reasons: accountability and continuity of communication. Knowing a user's identity during a communication provides an entity to hold accountable for any malicious activity or for any untrustworthy information shared. In addition, it provides a link that can be utilized to continue communication in the future. This continuity enables the communicating parties to assess and build trust towards one another based on their shared experiences of collaboration. Thus, the verification of identity is linked with authentication and provides accountability, whereas behavior grading is linked with continuity of communication. These aspects have traditionally been treated separately, through key management and trust management, respectively. The authors in [10] argue that in social networks, as trust develops it takes the form of an identity (i.e., identity-based trust), since two peers know each other well and know what to expect from one another. Thus, it is important for the future Internet that emphasis is placed on the link between the two areas. Verifying an identity's credentials does not and should not imply that a peer is trustworthy, as trust is a quantity that typically varies dynamically over time. Authentication implies some initial level of trust, but individual behavior should dynamically adjust that initial trust level. Therefore, trust management needs to be taken into account so as to assess the reliability of authenticated individuals.
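One standard way to realize this coupling, sketched here purely as an illustration and not as part of the CNF design, is a beta-reputation estimate: authentication grants a neutral prior, and each observed interaction (e.g., a correctly forwarded package or an honest exchange) moves the trust level up or down.

```python
class TrustGrade:
    """Beta-reputation trust estimate for an authenticated peer."""

    def __init__(self):
        # Beta(1, 1) prior: authentication alone yields a neutral 0.5 score.
        self.good = 1.0
        self.bad = 1.0

    def observe(self, behaved_well: bool):
        """Record the outcome of one interaction."""
        if behaved_well:
            self.good += 1
        else:
            self.bad += 1

    @property
    def score(self) -> float:
        """Expected probability of good behavior, always in (0, 1)."""
        return self.good / (self.good + self.bad)

g = TrustGrade()
assert g.score == 0.5                 # initial trust from authentication only
for outcome in [True, True, True, False]:
    g.observe(outcome)
assert abs(g.score - 4 / 6) < 1e-9    # trust adjusted by observed behavior
```

The prior keeps the estimate bounded away from 0 and 1, so neither a single good nor a single bad interaction fully determines a peer's trustworthiness.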

336

G.C. Hadjichristofi, C.N. Hadjicostis, and D. Raychaudhuri

In the CNF architecture, global user authentication is required for the multitude of wireless nodes that dynamically connect to the Internet via POs. Thus, in our description of trust and key management, we focus on the wireless Internet that the CNF architecture aims to accommodate. More specifically, wireless nodes need to have an identity that is believable across multiple POs, enabling secure communication between wireless nodes that may not be in the same wireless network. POs have a vital role in terms of key management because they are the last link on the wired Internet and are responsible for package delivery to MNs. Their location is empowering, as it enables them to link wireless nodes to the rest of the Internet. POs deliver the data to wireless nodes by transmitting the files using opportunistic connectivity. Thus, they can act as delegated certificate authorities in a public key infrastructure, verifying the identity of nodes and, overall, managing the certificates of wireless nodes. Their location is also key, as it can facilitate the integration with trust management.
Until now, there has been no global methodology for quantifying trust. Such a global metric demands an Internet architecture that enables the extraction of this information dynamically at a global level. One of the main differences between the existing Internet architecture and the CNF architecture is the paradigm of communication: in the CNF architecture, the emphasis is placed on content. Another difference is the placement of POs on the edges of the wired Internet. Using these structural characteristics, the CNF architecture can provide a base on top of which to address trust and key management. More specifically, the coupling of identification and trustworthiness can be achieved by utilizing the aforementioned two characteristics of the CNF architecture: (1) the content of the data and (2) the location of POs.
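To make this coupling concrete, the toy sketch below grades a node from (1) the categories of content it acquires and (2) a location-risk factor for the serving PO. The category names, their weights, and the scoring formula are all invented for illustration; choosing meaningful categories and metrics is precisely the open problem this section raises.

```python
# Assumed per-category benign-ness weights (hypothetical values in [0, 1]).
CATEGORY_WEIGHT = {"movies": 0.9, "documents": 0.8, "explosives": 0.1}

def trust_marking(downloads, po_location_risk):
    """Grade a wireless node: average category weight of its downloads,
    discounted by the risk (0..1) associated with the PO's location."""
    if not downloads:
        return 0.5  # neutral default for a node with no history
    content_score = sum(CATEGORY_WEIGHT.get(c, 0.5) for c in downloads) / len(downloads)
    return content_score * (1.0 - po_location_risk)

benign = trust_marking(["movies", "documents"], po_location_risk=0.1)
risky = trust_marking(["explosives", "explosives"], po_location_risk=0.4)
assert benign > risky
```

Such a marking could be embedded in the certificates a PO publishes, so that it accompanies the node's identity during authentication.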
Since the CNF architecture is content driven, the content can provide some form of characterization of an identity, so as to better understand the trust that can be placed on a user. For example, an entity that downloads movies can be placed on a different trust level compared to an entity that downloads documents about explosives. Utilizing such functionality to provide trust for this architecture requires the careful marking of content to indicate certain categories of data classifications that can characterize trust. In addition to this classification, the location of POs within the CNF architecture can further assist in assessing trust. Nowadays, certain areas in the world suffer from higher crime rates compared to others, so there is an increasing possibility that Internet activity reflects that behavior. Based on this conjecture, one can posit that if sensitive data are acquired by POs at specific locations, the level of trust must be adjusted taking the locations of those POs into account as well. Trust metrics based on PO location can be carefully selected to reflect malicious activity in those locations, and those values would have to be monitored and dynamically adjusted over time. (Note that IP address assignment is handled by IANA, and therefore the location of POs can be obtained with some level of accuracy.)
Utilizing the above architectural characteristics allows POs to manage certificates for all the wireless nodes that they service. Based on the specific data that nodes acquire, POs can introduce some form of trust marking on the certificates that they publish, characterizing the behavior of wireless nodes. That information can guide the decision of SA establishment during authentication. Some other criteria that the POs can record are the type of node (e.g., visiting, duration of visit) and activity in

Security in the CNF Architecture for the Next Generation Internet

337

bytes downloaded. In addition, those decisions of trust marking could be further guided by metrics of behavior obtained from local reputation mechanisms within a wireless network (e.g., whether wireless nodes collaborate with their peers by forwarding packets). Overall, this trustworthiness information can guide future interactions among nodes. An overview of previous work on extracting trust information in mobile wireless environments is offered in [11].
At the global scale, there is also a need to grade and monitor the trustworthiness of POs that act on behalf of the wireless network. Similar to grading nodes in the wireless network, the type of content and the location of the POs can be considered. However, to grade a PO, the paths of data content to/from that PO can be considered. A distributed mechanism can be introduced in the CNF overlay that uses the CNF architecture to mark the flows for trustworthiness grading of the POs in the Internet. Fig. 4 demonstrates the various flows that may exist: the POs form a circle at the edge of the wired Internet, and data paths flow in multiple directions.

Fig. 4. Directions of flows in the CNF architecture
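One way the flow marking of Fig. 4 could feed a PO grade, sketched here under assumed aggregation rules rather than as a mechanism defined by the CNF architecture, is a recency-weighted average of the trust grades reported by CNF routers along each flow:

```python
def grade_po(flow_grades, decay=0.9):
    """Aggregate per-flow trust grades for a PO. Later entries in the
    list are more recent and therefore weighted more heavily."""
    if not flow_grades:
        return 0.5  # neutral default when no flows have been observed
    weights = [decay ** (len(flow_grades) - 1 - i) for i in range(len(flow_grades))]
    return sum(w * g for w, g in zip(weights, flow_grades)) / sum(weights)

# A PO whose recent flows were graded poorly scores below one that improved.
declining = grade_po([0.9, 0.9, 0.2, 0.1])
improving = grade_po([0.1, 0.2, 0.9, 0.9])
assert improving > declining
```

The decay factor keeps the grade responsive to recent behavior, which matters because, as discussed above, trust varies dynamically over time.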

Based on the integration with trust information, a KMS can provide trust criteria for authenticated users at a global scale. One aspect that needs to be addressed is whether one can predict the future behavior of nodes based on their existing trust level. This need for behavior prediction opens up the question of whether trust can be modeled at a global scale and how accurately it can be assessed dynamically. An issue that also needs to be considered is the notion of positive grading for trust. For example, downloading information about helping other human beings or the environment does not necessarily indicate a trustworthy destination. If that aspect is used to improve the trust of a destination, then the issue that needs to be resolved is that certain data may be requested simply to trick the grading mechanism into improving one's trustworthiness in the Internet environment. Another aspect that needs to be considered is the frequency with which certain data types get directed to specific POs: acquiring one file about making explosives may not be the same as acquiring a hundred such documents. In terms of the type of information, it is very important that the categories that characterize trust are carefully selected. If a person acquires medical documents about


diseases, and someone else requests physics-related information, that does not necessarily translate to a specific level of trust. There is a need to investigate how the type of data or content can represent the trust placed on nodes and, indirectly, on users. In addition, there is a need to link different types of content that may indicate certain trends of behavior. Such a classification is a complex issue, as it is rooted in the complexity of human societal behavioral patterns.
Summarizing, POs in the CNF architecture can provide identification credentials to the multitude of wireless devices and link them to the Internet. Identity verification coupled with behavior at the local and global scale can guide the trust placed on interactions among peers. Further research is required to come up with meaningful ways of assessing trust based on the content-driven paradigm of communication of the CNF architecture.

4 Conclusions

In this research, we have looked at security aspects of the CNF architecture, identified strengths and weaknesses, and derived security requirements that need to be taken into account. We have discussed possible security solutions for data privacy, secure routing, key management, and trust management, while concurrently identifying future issues that need to be addressed. Even though the CNF architecture has functional benefits in overcoming the limitation of intermittent connectivity in the now predominantly wireless Internet, it also introduces security challenges. A balance needs to be struck between data content privacy, caching, and secure routing. We need to ensure content privacy while taking advantage of security features that naturally emerge from the CNF architecture, such as spam control, virus control, and protection against related attacks. In addition, since the architecture is content driven, there is a need to define content security requirements and to route based on those requirements. However, for secure routing to exist, there is also a need to assess the trustworthiness of CNF routers, which requires additional mechanisms to verify the correct operation of CNF routers. The need for trustworthiness implies the existence of authentication, as it provides the base on which to build trust. Authentication of the multitude of wireless devices in the CNF architecture may be facilitated by a key management system operated by the POs. Their location enables the integration of key management with trustworthiness information. Since the CNF architecture is content-driven, trust can be extracted by examining the content that flows through POs. This brings the challenge of coming up with meaningful ways of evaluating trust at a global scale to match the requirements of users or applications. These issues need to be carefully assessed in the future, keeping other aspects, such as QoS, in mind as well.
Overall, this analysis has brought to light the security requirements and open research issues that exist in the CNF architecture. Aspects of this investigation can serve as guidance for the design of secure future Internet architectures.
Acknowledgments. George Hadjichristofi was supported by the Cyprus Research Promotion Foundation under grant agreement (ΤΠΕ/ΠΛΗΡΟ/0308(ΒΕ)/10). Christoforos Hadjicostis would also like to acknowledge funding from the European


Commission’s Seventh Framework Programme (FP7/2007-2013) under grant agreements INFSO-ICT-223844 and PIRG02-GA-2007-224877. Any opinions, f